30 Apr, 2008

40 commits

  • Fix a bug that Werner Baumann reported: fuse can send a bigger write request
    than the maximum specified. This only affected direct_io operation.

    In addition set a sane minimum for the max_read and max_write tunables, so I/O
    always makes some progress.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • If the READ request returned a short count, then either

    - cached size is incorrect
    - filesystem is buggy, as short reads are only allowed on EOF

    So assume that the size is wrong and refresh it, so that cached read() doesn't
    zero fill the missing chunk.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Introduce fuse_perform_write. With fusexmp (a passthrough filesystem), large
    (1MB) writes into a backing tmpfs filesystem are sped up by almost 4 times
    (256MB/s vs 71MB/s).

    [mszeredi@suse.cz]:

    - split into smaller functions
    - testing
    - duplicate generic_file_aio_write(), so that there's no need to add a
    new ->perform_write() a_op. Comment from hch.

    Signed-off-by: Nick Piggin
    Signed-off-by: Miklos Szeredi
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Extract common code for setting i_size in write functions into a common
    helper.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Quoting Linus (3 years ago, FUSE inclusion discussions):

    "User-space filesystems are hard to get right. I'd claim that they
    are almost impossible, unless you limit them somehow (shared
    writable mappings are the nastiest part - if you don't have those,
    you can reasonably limit your problems by limiting the number of
    dirty pages you accept through normal "write()" calls)."

    Instead of attempting the impossible, I've just waited for the dirty page
    accounting infrastructure to materialize (thanks to Peter Zijlstra and
    others). This nicely solved the biggest problem: limiting the number of pages
    used for write caching.

    Some small details remained, however, which this largish patch attempts to
    address. It provides a page writeback implementation for fuse, which is
    completely safe against VM related deadlocks. Performance may not be very
    good for certain usage patterns, but generally it should be acceptable.

    It has been tested extensively with fsx-linux and bash-shared-mapping.

    Fuse page writeback design
    --------------------------

    fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
    It copies the contents of the original page, and queues a WRITE request to the
    userspace filesystem using this temp page.

    The writeback is finished instantly from the MM's point of view: the page is
    removed from the radix trees, and the PageDirty and PageWriteback flags are
    cleared.

    For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
    incremented. The per-bdi writeback count is not decremented until the actual
    write completes.

    On dirtying the page, fuse waits for a previous write to finish before
    proceeding. This makes sure, there can only be one temporary page used at a
    time for one cached page.

    This approach is wasteful in both memory and CPU bandwidth, so why is this
    complication needed?

    The basic problem is that there can be no guarantee about the time in which
    the userspace filesystem will complete a write. It may be buggy or even
    malicious, and fail to complete WRITE requests. We don't want unrelated parts
    of the system to grind to a halt in such cases.

    Also a filesystem may need additional resources (particularly memory) to
    complete a WRITE request. There's a great danger of a deadlock if that
    allocation may wait for the writepage to finish.

    Currently there are several cases where the kernel can block on page
    writeback:

    - allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
    - page migration
    - throttle_vm_writeout (through NR_WRITEBACK)
    - sync(2)

    Of course in some cases (fsync, msync) we explicitly want to allow blocking.
    So for these cases new code has to be added to fuse, since the VM is not
    tracking writeback pages for us any more.

    As an extra safetly measure, the maximum dirty ratio allocated to a single
    fuse filesystem is set to 1% by default. This way one (or several) buggy or
    malicious fuse filesystems cannot slow down the rest of the system by hogging
    dirty memory.

    With appropriate privileges, this limit can be raised through
    '/sys/class/bdi//max_ratio'.

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • A few fields in /proc/meminfo were not documented. Fix.

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Fuse will use temporary buffers to write back dirty data from memory mappings
    (normal writes are done synchronously). This is needed, because there cannot
    be any guarantee about the time in which a write will complete.

    By using temporary buffers, from the MM's point if view the page is written
    back immediately. If the writeout was due to memory pressure, this
    effectively migrates data from a full zone to a less full zone.

    This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used
    as temporary buffers.

    [Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP]
    Signed-off-by: Miklos Szeredi
    Cc: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Fuse needs this for writable mmap support.

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
    set, then don't update the per-bdi writeback stats from
    test_set_page_writeback() and test_clear_page_writeback().

    Misc cleanups:

    - convert bdi_cap_writeback_dirty() and friends to static inline functions
    - create a flag that includes all three dirty/writeback related flags,
    since almst all users will want to have them toghether

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Move BDI statistics to debugfs:

    /sys/kernel/debug/bdi//stats

    Use postcore_initcall() to initialize the sysfs class and debugfs,
    because debugfs is initialized in core_initcall().

    Update descriptions in ABI documentation.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Add "max_ratio" to /sys/class/bdi. This indicates the maximum percentage of
    the global dirty threshold allocated to this bdi.

    [mszeredi@suse.cz]

    - fix parsing in max_ratio_store().
    - export bdi_set_max_ratio() to modules
    - limit bdi_dirty with bdi->max_ratio
    - document new sysfs attribute

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Under normal circumstances each device is given a part of the total write-back
    cache that relates to its current avg writeout speed in relation to the other
    devices.

    min_ratio - allows one to assign a minimum portion of the write-back cache to
    a particular device. This is useful in situations where you might want to
    provide a minimum QoS. (One request for this feature came from flash based
    storage people who wanted to avoid writing out at all costs - they of course
    needed some pdflush hacks as well)

    max_ratio - allows one to assign a maximum portion of the dirty limit to a
    particular device. This is useful in situations where you want to avoid one
    device taking all or most of the write-back cache. Eg. an NFS mount that is
    prone to get stuck, or a FUSE mount which you don't trust to play fair.

    Add "min_ratio" to /sys/class/bdi. This indicates the minimum percentage of
    the global dirty threshold allocated to this bdi.

    [mszeredi@suse.cz]

    - fix parsing in min_ratio_store()
    - document new sysfs attribute

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Register FUSE's backing_dev_info under sysfs with the name "fuse-MAJOR:MINOR"

    Make the fuse control filesystem use s_dev instead of a fuse specific ID.
    This makes it easier to match directories under /sys/fs/fuse/connections/ with
    directories under /sys/class/bdi, and with actual mounts.

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Register NFS' backing_dev_info under sysfs with the name "nfs-MAJOR:MINOR"

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object.
    This allows us to see and set the various BDI specific variables.

    In particular this properly exposes the read-ahead window for all relevant
    users and /sys/block//queue/read_ahead_kb should be deprecated.

    With patient help from Kay Sievers and Greg KH

    [mszeredi@suse.cz]

    - split off NFS and FUSE changes into separate patches
    - document new sysfs attributes under Documentation/ABI
    - do bdi_class_init as a core_initcall, otherwise the "default" BDI
    won't be initialized
    - remove bdi_init_fmt macro, it's not used very much

    [akpm@linux-foundation.org: fix ia64 warning]
    Signed-off-by: Peter Zijlstra
    Cc: Kay Sievers
    Acked-by: Greg KH
    Cc: Trond Myklebust
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • These values represent the nesting level of a namespace and pids living in it,
    and it's always non-negative.

    Turning this from int to unsigned int saves some space in pid.c (11 bytes on
    x86 and 64 on ia64) by letting the compiler optimize the pid_nr_ns a bit.
    E.g. on ia64 this removes the sign extension calls, which compiler adds to
    optimize access to pid->nubers[ns->level].

    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • With the needlessly global marker_debug being static gcc can optimize the
    unused code away.

    Signed-off-by: Adrian Bunk
    Acked-by: Mathieu Desnoyers
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • 1. sys_getpgid() needs rcu_read_lock() to derive the pgrp _nr, even if
    the task is current, otherwise we can race with another thread which
    does sys_setpgid().

    2. Use rcu_read_lock() instead of tasklist_lock when pid != 0, make sure
    that we don't use the NULL pid if the task exits right after successful
    find_task_by_vpid().

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • 1. sys_getsid() needs rcu_read_lock() to derive the session _nr, even if
    the task is current, otherwise we can race with another thread which
    does sys_setsid().

    2. The task can exit between find_task_by_vpid() and task_session_vnr(),
    in that unlikely case sys_getsid() returns 0 instead of -ESRCH.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Use change_pid() instead of detach_pid() + attach_pid() in
    __set_special_pids().

    This way task_session() is not NULL in between.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Use change_pid() instead of detach_pid() + attach_pid() in sys_setpgid().

    This way task_pgrp() is not NULL in between.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Based on Eric W. Biederman's idea.

    Without tasklist_lock held task_session()/task_pgrp() can return NULL if the
    caller races with setprgp()/setsid() which does detach_pid() + attach_pid().
    This can happen even if task == current.

    Intoduce the new helper, change_pid(), which should be used instead. This way
    the caller always sees the special pid != NULL, either old or new.

    Also change the prototype of attach_pid(), it always returns 0 and nobody
    check the returned value.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Based on Eric W. Biederman's idea.

    Unless task == current, without tasklist_lock held task_session()/task_pgrp()
    can return NULL if the caller races with de_thread() which switches the group
    leader.

    Change transfer_pid() to not clear old->pids[type].pid for the old leader.
    This means that its .pid can point to "nowhere", but this is already true for
    sub-threads, and the old leader is not group_leader() any longer. IOW, with
    or without this change we can't trust task's special pids unless it is the
    group leader.

    With this change the following code

    rcu_read_lock();
    task = find_task_by_xxx();
    do_something(task_pgrp(task), task_session(task));
    rcu_read_unlock();

    can't race with exec and hit the NULL pid.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • There are some places that are known to operate on tasks'
    global pids only:

    * the rest_init() call (called on boot)
    * the kgdb's getthread
    * the create_kthread() (since the kthread is run in init ns)

    So use the find_task_by_pid_ns(..., &init_pid_ns) there
    and schedule the find_task_by_pid for removal.

    [sukadev@us.ibm.com: Fix warning in kernel/pid.c]
    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • The pid to lookup a task by is passed inside taskstats code via genetlink
    message.

    Since netlink packets are now processed in the context of the sending task,
    this is correct to lookup the task with find_task_by_vpid() here.

    Besides, I fix the call to fill_pid() from taskstats_exit(), since the
    tsk->pid is not required in fill_pid() in this case, and the pid field on
    task_struct is going to be deprecated as well.

    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: Jay Lan
    Cc: Jonathan Lim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • The callers of free_pidmap() pass 2 members of "struct upid", we can just
    pass "struct upid *" instead. Shaves off 10 bytes from pid.o.

    Also, simplify the alloc_pid's "out_free:" error path a little bit. This
    way it looks more clear which subset of pid->numbers[] we are freeing.

    Signed-off-by: Oleg Nesterov
    Cc: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Cc :Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Factor out the code used to allocate/free a pts index into new interfaces,
    devpts_new_index() and devpts_kill_index(). This localizes the external data
    structures used in managing the pts indices.

    [akpm@linux-foundation.org: undo accidental mutex2sem conversion]
    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Serge Hallyn
    Signed-off-by: Matt Helsley
    Acked-by: H. Peter Anvin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Have ptmx_open() propagate any error code returned by devpts_pty_new()
    (which returns either 0 or -ENOMEM anyway).

    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Serge Hallyn
    Acked-by: H. Peter Anvin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • At ptmx_open(), the 2nd parameter for check_tty_count() should
    be "ptmx_open".

    Signed-off-by: Hiroshi Shimamoto
    Acked-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hiroshi Shimamoto
     
  • Simple search/replace except for synclink.c where I noticed a real bug and
    fixed it too. It was doing NULL + offset, then checking for NULL if the remap
    failed.

    Signed-off-by: Alan Cox
    Cc: Paul Fulghum
    Acked-by: Jiri Slaby
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • Signed-off-by: Alan Cox

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • Something Arjan suggested which allows us to clean up the code nicely

    Signed-off-by: Alan Cox
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • Fix the rather strange buffer management on open that turned up while auditing
    for BKL dependencies.

    Signed-off-by: Alan Cox
    Cc: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • Signed-off-by: Alan Cox
    Cc: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • Clean up the epca driver

    Signed-off-by: Alan Cox
    Cc: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • Signed-off-by: Alan Cox
    Cc: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • - Use the tty baud functions
    - Call driver termios methods directly holding the right locking
    - Check for a write method

    Signed-off-by: Alan Cox
    Cc: David S. Miller
    Cc: Jeff Garzik
    Cc: "John W. Linville"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • - Operations are now a shared const function block as with most other Linux
    objects

    - Introduce wrappers for some optional functions to get consistent behaviour

    - Wrap put_char which used to be patched by the tty layer

    - Document which functions are needed/optional

    - Make put_char report success/fail

    - Cache the driver->ops pointer in the tty as tty->ops

    - Remove various surplus lock calls we no longer need

    - Remove proc_write method as noted by Alexey Dobriyan

    - Introduce some missing sanity checks where certain driver/ldisc
    combinations would oops as they didn't check needed methods were present

    [akpm@linux-foundation.org: fix fs/compat_ioctl.c build]
    [akpm@linux-foundation.org: fix isicom]
    [akpm@linux-foundation.org: fix arch/ia64/hp/sim/simserial.c build]
    [akpm@linux-foundation.org: fix kgdb]
    Signed-off-by: Alan Cox
    Acked-by: Greg Kroah-Hartman
    Cc: Jason Wessel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • [akpm@linux-foundation.org: fix arm, cleanups]
    Signed-off-by: Alan Cox
    Cc: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox