29 Apr, 2011

1 commit

  • Azurit reports large increases in system time after 2.6.36 when running
    Apache. It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
    to allocate fdmem if possible").

    That patch caused the vfs to use kmalloc() for very large allocations and
    this is causing excessive work (and presumably excessive reclaim) within
    the page allocator.

    Fix it by falling back to vmalloc() earlier - when the allocation attempt
    would have been considered "costly" by reclaim.

    Reported-by: azurIt
    Tested-by: azurIt
    Acked-by: Changli Gao
    Cc: Americo Wang
    Cc: Jiri Slaby
    Acked-by: Eric Dumazet
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
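
The size rule described in this commit can be sketched in user space roughly as follows. This is a model, not the kernel code: PAGE_SIZE_MODEL, COSTLY_ORDER_MODEL and pick_allocator() are hypothetical names, and the order value 3 is an assumption meant to mirror the kernel's PAGE_ALLOC_COSTLY_ORDER.

```c
#include <stddef.h>

/* Assumed values for illustration: 4 KiB pages and a "costly" order of 3. */
#define PAGE_SIZE_MODEL    4096UL
#define COSTLY_ORDER_MODEL 3

enum allocator { USE_KMALLOC, USE_VMALLOC };

/*
 * Requests larger than PAGE_SIZE << COSTLY_ORDER skip the kmalloc-style
 * path, so the page allocator is never asked to work hard for a large
 * physically contiguous block.
 */
enum allocator pick_allocator(size_t size)
{
    if (size <= (PAGE_SIZE_MODEL << COSTLY_ORDER_MODEL))
        return USE_KMALLOC;
    return USE_VMALLOC;
}
```

The model only shows the size decision. Falling back when the first attempt fails is not modeled here.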
     

11 Aug, 2010

1 commit

  • Use kmalloc() to allocate fdmem if possible.

    vmalloc() is used as a fallback solution for fdmem allocation. A new
    helper function __free_fdtable() is introduced to reduce the lines of
    code.

    A potential bug, calling vfree() on memory allocated by kmalloc(), is fixed.

    [akpm@linux-foundation.org: use __GFP_NOWARN, uninline alloc_fdmem() and free_fdmem()]
    Signed-off-by: Changli Gao
    Cc: Alexander Viro
    Cc: Jiri Slaby
    Cc: "Paul E. McKenney"
    Cc: Alexey Dobriyan
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Avi Kivity
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Changli Gao
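
A minimal user-space sketch of the "try one allocator, fall back to another, free with the matching routine" pattern follows. All names are hypothetical, both paths are backed by malloc(), and the flag-based bookkeeping is an assumption of this model rather than the kernel's mechanism.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical record of which path produced a buffer. */
struct fdmem {
    void *ptr;
    bool from_fallback;
};

/* Counters that show which free path ran. */
int primary_frees, fallback_frees;

/* Stand-in for the preferred allocator: refuses requests above limit. */
static void *primary_alloc(size_t size, size_t limit)
{
    return size <= limit ? malloc(size) : NULL;
}

/* Try the preferred allocator first, then the fallback. */
struct fdmem alloc_fdmem_model(size_t size, size_t limit)
{
    struct fdmem m = { primary_alloc(size, limit), false };

    if (!m.ptr) {
        m.ptr = malloc(size); /* stand-in for the fallback allocator */
        m.from_fallback = true;
    }
    return m;
}

/* Free through the path that matches the allocation. */
void free_fdmem_model(struct fdmem *m)
{
    if (m->from_fallback)
        fallback_frees++;
    else
        primary_frees++;
    free(m->ptr);
    m->ptr = NULL;
}
```

Keeping allocation and release in one pair of helpers is what prevents the mismatched free that the commit message mentions.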
     

15 Jun, 2010

1 commit


07 Mar, 2010

1 commit

  • Make sure compiler won't do weird things with limits. E.g. fetching them
    twice may return 2 different values after writable limits are implemented.

    I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource:
    add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
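
The "fetch the limit once" idea can be illustrated with a small user-space sketch. READ_ONCE_MODEL, soft_limit and fd_allowed() are hypothetical names. The volatile cast is only similar in spirit to ACCESS_ONCE and is not claimed to be its definition.

```c
#include <stdbool.h>

/* Model of a limit that another thread may rewrite at any time. */
unsigned long soft_limit = 1024;

/* Force exactly one load of the variable. */
#define READ_ONCE_MODEL(x) (*(volatile unsigned long *)&(x))

/* Both checks use the same snapshot, so they cannot disagree. */
bool fd_allowed(unsigned long fd)
{
    unsigned long limit = READ_ONCE_MODEL(soft_limit);

    return limit != 0 && fd < limit;
}
```

Without the snapshot, a compiler would be free to read soft_limit once per comparison, and the two reads could observe different values.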
     

25 Feb, 2010

1 commit

  • Add lockdep-ified RCU primitives to alloc_fd(), files_fdtable()
    and fcheck_files().

    Cc: Alexander Viro
    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    Cc: Alexander Viro
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

12 Oct, 2009

1 commit


01 Aug, 2008

1 commit


27 Jul, 2008

1 commit

  • * dup2() should return -EBADF on exceeded sysctl_nr_open
    * dup() should *not* return -EINVAL even if you have rlimit set to 0;
    it should get -EMFILE instead.

    Check for orig_start exceeding rlimit taken to sys_fcntl().
    Failing expand_files() in dup{2,3}() now gets -EMFILE remapped to -EBADF.
    Consequently, remaining checks for rlimit are taken to expand_files().

    Signed-off-by: Al Viro

    Al Viro
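
The error remapping described in this commit can be sketched as below. This is a simplified model with hypothetical function names. It checks only an rlimit and leaves out sysctl_nr_open and the real table expansion.

```c
#include <errno.h>

/* Model of expand_files(): fails with -EMFILE at or beyond the rlimit. */
int expand_files_model(unsigned int fd, unsigned int rlimit)
{
    return fd >= rlimit ? -EMFILE : 0;
}

/* dup2()-style caller: an out-of-range target is reported as -EBADF. */
int dup2_check_model(unsigned int newfd, unsigned int rlimit)
{
    int err = expand_files_model(newfd, rlimit);

    if (err == -EMFILE)
        err = -EBADF;
    return err;
}

/* dup()-style caller: -EMFILE is passed through unchanged. */
int dup_check_model(unsigned int start, unsigned int rlimit)
{
    return expand_files_model(start, rlimit);
}
```

With an rlimit of 0, dup_check_model() yields -EMFILE rather than -EINVAL, which is the behaviour the commit message asks for.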
     

17 May, 2008

6 commits


02 May, 2008

2 commits


07 Feb, 2008

1 commit

  • NR_OPEN (historically set to 1024*1024) prevents processes from opening
    more than 1024*1024 handles.

    Unfortunately, some production servers hit the not so 'ridiculously high
    value' of 1024*1024 file descriptors per process.

    Changing NR_OPEN is not considered safe because it could exhaust vmalloc
    space.

    This patch introduces a new sysctl (/proc/sys/fs/nr_open) which defaults to
    1024*1024, so that admins can decide to change this limit if their workload
    needs it.

    [akpm@linux-foundation.org: export it for sparc64]
    Signed-off-by: Eric Dumazet
    Cc: Alan Cox
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: "David S. Miller"
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
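
As an illustration, the sysctl can be read from user space like any other procfs file. The helper below is a hypothetical sketch: read_limit_file() is not an existing API, and it simply parses one integer from the path it is given.

```c
#include <stdio.h>

/* Returns the integer stored in the file, or -1 if it cannot be read. */
long read_limit_file(const char *path)
{
    FILE *f = fopen(path, "r");
    long value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Calling read_limit_file("/proc/sys/fs/nr_open") on a kernel that carries this patch should return the current limit, which per the commit message defaults to 1024*1024.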
     

23 Dec, 2006

1 commit

  • Christoph Hellwig has expressed concerns that the recent fdtable changes
    expose the details of the RCU methodology used to release no-longer-used
    fdtable structures to the rest of the kernel. The trivial patch below
    addresses these concerns by introducing the appropriate free_fdtable()
    calls, which simply wrap the release RCU usage. Since free_fdtable() is a
    one-liner, it makes sense to promote it to an inline helper.

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
     

11 Dec, 2006

3 commits

  • This patch provides an improved fdtable allocation scheme, useful for
    expanding fdtable file descriptor entries. The main focus is on the fdarray,
    as its memory usage grows 128 times faster than that of an fdset.

    The allocation algorithm sizes the fdarray in such a way that its memory usage
    increases in easy page-sized chunks. The overall algorithm expands the allowed
    size in powers of two, in order to amortize the cost of invoking vmalloc() for
    larger allocation sizes. Namely, the following sizes for the fdarray are
    considered, and the smallest that accommodates the requested fd count is
    chosen:

    pagesize / 4
    pagesize / 2
    pagesize

    [...] open_fds is now used as the anchor for the fdset memory allocation.

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
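
The sizing rule for the fdarray can be sketched as follows. This is a loop-based illustration and not the kernel's code. fdarray_slots() is a hypothetical helper, and it assumes each fdarray entry is one pointer wide.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Walk the candidate sizes pagesize/4, pagesize/2, pagesize, and onward
 * in powers of two, and return the slot count of the smallest one that
 * holds nr entries. Returns 0 if the size would overflow.
 */
size_t fdarray_slots(size_t nr, size_t pagesize)
{
    size_t bytes = pagesize / 4;

    while (bytes / sizeof(void *) < nr) {
        if (bytes == 0 || bytes > SIZE_MAX / 2)
            return 0;
        bytes *= 2;
    }
    return bytes / sizeof(void *);
}
```

Because every candidate size is a power-of-two multiple or fraction of the page size, growth past one page always lands on whole pages.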
     
  • An fdtable can either be embedded inside a files_struct or standalone (after
    being expanded). When an fdtable is being discarded after all RCU references
    to it have expired, we must either free it directly, in the standalone case,
    or free the files_struct it is contained within, in the embedded case.

    Currently the free_files field controls this behavior, but we can get rid of
    it entirely, as all the necessary information is already recorded. We can
    distinguish embedded and standalone fdtables using max_fds, and if it is
    embedded we can divine the relevant files_struct using container_of().

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
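
A user-space sketch of the idea: use the capacity to tell an embedded table from a standalone one, and recover the enclosing structure with a container_of-style macro. The struct layouts, the EMBEDDED_FDS value and the function names are assumptions made for this example.

```c
#include <stdbool.h>
#include <stddef.h>

#define EMBEDDED_FDS 32 /* assumed capacity of the embedded table */

struct fdtable_model {
    unsigned int max_fds;
};

struct files_model {
    int count;
    struct fdtable_model fdtab; /* the embedded table */
};

/* Recover a pointer to the enclosing struct from a member pointer. */
#define container_of_model(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* In this model a table is embedded iff it has the embedded capacity. */
bool fdtable_is_embedded(const struct fdtable_model *fdt)
{
    return fdt->max_fds <= EMBEDDED_FDS;
}

/* Returns the owning files_model, or NULL for a standalone table. */
struct files_model *files_of(struct fdtable_model *fdt)
{
    if (!fdtable_is_embedded(fdt))
        return NULL;
    return container_of_model(fdt, struct files_model, fdtab);
}
```

No separate flag is needed because the capacity already encodes which case applies.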
     
  • Currently, each fdtable supports three dynamically-sized arrays of data: the
    fdarray and two fdsets. The code allows the number of fds supported by the
    fdarray (fdtable->max_fds) to differ from the number of fds supported by each
    of the fdsets (fdtable->max_fdset).

    In practice, it is wasteful for these two sizes to differ: whenever we hit a
    limit on the smaller-capacity structure, we will reallocate the entire fdtable
    and all the dynamic arrays within it, so any delta in the memory used by the
    larger-capacity structure will never be touched at all.

    Rather than hogging this excess, we shouldn't even allocate it in the first
    place, and keep the capacities of the fdarray and the fdsets equal. This
    patch removes fdtable->max_fdset. As an added bonus, most of the supporting
    code becomes simpler.

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
     

08 Dec, 2006

1 commit

  • free_fdtable_rc() schedules timer to reschedule fddef->wq if
    schedule_work() on it returns 0. However, schedule_work() guarantees that
    the target work is executed at least once after the scheduling regardless
    of its return value. 0 return simply means that the work was already
    pending and thus no further action was required.

    Another problem is that it used the constant '5' as the @expires argument to
    mod_timer().

    Kill unnecessary fddef->timer.

    Signed-off-by: Tejun Heo
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
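
The return-value semantics described above can be modeled in a few lines. This is a toy single-threaded model with hypothetical names, not the workqueue implementation.

```c
#include <stdbool.h>

struct work_model {
    bool pending;
    int runs;
};

/* Returns 1 if the work was newly queued, 0 if it was already pending. */
int schedule_work_model(struct work_model *w)
{
    if (w->pending)
        return 0;
    w->pending = true;
    return 1;
}

/* The worker runs a pending item once and clears the flag. */
void run_pending_model(struct work_model *w)
{
    if (w->pending) {
        w->pending = false;
        w->runs++;
    }
}
```

A return value of 0 is not a failure here: the item is still pending and will still run, so no retry timer is needed.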
     

22 Nov, 2006

1 commit

  • Pass the work_struct pointer to the work function rather than context data.
    The work function can use container_of() to work out the data.

    For the cases where the container of the work_struct may go away the moment the
    pending bit is cleared, it is made possible to defer the release of the
    structure by deferring the clearing of the pending bit.

    To make this work, an extra flag is introduced into the management side of the
    work_struct. This governs auto-release of the structure upon execution.

    Ordinarily, the work queue executor would release the work_struct for further
    scheduling or deallocation by clearing the pending bit prior to jumping to the
    work function. This means that, unless the driver makes some guarantee itself
    that the work_struct won't go away, the work function may not access anything
    else in the work_struct or its container lest they be deallocated. This is a
    problem if the auxiliary data is taken away (as done by the last patch).

    However, if the pending bit is *not* cleared before jumping to the work
    function, then the work function *may* access the work_struct and its container
    with no problems. But then the work function must itself release the
    work_struct by calling work_release().

    In most cases, automatic release is fine, so this is the default. Special
    initiators exist for the non-auto-release case (ending in _NAR).

    Signed-Off-By: David Howells

    David Howells
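
A small sketch of a work function that receives the work item itself and derives its context from it. The types and names are hypothetical stand-ins, and the release handling discussed above is not modeled.

```c
#include <stddef.h>

/* Stand-in for a work item that carries only its callback. */
struct work_item {
    void (*func)(struct work_item *work);
};

/* A structure that embeds the work item next to its own data. */
struct fd_defer_model {
    int freed;
    struct work_item wq;
};

#define container_of_model(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* The callback finds its data from the work item pointer alone. */
void free_fdtable_work_model(struct work_item *work)
{
    struct fd_defer_model *d =
        container_of_model(work, struct fd_defer_model, wq);

    d->freed++;
}
```

Since the context is derived from the embedded member, the work item no longer needs a separate data pointer.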
     

30 Sep, 2006

2 commits


27 Sep, 2006

1 commit


13 Jul, 2006

2 commits

  • We're supposed to go the next power of two if nfds==nr.

    Of `nr', not of `nfds'.

    Spotted by Rene Scharfe

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When found, it is obvious. nfds calculated when allocating fdsets is
    rewritten by calculation of size of fdtable, and when we are unlucky, we
    try to free fdsets of wrong size.

    Found due to OpenVZ resource management (User Beancounters).

    Signed-off-by: Alexey Kuznetsov
    Signed-off-by: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     

11 Jul, 2006

1 commit


29 Mar, 2006

1 commit


23 Mar, 2006

1 commit

  • 1) Reduce the size of (struct fdtable) to exactly 64 bytes on 32-bit
    platforms, lowering kmalloc() allocated space by 50%.

    2) Reduce the size of (files_struct), using a special 32-bit (or
    64-bit) embedded_fd_set, instead of a 1024-bit fd_set for the
    close_on_exec_init and open_fds_init fields. This saves some RAM (248
    bytes per task) as most tasks don't open more than 32 files. D-Cache
    footprint for such tasks is also reduced to the minimum.

    3) Reduce the size of the allocated fdset. Currently two full pages are
    allocated, that is 32768 bits on x86 for example, which is way too much. The
    minimum is now L1_CACHE_BYTES.

    UP and SMP should benefit from this patch, because most tasks will touch
    only one cache line when calling open()/close() on stdin/stdout/stderr
    (0/1/2), since next_fd, close_on_exec_init, open_fds_init and
    fd_array[0 .. 2] are in the same cache line.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
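
The 248-byte figure in point 2 can be checked with a short calculation. The helper name is hypothetical, and the arithmetic assumes exactly two sets (close_on_exec_init and open_fds_init) shrink from 1024 bits to one machine word each.

```c
#include <stddef.h>

#define FD_SET_BITS 1024 /* size of a full fd_set, per the text */

/* Bytes saved by replacing two 1024-bit sets with two single words. */
size_t embedded_fd_set_saving(size_t word_bytes)
{
    size_t before = 2 * (FD_SET_BITS / 8); /* 2 * 128 bytes */
    size_t after = 2 * word_bytes;

    return before - after;
}
```

With 4-byte words this gives 256 - 8 = 248 bytes, matching the figure quoted for 32-bit platforms. With 8-byte words the same calculation gives 240 bytes.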
     

06 Feb, 2006

1 commit

  • percpu_data blindly allocates bootmem memory to store NR_CPUS instances of
    cpudata, instead of allocating memory only for possible cpus.

    As a preparation for changing that, we need to convert various 0 -> NR_CPUS
    loops to use for_each_cpu().

    (The above only applies to users of asm-generic/percpu.h. powerpc has gone it
    alone and is presently only allocating memory for present CPUs, so it's
    currently corrupting memory).

    Signed-off-by: Eric Dumazet
    Cc: "David S. Miller"
    Cc: James Bottomley
    Acked-by: Ingo Molnar
    Cc: Jens Axboe
    Cc: Anton Blanchard
    Acked-by: William Irwin
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
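
The loop conversion can be illustrated with a user-space model in which the set of possible CPUs is a bitmask. The macro and names below are hypothetical and only imitate the shape of for_each_cpu().

```c
#define NR_CPUS_MODEL 32

/* Visit only the CPUs whose bit is set in the mask. */
#define for_each_cpu_model(cpu, mask) \
    for ((cpu) = 0; (cpu) < NR_CPUS_MODEL; (cpu)++) \
        if ((mask) & (1UL << (cpu)))

/* Sum per-CPU counters, skipping slots for CPUs that cannot exist. */
int sum_counters(const int *percpu, unsigned long possible_mask)
{
    int cpu, total = 0;

    for_each_cpu_model(cpu, possible_mask)
        total += percpu[cpu];
    return total;
}
```

A plain 0 -> NR_CPUS loop would also touch the slots of impossible CPUs, which stops being safe once memory is allocated only for possible ones.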
     

15 Sep, 2005

1 commit

  • Noted by David Miller:

    "The bug is that free_fd_array() takes a "num" argument, but when
    calling it from __free_fdtable() we're instead passing in the size in
    bytes (ie. "num * sizeof(struct file *)")."

    Yes it is a bug. I think I messed it up while merging newer
    changes with an older version where I was using size in bytes
    to optimize.

    Signed-off-by: Dipankar Sarma
    Signed-off-by: Linus Torvalds

    Dipankar Sarma
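
The count-versus-bytes mix-up can be shown with a toy model. The function names are hypothetical. The callee expects an element count and derives the byte size itself.

```c
#include <stddef.h>

/* The callee takes an element count and computes the byte size. */
size_t fd_array_size_bytes(size_t num)
{
    return num * sizeof(void *);
}

/* Correct caller: pass the count. */
size_t free_size_correct(size_t max_fds)
{
    return fd_array_size_bytes(max_fds);
}

/* Buggy caller from the report: pass bytes where a count is expected. */
size_t free_size_buggy(size_t max_fds)
{
    return fd_array_size_bytes(max_fds * sizeof(void *));
}
```

The buggy path scales the size by sizeof(void *) twice, so the release code would act on a size several times larger than what was allocated.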
     

10 Sep, 2005

2 commits

  • Patch to eliminate struct files_struct.file_lock spinlock on the reader side
    and use rcu refcounting rcuref_xxx api for the f_count refcounter. The
    updates to the fdtable are done by allocating a new fdtable structure and
    setting files->fdt to point to the new structure. The fdtable structure is
    protected by RCU thereby allowing lock-free lookup. For fd arrays/sets that
    are vmalloced, we use keventd to free them since RCU callbacks can't sleep. A
    global list of fdtable to be freed is not scalable, so we use a per-cpu list.
    If keventd is already handling the current cpu's work, we use a timer to defer
    queueing of that work.

    Since the last publication, this patch has been re-written to avoid using
    explicit memory barriers and use rcu_assign_pointer(), rcu_dereference()
    primitives instead. This required that the fd information is kept in a
    separate structure (fdtable) and updated atomically.

    Signed-off-by: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dipankar Sarma
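
The copy-then-publish update described above can be approximated in user space with C11 atomics. This is a sketch under stated assumptions: the types and functions are hypothetical, acquire/release ordering stands in for rcu_dereference()/rcu_assign_pointer(), and the grace period before freeing the old table is not modeled.

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct fdtable_model {
    unsigned int max_fds;
    int fd[]; /* stand-in for the struct file pointers */
};

struct files_model {
    _Atomic(struct fdtable_model *) fdt;
};

/* Reader: take one snapshot of the table pointer, then use only that. */
int lookup_fd(struct files_model *files, unsigned int n)
{
    struct fdtable_model *t =
        atomic_load_explicit(&files->fdt, memory_order_acquire);

    return n < t->max_fds ? t->fd[n] : -1;
}

/*
 * Writer: build a larger copy, then publish it with a single store.
 * Returns the old table, or NULL on failure. The caller must free the
 * old table only after all readers are done with it.
 */
struct fdtable_model *expand_table(struct files_model *files,
                                   unsigned int new_max)
{
    struct fdtable_model *old =
        atomic_load_explicit(&files->fdt, memory_order_relaxed);
    struct fdtable_model *bigger;

    if (new_max < old->max_fds)
        return NULL;
    bigger = calloc(1, sizeof(*bigger) + new_max * sizeof(int));
    if (!bigger)
        return NULL;
    bigger->max_fds = new_max;
    memcpy(bigger->fd, old->fd, old->max_fds * sizeof(int));
    atomic_store_explicit(&files->fdt, bigger, memory_order_release);
    return old;
}
```

Because the array and its size live in one structure reached through one pointer, a reader always sees a matching pair without taking a lock.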
     
  • In order for the RCU to work, the file table array, sets and their sizes must
    be updated atomically. Instead of ensuring this through too many memory
    barriers, we put the arrays and their sizes in a separate structure. This
    patch takes the first step of putting the file table elements in a separate
    structure fdtable that is embedded within files_struct. It also changes all
    the users to refer to the file table using the files_fdtable() macro. Subsequent
    application of RCU becomes easier after this.

    Signed-off-by: Dipankar Sarma
    Signed-Off-By: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dipankar Sarma
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds