30 Mar, 2012

1 commit

  • Pull x32 support for x86-64 from Ingo Molnar:
    "This tree introduces the X32 binary format and execution mode for x86:
    32-bit data space binaries using 64-bit instructions and 64-bit kernel
    syscalls.

    This allows applications whose working set fits into a 32-bit address
    space to make use of 64-bit instructions while using a 32-bit address
    space with shorter pointers, more compressed data structures, etc."

    Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c}

    * 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
    x32: Fix alignment fail in struct compat_siginfo
    x32: Fix stupid ia32/x32 inversion in the siginfo format
    x32: Add ptrace for x32
    x32: Switch to a 64-bit clock_t
    x32: Provide separate is_ia32_task() and is_x32_task() predicates
    x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls
    x86/x32: Fix the binutils auto-detect
    x32: Warn and disable rather than error if binutils too old
    x32: Only clear TIF_X32 flag once
    x32: Make sure TS_COMPAT is cleared for x32 tasks
    fs: Remove missed ->fds_bits from cessation use of fd_set structs internally
    fs: Fix close_on_exec pointer in alloc_fdtable
    x32: Drop non-__vdso weak symbols from the x32 VDSO
    x32: Fix coding style violations in the x32 VDSO code
    x32: Add x32 VDSO support
    x32: Allow x32 to be configured
    x32: If configured, add x32 system calls to system call tables
    x32: Handle process creation
    x32: Signal-related system calls
    x86: Add #ifdef CONFIG_COMPAT to
    ...

    Linus Torvalds
     

29 Feb, 2012

1 commit


24 Feb, 2012

1 commit

  • alloc_fdtable allocates space for the open_fds and close_on_exec
    bitfields together, as 2 * nr / BITS_PER_BYTE. close_on_exec needs to
    point to open_fds + nr / BITS_PER_BYTE, not open_fds + nr /
    BITS_PER_LONG, as introduced in 1fd36adc: Replace the fd_sets in
    struct fdtable with an array of unsigned longs.

    Signed-off-by: Bobby Powers
    Link: http://lkml.kernel.org/r/1329888587-3087-1-git-send-email-bobbypowers@gmail.com
    Acked-by: David Howells
    Signed-off-by: H. Peter Anvin

    Bobby Powers
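
    The pointer fix above comes down to byte versus word offsets. A minimal
    userspace sketch, assuming a hypothetical struct fd_bitmaps and helper
    split_fd_bitmaps() (neither is kernel code), and assuming nr is a multiple
    of the number of bits in an unsigned long:

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the two bitmap pointers kept in struct fdtable. */
struct fd_bitmaps {
    unsigned long *open_fds;
    unsigned long *close_on_exec;
};

/*
 * Split one buffer of 2 * nr / CHAR_BIT bytes into two nr-bit bitmaps.
 * Assumption: nr is a multiple of the bits in an unsigned long, so the
 * second bitmap stays suitably aligned.
 */
static void split_fd_bitmaps(void *buf, size_t nr, struct fd_bitmaps *out)
{
    char *bytes = buf;

    out->open_fds = (unsigned long *)bytes;
    /* Byte arithmetic: the second bitmap starts nr / CHAR_BIT bytes in. */
    out->close_on_exec = (unsigned long *)(bytes + nr / CHAR_BIT);
}
```

    Advancing a byte pointer by nr / BITS_PER_LONG instead would place
    close_on_exec inside the open_fds bitmap, so the two sets would overlap.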
     

20 Feb, 2012

2 commits

  • Replace the fd_sets in struct fdtable with an array of unsigned longs and then
    use the standard non-atomic bit operations rather than the FD_* macros.

    This:

    (1) Removes the abuses of struct fd_set:

    (a) Since we don't want to allocate a full fd_set the vast majority of the
    time, we actually, in effect, just allocate a just-big-enough array of
    unsigned longs and cast it to an fd_set type - so why bother with the
    fd_set at all?

    (b) Some places outside of the core fdtable handling code (such as
    SELinux) want to look inside the array of unsigned longs hidden inside
    the fd_set struct for more efficient iteration over the entire set.

    (2) Eliminates the use of FD_*() macros in the kernel completely.

    (3) Permits the __FD_*() macros to be deleted entirely where not exposed to
    userspace.

    Signed-off-by: David Howells
    Link: http://lkml.kernel.org/r/20120216174954.23314.48147.stgit@warthog.procyon.org.uk
    Signed-off-by: H. Peter Anvin
    Cc: Al Viro

    David Howells
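
    Point (1)(b) above is about walking the set a word at a time. A minimal
    userspace sketch of that idea on a plain unsigned long array; the names
    bitmap_set_bit() and bitmap_count_set() are hypothetical and the code is
    not taken from the kernel:

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Set bit nr in a plain unsigned long array (non-atomic). */
static void bitmap_set_bit(unsigned long *map, size_t nr)
{
    map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

/*
 * Count the set bits among the first nbits bits.
 * Assumption: nbits is a multiple of BITS_PER_LONG.
 * An all-zero word costs a single comparison, which is the benefit of
 * iterating over words rather than testing every bit.
 */
static size_t bitmap_count_set(const unsigned long *map, size_t nbits)
{
    size_t count = 0;

    for (size_t w = 0; w < nbits / BITS_PER_LONG; w++) {
        unsigned long word = map[w];

        while (word) {
            word &= word - 1; /* clear the lowest set bit */
            count++;
        }
    }
    return count;
}
```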
     
  • Wrap accesses to the fd_sets in struct fdtable (for recording open files and
    close-on-exec flags) so that we can move away from using fd_sets since we
    abuse the fd_set structs by not allocating the full-sized structure under
    normal circumstances and by non-core code looking at the internals of the
    fd_sets.

    The first abuse means that use of FD_ZERO() on these fd_sets is not permitted,
    since that cannot be told about their abnormal lengths.

    This introduces six wrapper functions for setting, clearing and testing
    close-on-exec flags and fd-is-open flags:

    void __set_close_on_exec(int fd, struct fdtable *fdt);
    void __clear_close_on_exec(int fd, struct fdtable *fdt);
    bool close_on_exec(int fd, const struct fdtable *fdt);
    void __set_open_fd(int fd, struct fdtable *fdt);
    void __clear_open_fd(int fd, struct fdtable *fdt);
    bool fd_is_open(int fd, const struct fdtable *fdt);

    Note that I've prepended '__' to the names of the set/clear functions because
    they require the caller to hold a lock to use them.

    Note also that I haven't added wrappers for looking behind the scenes at
    the array. Possibly that should exist too.

    Signed-off-by: David Howells
    Link: http://lkml.kernel.org/r/20120216174942.23314.1364.stgit@warthog.procyon.org.uk
    Signed-off-by: H. Peter Anvin
    Cc: Al Viro

    David Howells
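
    The six wrappers listed above could look roughly like this in userspace.
    This is a sketch only: struct sk_fdtable, its fixed SK_MAX_FDS capacity
    and the sk_ prefixed names are assumptions made for illustration, and no
    locking is shown even though the commit requires the caller to hold a
    lock for the set/clear variants.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SK_MAX_FDS 128 /* hypothetical fixed capacity, for the sketch only */

struct sk_fdtable {
    unsigned int max_fds;
    unsigned long open_fds[SK_MAX_FDS / SK_BITS_PER_LONG];
    unsigned long close_on_exec[SK_MAX_FDS / SK_BITS_PER_LONG];
};

/* Word index and bit mask for a descriptor; fd is assumed to be in range. */
static size_t sk_word(int fd) { return (unsigned int)fd / SK_BITS_PER_LONG; }
static unsigned long sk_mask(int fd) { return 1UL << ((unsigned int)fd % SK_BITS_PER_LONG); }

static void sk_set_close_on_exec(int fd, struct sk_fdtable *fdt)
{
    fdt->close_on_exec[sk_word(fd)] |= sk_mask(fd);
}

static void sk_clear_close_on_exec(int fd, struct sk_fdtable *fdt)
{
    fdt->close_on_exec[sk_word(fd)] &= ~sk_mask(fd);
}

static bool sk_close_on_exec(int fd, const struct sk_fdtable *fdt)
{
    return (fdt->close_on_exec[sk_word(fd)] & sk_mask(fd)) != 0;
}

static void sk_set_open_fd(int fd, struct sk_fdtable *fdt)
{
    fdt->open_fds[sk_word(fd)] |= sk_mask(fd);
}

static void sk_clear_open_fd(int fd, struct sk_fdtable *fdt)
{
    fdt->open_fds[sk_word(fd)] &= ~sk_mask(fd);
}

static bool sk_fd_is_open(int fd, const struct sk_fdtable *fdt)
{
    return (fdt->open_fds[sk_word(fd)] & sk_mask(fd)) != 0;
}
```

    Callers that go through wrappers like these no longer depend on how the
    bits are stored, which is what lets the storage type change later.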
     

29 Apr, 2011

1 commit

  • Azurit reports large increases in system time after 2.6.36 when running
    Apache. It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
    to allocate fdmem if possible").

    That patch caused the vfs to use kmalloc() for very large allocations and
    this is causing excessive work (and presumably excessive reclaim) within
    the page allocator.

    Fix it by falling back to vmalloc() earlier - when the allocation attempt
    would have been considered "costly" by reclaim.

    Reported-by: azurIt
    Tested-by: azurIt
    Acked-by: Changli Gao
    Cc: Americo Wang
    Cc: Jiri Slaby
    Acked-by: Eric Dumazet
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
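
    The "costly" cut-off described above can be sketched as a size check.
    The constants here are assumptions for illustration: a 4096-byte page,
    and an order of 3 chosen to mirror the kernel's PAGE_ALLOC_COSTLY_ORDER.
    The function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_PAGE_SIZE 4096UL /* assumed page size */
#define SK_COSTLY_ORDER 3   /* assumed; mirrors PAGE_ALLOC_COSTLY_ORDER */

/*
 * Return true if an allocation of this size should first be attempted
 * with the physically contiguous allocator, false if it should go
 * straight to the vmalloc-style fallback.
 */
static bool sk_try_kmalloc_first(size_t size)
{
    return size <= (SK_PAGE_SIZE << SK_COSTLY_ORDER);
}
```

    Under these assumptions the switch happens at 32 KiB: anything larger
    skips the contiguous attempt entirely.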
     

11 Aug, 2010

1 commit

  • Use kmalloc() to allocate fdmem if possible.

    vmalloc() is used as a fallback solution for fdmem allocation. A new
    helper function __free_fdtable() is introduced to reduce the lines of
    code.

    A potential bug, calling vfree() on memory allocated by kmalloc(), is fixed.

    [akpm@linux-foundation.org: use __GFP_NOWARN, uninline alloc_fdmem() and free_fdmem()]
    Signed-off-by: Changli Gao
    Cc: Alexander Viro
    Cc: Jiri Slaby
    Cc: "Paul E. McKenney"
    Cc: Alexey Dobriyan
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Avi Kivity
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Changli Gao
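
    The allocate-then-fall-back pattern, and the need to free with the
    matching routine, can be simulated in userspace. Everything below is a
    stand-in: both "allocators" are malloc(), the fast one is made to fail
    above SK_FAST_LIMIT, and a small tag header records which path was used.
    The kernel has its own way of telling the two kinds of memory apart; the
    tag is purely an illustration device.

```c
#include <stddef.h>
#include <stdlib.h>

#define SK_FAST_LIMIT 4096 /* hypothetical size above which the fast path fails */

enum sk_origin { SK_FROM_FAST, SK_FROM_FALLBACK };

/* Tag header placed in front of the payload; the union keeps alignment. */
union sk_hdr {
    enum sk_origin origin;
    max_align_t align;
};

static int sk_fast_frees, sk_fallback_frees;

/* Simulated fast allocator: refuses requests that are too large. */
static void *sk_fast_alloc(size_t size)
{
    return size <= SK_FAST_LIMIT ? malloc(size) : NULL;
}

/* Simulated fallback allocator: accepts any size malloc() accepts. */
static void *sk_fallback_alloc(size_t size)
{
    return malloc(size);
}

static void *sk_alloc_fdmem(size_t size)
{
    enum sk_origin origin = SK_FROM_FAST;
    union sk_hdr *hdr = sk_fast_alloc(sizeof(*hdr) + size);

    if (!hdr) {
        origin = SK_FROM_FALLBACK;
        hdr = sk_fallback_alloc(sizeof(*hdr) + size);
    }
    if (!hdr)
        return NULL;
    hdr->origin = origin;
    return hdr + 1;
}

/* Free with the routine that matches the allocation path. */
static void sk_free_fdmem(void *ptr)
{
    union sk_hdr *hdr;

    if (!ptr)
        return;
    hdr = (union sk_hdr *)ptr - 1;
    if (hdr->origin == SK_FROM_FAST)
        sk_fast_frees++;
    else
        sk_fallback_frees++;
    free(hdr); /* both simulated paths are backed by malloc() */
}
```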
     

15 Jun, 2010

1 commit


07 Mar, 2010

1 commit

  • Make sure compiler won't do weird things with limits. E.g. fetching them
    twice may return 2 different values after writable limits are implemented.

    I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource:
    add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
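
    The double-fetch hazard above can be illustrated with a single snapshot
    of the limit. SK_READ_ONCE is a userspace approximation written for this
    sketch, not the kernel macro, and sk_fd_range_ok() is a hypothetical
    helper:

```c
#include <stdbool.h>

/* Userspace approximation of a read-once access: force exactly one load. */
#define SK_READ_ONCE(x) (*(const volatile unsigned long *)&(x))

/*
 * Check a descriptor range against a limit that another thread may
 * rewrite at any time. The limit is loaded once, so both comparisons
 * see the same value.
 */
static bool sk_fd_range_ok(const unsigned long *limit,
                           unsigned long start, unsigned long end)
{
    unsigned long lim = SK_READ_ONCE(*limit); /* single snapshot */

    return start < lim && end <= lim;
}
```

    Without the snapshot, the compiler would be free to load *limit once per
    comparison, and the two checks could disagree about the limit.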
     

25 Feb, 2010

1 commit

  • Add lockdep-ified RCU primitives to alloc_fd(), files_fdtable()
    and fcheck_files().

    Cc: Alexander Viro
    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    Cc: Alexander Viro
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

12 Oct, 2009

1 commit


01 Aug, 2008

1 commit


27 Jul, 2008

1 commit

  • * dup2() should return -EBADF on exceeded sysctl_nr_open
    * dup() should *not* return -EINVAL even if you have rlimit set to 0;
    it should get -EMFILE instead.

    Check for orig_start exceeding rlimit taken to sys_fcntl().
    Failing expand_files() in dup{2,3}() now gets -EMFILE remapped to -EBADF.
    Consequently, remaining checks for rlimit are taken to expand_files().

    Signed-off-by: Al Viro

    Al Viro
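
    The error remapping described above can be modelled without touching
    real descriptors. The sk_ functions below are hypothetical simplifications
    that only reproduce the return codes named in the commit message; they
    take the limits as plain parameters and do no table management.

```c
#include <errno.h>

/*
 * Simplified model of the limit checks: a descriptor at or beyond
 * either limit cannot be accommodated.
 */
static int sk_expand_files(unsigned int fd, unsigned int rlimit_nofile,
                           unsigned int nr_open)
{
    if (fd >= rlimit_nofile || fd >= nr_open)
        return -EMFILE;
    return 0;
}

/* dup(): a failed expansion is reported as -EMFILE, never -EINVAL. */
static int sk_dup_check(unsigned int lowest_free, unsigned int rlimit_nofile,
                        unsigned int nr_open)
{
    return sk_expand_files(lowest_free, rlimit_nofile, nr_open);
}

/* dup2()/dup3(): the same failure is remapped to -EBADF. */
static int sk_dup2_check(unsigned int newfd, unsigned int rlimit_nofile,
                         unsigned int nr_open)
{
    int err = sk_expand_files(newfd, rlimit_nofile, nr_open);

    return err == -EMFILE ? -EBADF : err;
}
```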
     

17 May, 2008

6 commits


02 May, 2008

2 commits


07 Feb, 2008

1 commit

  • NR_OPEN (historically set to 1024*1024) prevents a process from opening
    more than 1024*1024 handles.

    Unfortunately, some production servers hit the not so 'ridiculously high
    value' of 1024*1024 file descriptors per process.

    Changing NR_OPEN is not considered safe because it could exhaust vmalloc
    space.

    This patch introduces a new sysctl (/proc/sys/fs/nr_open) which defaults to
    1024*1024, so that admins can decide to change this limit if their workload
    needs it.

    [akpm@linux-foundation.org: export it for sparc64]
    Signed-off-by: Eric Dumazet
    Cc: Alan Cox
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: "David S. Miller"
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
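
    Reading the current value of the sysctl from a program is a plain file
    read. A minimal sketch, assuming a hypothetical helper sk_read_nr_open()
    that takes the path as a parameter so it can also be pointed at a test
    file:

```c
#include <stdio.h>

/*
 * Read a single decimal value from the given path, for example
 * "/proc/sys/fs/nr_open". Returns -1 if the file cannot be opened
 * or does not start with a number.
 */
static long sk_read_nr_open(const char *path)
{
    FILE *f = fopen(path, "r");
    long value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

    On a kernel without this sysctl, or outside Linux, the helper simply
    returns -1.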
     

23 Dec, 2006

1 commit

  • Christoph Hellwig has expressed concerns that the recent fdtable changes
    expose the details of the RCU methodology used to release no-longer-used
    fdtable structures to the rest of the kernel. The trivial patch below
    addresses these concerns by introducing the appropriate free_fdtable()
    calls, which simply wrap the release RCU usage. Since free_fdtable() is a
    one-liner, it makes sense to promote it to an inline helper.

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
     

11 Dec, 2006

3 commits

  • This patch provides an improved fdtable allocation scheme, useful for
    expanding fdtable file descriptor entries. The main focus is on the fdarray,
    as its memory usage grows 128 times faster than that of an fdset.

    The allocation algorithm sizes the fdarray in such a way that its memory usage
    increases in easy page-sized chunks. The overall algorithm expands the allowed
    size in powers of two, in order to amortize the cost of invoking vmalloc() for
    larger allocation sizes. Namely, the following sizes for the fdarray are
    considered, and the smallest that accommodates the requested fd count is
    chosen:

    pagesize / 4
    pagesize / 2
    pagesize
    ...

    open_fds is now used as the anchor for the fdset memory allocation.

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
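
    The size ladder described above (pagesize / 4, pagesize / 2, pagesize,
    then further powers of two) can be sketched as follows. The helper
    sk_fdarray_slots() is hypothetical and uses a loop for readability; it is
    not the kernel's computation, and it assumes page_size / 4 is at least
    ptr_size and that fd is small enough for the doubling not to overflow.

```c
#include <stddef.h>

/*
 * Return the number of pointer slots in the smallest fdarray on the
 * ladder page_size/4, page_size/2, page_size, 2*page_size, ... that
 * can hold descriptor index fd (that is, has more than fd slots).
 */
static size_t sk_fdarray_slots(size_t fd, size_t page_size, size_t ptr_size)
{
    size_t bytes = page_size / 4;

    while (bytes / ptr_size <= fd)
        bytes *= 2;
    return bytes / ptr_size;
}
```

    With 4096-byte pages and 8-byte pointers, descriptors 0 to 127 fit in the
    first step (128 slots), and descriptor 128 moves to the next (256 slots).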
     
  • An fdtable can either be embedded inside a files_struct or standalone (after
    being expanded). When an fdtable is being discarded after all RCU references
    to it have expired, we must either free it directly, in the standalone case,
    or free the files_struct it is contained within, in the embedded case.

    Currently the free_files field controls this behavior, but we can get rid of
    it entirely, as all the necessary information is already recorded. We can
    distinguish embedded and standalone fdtables using max_fds, and if it is
    embedded we can divine the relevant files_struct using container_of().

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
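
    The embedded-versus-standalone test and the container_of() step can be
    sketched in userspace. The structures, the SK_EMBEDDED_FDS capacity and
    the sk_container_of macro below are simplified assumptions, not the
    kernel definitions:

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_EMBEDDED_FDS 32 /* hypothetical capacity of the embedded table */

/* Recover the enclosing structure from a pointer to one of its members. */
#define sk_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct sk_fdtable {
    unsigned int max_fds;
};

struct sk_files {
    int count;
    struct sk_fdtable fdtab; /* embedded table used until first expansion */
};

/* An embedded table never holds more than SK_EMBEDDED_FDS entries. */
static bool sk_is_embedded(const struct sk_fdtable *fdt)
{
    return fdt->max_fds <= SK_EMBEDDED_FDS;
}

/* Return the owning sk_files for an embedded table, NULL for a standalone one. */
static struct sk_files *sk_owner(struct sk_fdtable *fdt)
{
    if (!sk_is_embedded(fdt))
        return NULL;
    return sk_container_of(fdt, struct sk_files, fdtab);
}
```

    Because max_fds alone answers the question, no separate field is needed
    to remember which object has to be freed.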
     
  • Currently, each fdtable supports three dynamically-sized arrays of data: the
    fdarray and two fdsets. The code allows the number of fds supported by the
    fdarray (fdtable->max_fds) to differ from the number of fds supported by each
    of the fdsets (fdtable->max_fdset).

    In practice, it is wasteful for these two sizes to differ: whenever we hit a
    limit on the smaller-capacity structure, we will reallocate the entire fdtable
    and all the dynamic arrays within it, so any delta in the memory used by the
    larger-capacity structure will never be touched at all.

    Rather than hogging this excess, we shouldn't even allocate it in the first
    place, and keep the capacities of the fdarray and the fdsets equal. This
    patch removes fdtable->max_fdset. As an added bonus, most of the supporting
    code becomes simpler.

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
     

08 Dec, 2006

1 commit

  • free_fdtable_rcu() schedules a timer to reschedule fddef->wq if
    schedule_work() on it returns 0. However, schedule_work() guarantees that
    the target work is executed at least once after the scheduling regardless
    of its return value. A return value of 0 simply means that the work was
    already pending and thus no further action was required.

    Another problem is that it used the constant '5' as the @expires argument
    to mod_timer().

    Kill unnecessary fddef->timer.

    Signed-off-by: Tejun Heo
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

22 Nov, 2006

1 commit

  • Pass the work_struct pointer to the work function rather than context data.
    The work function can use container_of() to work out the data.

    For the cases where the container of the work_struct may go away the moment the
    pending bit is cleared, it is made possible to defer the release of the
    structure by deferring the clearing of the pending bit.

    To make this work, an extra flag is introduced into the management side of the
    work_struct. This governs auto-release of the structure upon execution.

    Ordinarily, the work queue executor would release the work_struct for further
    scheduling or deallocation by clearing the pending bit prior to jumping to the
    work function. This means that, unless the driver makes some guarantee itself
    that the work_struct won't go away, the work function may not access anything
    else in the work_struct or its container lest they be deallocated. This is a
    problem if the auxiliary data is taken away (as done by the last patch).

    However, if the pending bit is *not* cleared before jumping to the work
    function, then the work function *may* access the work_struct and its container
    with no problems. But then the work function must itself release the
    work_struct by calling work_release().

    In most cases, automatic release is fine, so this is the default. Special
    initiators exist for the non-auto-release case (ending in _NAR).

    Signed-Off-By: David Howells

    David Howells
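
    The calling convention described above, where the work function receives
    the work item itself and derives its data from it, can be sketched in
    userspace. struct sk_work, struct sk_device and sk_run_work() are
    hypothetical stand-ins; the pending bit and the release step are not
    modelled.

```c
#include <stddef.h>

/* Recover the enclosing structure from a pointer to one of its members. */
#define sk_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Minimal work item: the callback receives the work item itself. */
struct sk_work {
    void (*func)(struct sk_work *work);
};

/* A driver-like object that embeds its work item. */
struct sk_device {
    int events_handled;
    struct sk_work work;
};

/* Work function: derive the device from the embedded work item. */
static void sk_device_work(struct sk_work *work)
{
    struct sk_device *dev = sk_container_of(work, struct sk_device, work);

    dev->events_handled++;
}

/* Stand-in for the executor: call the function with the work pointer. */
static void sk_run_work(struct sk_work *work)
{
    work->func(work);
}
```

    No separate context pointer has to be stored in the work item, at the
    cost of requiring the container to stay alive while the function runs.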
     

30 Sep, 2006

2 commits


27 Sep, 2006

1 commit


13 Jul, 2006

2 commits

  • We're supposed to go to the next power of two if nfds==nr.

    Of `nr', not of `nfds'.

    Spotted by Rene Scharfe

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When found, it is obvious. nfds calculated when allocating fdsets is
    rewritten by calculation of size of fdtable, and when we are unlucky, we
    try to free fdsets of wrong size.

    Found due to OpenVZ resource management (User Beancounters).

    Signed-off-by: Alexey Kuznetsov
    Signed-off-by: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     

11 Jul, 2006

1 commit


29 Mar, 2006

1 commit


23 Mar, 2006

1 commit

  • 1) Reduce the size of (struct fdtable) to exactly 64 bytes on 32-bit
    platforms, lowering kmalloc() allocated space by 50%.

    2) Reduce the size of (files_struct), using a special 32-bit (or
    64-bit) embedded_fd_set, instead of a 1024-bit fd_set for the
    close_on_exec_init and open_fds_init fields. This saves some RAM (248
    bytes per task) as most tasks don't open more than 32 files. D-Cache
    footprint for such tasks is also reduced to the minimum.

    3) Reduce size of allocated fdset. Currently two full pages are
    allocated, that is 32768 bits on x86 for example, which is far too much.
    The minimum is now L1_CACHE_BYTES.

    UP and SMP should benefit from this patch, because most tasks will touch
    only one cache line when open()/close() stdin/stdout/stderr (0/1/2),
    (next_fd, close_on_exec_init, open_fds_init, fd_array[0 .. 2] being in the
    same cache line)

    Signed-off-by: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
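
    The 248-byte figure in point 2 follows from simple arithmetic: two
    1024-bit sets shrink to two 32-bit sets. A small sketch of that
    calculation, using a hypothetical helper name:

```c
#include <limits.h>
#include <stddef.h>

/*
 * Bytes saved when the two per-task sets (open fds and close-on-exec)
 * shrink from full_bits each to embedded_bits each.
 */
static size_t sk_bytes_saved(size_t full_bits, size_t embedded_bits)
{
    return 2 * (full_bits / CHAR_BIT) - 2 * (embedded_bits / CHAR_BIT);
}
```

    sk_bytes_saved(1024, 32) gives 2 * 128 - 2 * 4 = 248 bytes; with 64-bit
    embedded sets the same formula gives 240 bytes.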
     

06 Feb, 2006

1 commit

  • percpu_data blindly allocates bootmem memory to store NR_CPUS instances of
    cpudata, instead of allocating memory only for possible cpus.

    As a preparation for changing that, we need to convert various 0 -> NR_CPUS
    loops to use for_each_cpu().

    (The above only applies to users of asm-generic/percpu.h. powerpc has gone it
    alone and is presently only allocating memory for present CPUs, so it's
    currently corrupting memory).

    Signed-off-by: Eric Dumazet
    Cc: "David S. Miller"
    Cc: James Bottomley
    Acked-by: Ingo Molnar
    Cc: Jens Axboe
    Cc: Anton Blanchard
    Acked-by: William Irwin
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

15 Sep, 2005

1 commit

  • Noted by David Miller:

    "The bug is that free_fd_array() takes a "num" argument, but when
    calling it from __free_fdtable() we're instead passing in the size in
    bytes (ie. "num * sizeof(struct file *)")."

    Yes it is a bug. I think I messed it up while merging newer
    changes with an older version where I was using size in bytes
    to optimize.

    Signed-off-by: Dipankar Sarma
    Signed-off-by: Linus Torvalds

    Dipankar Sarma
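
    Why a count-versus-bytes mix-up matters can be shown with a size-based
    free decision. This is an assumed simplification, not the code the
    commit touched: the helper name, the 4096-byte page and the rule "larger
    than one page means vmalloc-style memory" are all illustrative.

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_PAGE_SIZE 4096UL /* assumed page size */

/*
 * Decide how an fd array of num entries would be released: arrays
 * larger than one page are treated as vmalloc-style memory.
 * The argument is an entry count, not a size in bytes.
 */
static bool sk_fd_array_is_vmalloced(size_t num)
{
    return num * sizeof(void *) > SK_PAGE_SIZE;
}
```

    Passing a byte size where the count is expected multiplies by the
    pointer size twice, so a small array can be misclassified and released
    with the wrong routine.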
     

10 Sep, 2005

2 commits

  • Patch to eliminate struct files_struct.file_lock spinlock on the reader side
    and use rcu refcounting rcuref_xxx api for the f_count refcounter. The
    updates to the fdtable are done by allocating a new fdtable structure and
    setting files->fdt to point to the new structure. The fdtable structure is
    protected by RCU thereby allowing lock-free lookup. For fd arrays/sets that
    are vmalloced, we use keventd to free them since RCU callbacks can't sleep. A
    global list of fdtable to be freed is not scalable, so we use a per-cpu list.
    If keventd is already handling the current cpu's work, we use a timer to defer
    queueing of that work.

    Since the last publication, this patch has been re-written to avoid using
    explicit memory barriers and use rcu_assign_pointer(), rcu_dereference()
    primitives instead. This required that the fd information is kept in a
    separate structure (fdtable) and updated atomically.

    Signed-off-by: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dipankar Sarma
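
    The update scheme described above (build a new table, then switch one
    pointer) can be sketched with C11 atomics. This is a userspace analogue
    under stated assumptions: a release store stands in for
    rcu_assign_pointer(), an acquire load stands in conservatively for
    rcu_dereference(), writers are assumed to be serialised by a lock that
    is not shown, and freeing the old table after all readers have finished
    is left out. All sk_ names are hypothetical.

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct sk_fdtable {
    size_t max_fds;
    void **fd; /* one slot per descriptor */
};

struct sk_files {
    _Atomic(struct sk_fdtable *) fdt; /* current table, swapped as a whole */
};

static struct sk_fdtable *sk_alloc_fdtable(size_t max_fds)
{
    struct sk_fdtable *t = malloc(sizeof(*t));

    if (!t)
        return NULL;
    t->fd = calloc(max_fds, sizeof(*t->fd));
    if (!t->fd) {
        free(t);
        return NULL;
    }
    t->max_fds = max_fds;
    return t;
}

/* Reader: lock-free lookup through one consistent table snapshot. */
static void *sk_lookup(struct sk_files *files, size_t fd)
{
    struct sk_fdtable *t =
        atomic_load_explicit(&files->fdt, memory_order_acquire);

    return fd < t->max_fds ? t->fd[fd] : NULL;
}

/*
 * Writer: copy into a larger table and publish it with one store.
 * Assumes new_max >= the current max_fds. Returns the old table so
 * the caller can free it once no reader can still be using it, or
 * NULL if allocation failed.
 */
static struct sk_fdtable *sk_expand(struct sk_files *files, size_t new_max)
{
    struct sk_fdtable *old =
        atomic_load_explicit(&files->fdt, memory_order_relaxed);
    struct sk_fdtable *bigger = sk_alloc_fdtable(new_max);

    if (!bigger)
        return NULL;
    memcpy(bigger->fd, old->fd, old->max_fds * sizeof(*old->fd));
    atomic_store_explicit(&files->fdt, bigger, memory_order_release);
    return old;
}
```

    Keeping the array and its size in one structure means a reader can never
    pair a new size with an old array, which is the reason the commit gives
    for the separate fdtable.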
     
  • In order for the RCU to work, the file table array, sets and their sizes must
    be updated atomically. Instead of ensuring this through too many memory
    barriers, we put the arrays and their sizes in a separate structure. This
    patch takes the first step of putting the file table elements in a separate
    structure fdtable that is embedded within files_struct. It also changes all
    the users to refer to the file table using files_fdtable() macro. Subsequent
    application of RCU becomes easier after this.

    Signed-off-by: Dipankar Sarma
    Signed-Off-By: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dipankar Sarma