13 Oct, 2014

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle were:

    - changes related to No-CBs CPUs and NO_HZ_FULL

    - RCU-tasks implementation

    - torture-test updates

    - miscellaneous fixes

    - locktorture updates

    - RCU documentation updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
    workqueue: Use cond_resched_rcu_qs macro
    workqueue: Add quiescent state between work items
    locktorture: Cleanup header usage
    locktorture: Cannot hold read and write lock
    locktorture: Fix __acquire annotation for spinlock irq
    locktorture: Support rwlocks
    rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
    locktorture: Document boot/module parameters
    rcutorture: Rename rcutorture_runnable parameter
    locktorture: Add test scenario for rwsem_lock
    locktorture: Add test scenario for mutex_lock
    locktorture: Make torture scripting account for new _runnable name
    locktorture: Introduce torture context
    locktorture: Support rwsems
    locktorture: Add infrastructure for torturing read locks
    torture: Address race in module cleanup
    locktorture: Make statistics generic
    locktorture: Teach about lock debugging
    locktorture: Support mutexes
    locktorture: Add documentation
    ...

    Linus Torvalds
     

09 Oct, 2014

1 commit


08 Sep, 2014

1 commit

  • RCU-tasks requires the occasional voluntary context switch
    from CPU-bound in-kernel tasks. In some cases, this requires
    instrumenting cond_resched(). However, there is some reluctance
    to countenance unconditionally instrumenting cond_resched() (see
    http://lwn.net/Articles/603252/), so this commit creates a separate
    cond_resched_rcu_qs() that may be used in place of cond_resched() in
    locations prone to long-duration in-kernel looping.

    This commit currently instruments only RCU-tasks. Future possibilities
    include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
    IPI usage.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
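
    As a rough userspace analogy of the idea in this entry (not the kernel
    macro itself), the sketch below pairs "offer the CPU to someone else"
    with "record a quiescent state" inside a long-running loop. Every name
    here is hypothetical, sched_yield() merely stands in for cond_resched(),
    and the counter stands in for whatever bookkeeping RCU-tasks keeps.

```c
#include <sched.h>

/* Hypothetical stand-in for RCU-tasks bookkeeping: counts the
 * quiescent states reported so far. */
static unsigned long qs_reported;

/* Hypothetical analogue of noting a voluntary context switch. */
static void note_quiescent_state(void)
{
	qs_reported++;
}

/* Analogue of cond_resched_rcu_qs(): report a quiescent state, then
 * offer the CPU to other runnable threads. */
static void yield_and_note_qs(void)
{
	note_quiescent_state();
	sched_yield();
}

/* A long-running loop that reports a quiescent state between items.
 * Returns how many quiescent states were reported during this call. */
static unsigned long process_items(unsigned long nitems)
{
	unsigned long before = qs_reported;
	volatile unsigned long sink = 0;
	unsigned long i;

	for (i = 0; i < nitems; i++) {
		sink += i;		/* stand-in for real work */
		yield_and_note_qs();
	}
	return qs_reported - before;
}
```

    Keeping the reporting in a separate helper, rather than in the plain
    yield, mirrors the entry's point that only loops known to run for a
    long time need to pay for the extra bookkeeping.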
     

07 May, 2014

1 commit


13 Apr, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "The first vfs pile, with deep apologies for being very late in this
    window.

    Assorted cleanups and fixes, plus a large preparatory part of iov_iter
    work. There's a lot more of that, but it'll probably go into the next
    merge window - it *does* shape up nicely, removes a lot of
    boilerplate, gets rid of locking inconsistencies between aio_write and
    splice_write and I hope to get Kent's direct-io rewrite merged into
    the same queue, but some of the stuff after this point is having
    (mostly trivial) conflicts with the things already merged into
    mainline and with some I want more testing.

    This one passes LTP and xfstests without regressions, in addition to
    usual beating. BTW, readahead02 in ltp syscalls testsuite has started
    giving failures since "mm/readahead.c: fix readahead failure for
    memoryless NUMA nodes and limit readahead pages" - might be a false
    positive, might be a real regression..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    missing bits of "splice: fix racy pipe->buffers uses"
    cifs: fix the race in cifs_writev()
    ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
    kill generic_file_buffered_write()
    ocfs2_file_aio_write(): switch to generic_perform_write()
    ceph_aio_write(): switch to generic_perform_write()
    xfs_file_buffered_aio_write(): switch to generic_perform_write()
    export generic_perform_write(), start getting rid of generic_file_buffer_write()
    generic_file_direct_write(): get rid of ppos argument
    btrfs_file_aio_write(): get rid of ppos
    kill the 5th argument of generic_file_buffered_write()
    kill the 4th argument of __generic_file_aio_write()
    lustre: don't open-code kernel_recvmsg()
    ocfs2: don't open-code kernel_recvmsg()
    drbd: don't open-code kernel_recvmsg()
    constify blk_rq_map_user_iov() and friends
    lustre: switch to kernel_sendmsg()
    ocfs2: don't open-code kernel_sendmsg()
    take iov_iter stuff to mm/iov_iter.c
    process_vm_access: tidy up a bit
    ...

    Linus Torvalds
     

02 Apr, 2014

1 commit


01 Apr, 2014

1 commit

  • Pull RCU updates from Ingo Molnar:
    "Main changes:

    - Torture-test changes, including refactoring of rcutorture and
    introduction of a vestigial locktorture.

    - Real-time latency fixes.

    - Documentation updates.

    - Miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits)
    rcu: Provide grace-period piggybacking API
    rcu: Ensure kernel/rcu/rcu.h can be sourced/used stand-alone
    rcu: Fix sparse warning for rcu_expedited from kernel/ksysfs.c
    notifier: Substitute rcu_access_pointer() for rcu_dereference_raw()
    Documentation/memory-barriers.txt: Clarify release/acquire ordering
    rcutorture: Save kvm.sh output to log
    rcutorture: Add a lock_busted to test the test
    rcutorture: Place kvm-test-1-run.sh output into res directory
    rcutorture: Rename TREE_RCU-Kconfig.txt
    locktorture: Add kvm-recheck.sh plug-in for locktorture
    rcutorture: Gracefully handle NULL cleanup hooks
    locktorture: Add vestigial locktorture configuration
    rcutorture: Introduce "rcu" directory level underneath configs
    rcutorture: Rename kvm-test-1-rcu.sh
    rcutorture: Remove RCU dependencies from ver_functions.sh API
    rcutorture: Create CFcommon file for common Kconfig parameters
    rcutorture: Create config files for scripted test-the-test testing
    rcutorture: Add an rcu_busted to test the test
    locktorture: Add a lock-torture kernel module
    rcutorture: Abstract kvm-recheck.sh
    ...

    Linus Torvalds
     

23 Mar, 2014

1 commit

  • Commit bd2a31d522344 ("get rid of fget_light()") introduced the
    __fdget_pos() function, which returns the resulting file pointer and
    fdput flags combined in an 'unsigned long'. However, it also changed the
    behavior to return files with FMODE_PATH set, which shouldn't happen
    because read(), write(), lseek(), etc. aren't allowed on such files.
    This commit restores the old behavior.

    This regression actually had no effect on read() and write() since
    FMODE_READ and FMODE_WRITE are not set on file descriptors opened with
    O_PATH, but it did cause lseek() on a file descriptor opened with O_PATH
    to fail with ESPIPE rather than EBADF.

    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro

    Eric Biggers
     

10 Mar, 2014

1 commit

  • instead of returning the flags by reference, we can just have the
    low-level primitive return those in lower bits of unsigned long,
    with struct file * derived from the rest.

    Signed-off-by: Al Viro

    Al Viro
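
    The packing described in this entry can be illustrated with a small
    userspace sketch. It assumes that unsigned long is at least as wide as
    a pointer (true on Linux) and that the pointed-to object is aligned to
    at least 4 bytes, so the two low bits are free. The struct, flag and
    function names are hypothetical and are not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits carried in the low bits of the word. */
#define FLAG_NEED_PUT	1UL
#define FLAG_POS_UNLOCK	2UL
#define FLAG_MASK	3UL

_Static_assert(sizeof(unsigned long) >= sizeof(void *),
	       "sketch assumes unsigned long can hold a pointer");

struct obj {			/* stand-in for struct file */
	int payload;
};

struct ref {			/* unpacked form: pointer plus flags */
	struct obj *ptr;
	unsigned int flags;
};

/* Pack a pointer and up to two flag bits into a single word. */
static unsigned long pack_ref(struct obj *p, unsigned long flags)
{
	uintptr_t v = (uintptr_t)p;

	assert((v & FLAG_MASK) == 0);		/* low bits must be free */
	assert((flags & ~FLAG_MASK) == 0);	/* only known flags allowed */
	return (unsigned long)(v | flags);
}

/* Split the word back into the pointer and the flags. */
static struct ref unpack_ref(unsigned long v)
{
	struct ref r;

	r.ptr = (struct obj *)(uintptr_t)(v & ~FLAG_MASK);
	r.flags = (unsigned int)(v & FLAG_MASK);
	return r;
}
```

    Returning one word avoids an output parameter passed by reference; the
    caller unpacks it once and works with the pointer and flags separately.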
     

18 Feb, 2014

1 commit

  • (Trivial patch.)

    If the code is looking at the RCU-protected pointer itself, but not
    dereferencing it, the rcu_dereference() functions can be downgraded to
    rcu_access_pointer(). This commit makes this downgrade in __alloc_fd(),
    which simply compares the RCU-protected pointer against NULL with no
    dereferencing.

    Signed-off-by: Paul E. McKenney
    Cc: Alexander Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

11 Feb, 2014

1 commit

  • Recently, due to a spike in connections per second, memcached on 3
    separate boxes triggered the OOM killer from accept. At the time the
    OOM killer was triggered there was 4GB out of 36GB free in zone 1. The
    problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
    hold a bitmap, and there was sufficient fragmentation that the largest
    page available was 8KiB.

    I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
    but I do agree that order 3 allocations are very likely to succeed.

    There are always pathologies where order > 0 allocations can fail when
    there are copious amounts of free memory available. Using the pigeon
    hole principle it is easy to show that it requires 1 page more than 50%
    of the pages being free to guarantee an order 1 (8KiB) allocation will
    succeed, 1 page more than 75% of the pages being free to guarantee an
    order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of
    the pages being free to guarantee an order 3 allocation will succeed.

    A server churning memory with a lot of small requests and replies, like
    memcached, is a common case that, if anything, will skew the odds
    against large pages being available.

    Therefore let's not give external applications a practical way to kill
    Linux server applications, and specify __GFP_NORETRY to the kmalloc in
    alloc_fdmem. Unless I am misreading the code, by the time the code
    reaches should_alloc_retry in __alloc_pages_slowpath (where
    __GFP_NORETRY becomes significant) we have already tried everything
    reasonable to allocate a page and the only thing left to do is wait. So
    not waiting and falling back to vmalloc immediately seems like the
    reasonable thing to do even if there wasn't a chance of triggering the
    OOM killer.

    Signed-off-by: "Eric W. Biederman"
    Cc: Eric Dumazet
    Acked-by: David Rientjes
    Cc: Cong Wang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
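
    The pigeonhole figures in this entry can be reproduced with a few lines
    of arithmetic. The sketch assumes 4 KiB pages, naturally aligned blocks
    of 2^order pages, and a total page count that is a multiple of the block
    size; the function names are hypothetical.

```c
#include <stddef.h>

/* Size in bytes of an order-n allocation, assuming 4 KiB pages. */
static size_t order_bytes(unsigned int order)
{
	return (size_t)4096 << order;
}

/*
 * Smallest number of free pages that guarantees at least one completely
 * free, naturally aligned block of 2^order pages.
 *
 * Worst case: every block has exactly one page in use, which leaves
 * total_pages - blocks pages free without a single free block. One more
 * free page forces some block to be entirely free.
 */
static size_t min_free_pages_for_order(size_t total_pages, unsigned int order)
{
	size_t blocks = total_pages >> order;

	return total_pages - blocks + 1;
}
```

    With 1024 pages this gives 513, 769 and 897 free pages for orders 1, 2
    and 3, that is, one page more than 50%, 75% and 87.5% of memory.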
     

25 Jan, 2014

5 commits

  • The slow path in __fget_light() can use __fget() to avoid the
    code duplication. Saves 232 bytes.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Al Viro

    Oleg Nesterov
     
  • Apart from the FMODE_PATH check, fget_light() and fget_raw_light() are
    identical; shift the code into the new helper, __fget_light(fd, mask).
    Saves 208 bytes.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Al Viro

    Oleg Nesterov
     
  • Apart from the FMODE_PATH check, fget() and fget_raw() are identical;
    shift the code into the new simple helper, __fget(fd, mask). Saves
    160 bytes.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Al Viro

    Oleg Nesterov
     
  • put_files_struct() and close_files() do rcu_read_lock() to make
    rcu_dereference_check_fdtable() happy.

    This looks a bit ugly, files_fdtable() just reads the pointer,
    we can simply use rcu_dereference_raw() to avoid the warning.

    The patch also changes close_files() to return fdt, this avoids
    another rcu_read_lock()/files_fdtable() in put_files_struct().

    I think close_files() needs more cleanups:

    - we do not need xchg() exactly because we are the last
    user of this files_struct

    - "if (file)" should be turned into WARN_ON(!file)

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Al Viro

    Oleg Nesterov
     
  • rcu_dereference_check_fdtable() looks very wrong:

    1. rcu_my_thread_group_empty() was added by 844b9a8707f1 "vfs: fix
    RCU-lockdep false positive due to /proc" but it doesn't really
    fix the problem. A CLONE_THREAD (without CLONE_FILES) task can
    hit the same race with get_files_struct().

    And otoh rcu_my_thread_group_empty() can suppress the correct
    warning if the caller is the CLONE_FILES (without CLONE_THREAD)
    task.

    2. files->count == 1 check is not really right either. Even if this
    files_struct is not shared it is not safe to access it lockless
    unless the caller is the owner.

    Otoh, this check is sub-optimal. files->count == 0 always means
    it is safe to use it lockless even if files != current->files,
    but put_files_struct() has to take rcu_read_lock(). See the next
    patch.

    This patch removes the buggy checks and turns fcheck_files() into
    __fcheck_files() which uses rcu_dereference_raw(), the "unshared"
    callers, fget_light() and fget_raw_light(), can use it to avoid
    the warning from RCU-lockdep.

    fcheck_files() is trivially reimplemented as rcu_lockdep_assert()
    plus __fcheck_files().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Al Viro

    Oleg Nesterov
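
    The "one helper plus a mask" consolidation described for __fget() and
    __fget_light() above can be sketched in userspace as follows. This is
    only an illustration of the pattern: the table, the struct and the
    MODE_PATH value are hypothetical, and the reference counting and RCU
    protection that the real lookups need are left out entirely.

```c
#include <stddef.h>

#define MODE_PATH	0x1u	/* hypothetical stand-in for FMODE_PATH */
#define TABLE_SIZE	4

struct mock_file {
	unsigned int mode;
};

/* Hypothetical descriptor table; NULL means the slot is unused. */
static struct mock_file *fd_table[TABLE_SIZE];

/* Shared helper: look up fd and reject files whose mode intersects mask. */
static struct mock_file *lookup_fd(unsigned int fd, unsigned int mask)
{
	struct mock_file *f;

	if (fd >= TABLE_SIZE)
		return NULL;
	f = fd_table[fd];
	if (f && (f->mode & mask))
		return NULL;
	return f;
}

/* Regular lookup: path-only files are filtered out. */
static struct mock_file *get_file(unsigned int fd)
{
	return lookup_fd(fd, MODE_PATH);
}

/* Raw lookup: an empty mask accepts every file. */
static struct mock_file *get_file_raw(unsigned int fd)
{
	return lookup_fd(fd, 0);
}
```

    The two public entry points differ only in the mask they pass, which is
    why folding them into one helper removes the duplicated body.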
     

02 May, 2013

1 commit


19 Feb, 2013

1 commit


04 Jan, 2013

1 commit

  • CONFIG_HOTPLUG is going away as an option. As a result, the __dev*
    markings need to be removed.

    This change removes the last of the __dev* markings from the kernel from
    a variety of different, tiny, places.

    Based on patches originally written by Bill Pemberton, but redone by me
    in order to handle some of the coding style issues better, by hand.

    Cc: Bill Pemberton
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

13 Dec, 2012

1 commit

  • Pull big execve/kernel_thread/fork unification series from Al Viro:
    "All architectures are converted to new model. Quite a bit of that
    stuff is actually shared with architecture trees; in such cases it's
    literally shared branch pulled by both, not a cherry-pick.

    A lot of ugliness and black magic is gone (-3KLoC total in this one):

    - kernel_thread()/kernel_execve()/sys_execve() redesign.

    We don't do syscalls from kernel anymore for either kernel_thread()
    or kernel_execve():

    kernel_thread() is essentially clone(2) with callback run before we
    return to userland, the callbacks either never return or do
    successful do_execve() before returning.

    kernel_execve() is a wrapper for do_execve() - it doesn't need to
    do transition to user mode anymore.

    As a result kernel_thread() and kernel_execve() are
    arch-independent now - they live in kernel/fork.c and fs/exec.c
    resp. sys_execve() is also in fs/exec.c and it's completely
    architecture-independent.

    - daemonize() is gone, along with its parts in fs/*.c

    - struct pt_regs * is no longer passed to do_fork/copy_process/
    copy_thread/do_execve/search_binary_handler/->load_binary/do_coredump.

    - sys_fork()/sys_vfork()/sys_clone() unified; some architectures
    still need wrappers (ones with callee-saved registers not saved in
    pt_regs on syscall entry), but the main part of those suckers is in
    kernel/fork.c now."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (113 commits)
    do_coredump(): get rid of pt_regs argument
    print_fatal_signal(): get rid of pt_regs argument
    ptrace_signal(): get rid of unused arguments
    get rid of ptrace_signal_deliver() arguments
    new helper: signal_pt_regs()
    unify default ptrace_signal_deliver
    flagday: kill pt_regs argument of do_fork()
    death to idle_regs()
    don't pass regs to copy_process()
    flagday: don't pass regs to copy_thread()
    bfin: switch to generic vfork, get rid of pointless wrappers
    xtensa: switch to generic clone()
    openrisc: switch to use of generic fork and clone
    unicore32: switch to generic clone(2)
    score: switch to generic fork/vfork/clone
    c6x: sanitize copy_thread(), get rid of clone(2) wrapper, switch to generic clone()
    take sys_fork/sys_vfork/sys_clone prototypes to linux/syscalls.h
    mn10300: switch to generic fork/vfork/clone
    h8300: switch to generic fork/vfork/clone
    tile: switch to generic clone()
    ...

    Conflicts:
    arch/microblaze/include/asm/Kbuild

    Linus Torvalds
     

30 Nov, 2012

1 commit


29 Nov, 2012

1 commit


19 Nov, 2012

1 commit


12 Nov, 2012

1 commit

  • It can be legitimately triggered via procfs access. Now, at least
    2 of 3 of get_files_struct() callers in procfs are useless, but
    when and if we get rid of those we can always add WARN_ON() here.
    BUG_ON() at that spot is simply wrong.

    Signed-off-by: Al Viro

    Al Viro
     

31 Oct, 2012

1 commit

  • Jack Lin reports that the error return from dup3() for the RLIMIT_NOFILE
    case changed incorrectly after 3.6.

    The culprit is commit f33ff9927f42 ("take rlimit check to callers of
    expand_files()") which when it moved the "return -EMFILE" out to the
    caller, didn't notice that the dup3() had special code to turn the
    EMFILE return into EBADF.

    The replace_fd() helper that got added later then inherited the bug too.

    Reported-by: Jack Lin
    Signed-off-by: Al Viro
    [ Noted more bugs, wrote proper changelog, fixed up typos - Linus ]
    Signed-off-by: Linus Torvalds

    Al Viro
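
    One way to observe the error code discussed in this entry from userspace
    is the probe below. It is a sketch for Linux only: it asks for a new
    descriptor number equal to the soft RLIMIT_NOFILE limit and reports the
    resulting errno. The helper name is hypothetical, and the system call is
    invoked through syscall() so that the result reflects the kernel's
    answer directly.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Call dup3() with newfd equal to the soft RLIMIT_NOFILE limit.
 * Returns the errno of the failed call, 0 if the call unexpectedly
 * succeeded, or -1 if the probe could not be set up.
 */
static int dup3_errno_at_rlimit(void)
{
	struct rlimit rl;
	long ret;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return -1;
	if (rl.rlim_cur > (rlim_t)INT_MAX)
		return -1;	/* limit does not fit in a descriptor number */

	ret = syscall(SYS_dup3, STDIN_FILENO, (int)rl.rlim_cur, 0);
	if (ret >= 0) {
		close((int)ret);
		return 0;
	}
	return errno;
}
```

    On a kernel that contains this fix the probe is expected to report
    EBADF; the regression described above showed up as EMFILE instead.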
     

10 Oct, 2012

1 commit

  • I have tested the attached patch to fix the dup3 regression.

    Rich.

    From 0944e30e12dec6544b3602626b60ff412375c78f Mon Sep 17 00:00:00 2001
    From: "Richard W.M. Jones"
    Date: Tue, 9 Oct 2012 14:42:45 +0100
    Subject: [PATCH] dup3: Return an error when oldfd == newfd.

    The following commit:

    commit fe17f22d7fd0e344ef6447238f799bb49f670c6f
    Author: Al Viro
    Date: Tue Aug 21 11:48:11 2012 -0400

    take purely descriptor-related stuff from fcntl.c to file.c

    was supposed to be just code motion, but it dropped the following two
    lines:

    if (unlikely(oldfd == newfd))
            return -EINVAL;

    from the dup3 system call. dup3 is not specified by POSIX, so Linux
    can do what it likes. However the POSIX proposal for dup3 [1] states
    that it should return an error if oldfd == newfd.

    [1] http://austingroupbugs.net/view.php?id=411

    Signed-off-by: Richard W.M. Jones
    Tested-by: Richard W.M. Jones
    Signed-off-by: Al Viro

    Richard W.M. Jones
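
    The restored check can be exercised from userspace with a minimal probe
    such as the one below, assuming a Linux system. The helper name is
    hypothetical; it passes the same descriptor as both oldfd and newfd and
    returns the errno the kernel reports.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Invoke dup3() with oldfd == newfd.
 * Returns the errno of the failed call, or 0 if the call succeeded.
 */
static int dup3_same_fd_errno(int fd)
{
	long ret = syscall(SYS_dup3, fd, fd, 0);

	return ret == -1 ? errno : 0;
}
```

    With the two dropped lines back in place, dup3_same_fd_errno(STDIN_FILENO)
    should yield EINVAL, whereas dup2() with equal descriptors is a no-op
    that returns the descriptor.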
     

27 Sep, 2012

14 commits