02 May, 2013

2 commits

  • Commit f91eb62f71b3 ("init: scream bloody murder if interrupts are
    enabled too early") added three new warnings. The first two seemed
    reasonable, but the third also fired when an initcall returned
    non-zero. Although that third WARN() does catch an imbalanced preempt
    disable or irqs-disable, it shouldn't warn when all that happened is
    an initcall returning non-zero.

    In fact, according to Linus, it shouldn't print at all, as it only
    prints with initcall_debug set, and that already shows enough
    information to fix things.

    Link: http://lkml.kernel.org/r/CA+55aFzaBC5SFi7=F2mfm+KWY5qTsBmOqgbbs8E+LUS8JK-sBg@mail.gmail.com

    Suggested-by: Linus Torvalds
    Reported-by: Konrad Rzeszutek Wilk
    Signed-off-by: Steven Rostedt
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • Pull omap3isp clk support from Mauro Carvalho Chehab:
    "This patch were sent in separate as it depends on a merge from clock
    framework, that you merged in commit 362ed48dee50"

    * 'topic/omap3isp' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
    [media] omap3isp: Use the common clock framework

    Linus Torvalds
     

01 May, 2013

38 commits

  • Merge IPC cleanup and scalability patches from Andrew Morton.

    This cleans up many of the oddities in the IPC code, uses the list
    iterator helpers, splits out locking and adds per-semaphore locks for
    greater scalability of the IPC semaphore code.

    Most normal user-level locking by now uses futexes (ie pthreads, but
    also a lot of specialized locks), but SysV IPC semaphores are apparently
    still used in some big applications, either for portability reasons, or
    because they offer tracking and undo (and you don't need to have a
    special shared memory area for them).

    Our IPC semaphore scalability was pitiful. We used to lock much too big
    ranges, and we used to have a single ipc lock per ipc semaphore array.
    Most loads never cared, but some do. There are some numbers in the
    individual commits.

    * ipc-scalability:
    ipc: sysv shared memory limited to 8TiB
    ipc/msg.c: use list_for_each_entry_[safe] for list traversing
    ipc,sem: fine grained locking for semtimedop
    ipc,sem: have only one list in struct sem_queue
    ipc,sem: open code and rename sem_lock
    ipc,sem: do not hold ipc lock more than necessary
    ipc: introduce lockless pre_down ipcctl
    ipc: introduce obtaining a lockless ipc object
    ipc: remove bogus lock comment for ipc_checkid
    ipc/msgutil.c: use linux/uaccess.h
    ipc: refactor msg list search into separate function
    ipc: simplify msg list search
    ipc: implement MSG_COPY as a new receive mode
    ipc: remove msg handling from queue scan
    ipc: set EFAULT as default error in load_msg()
    ipc: tighten msg copy loops
    ipc: separate msg allocation from userspace copy
    ipc: clamp with min()

    Linus Torvalds
     
  • Trying to run an application that put data into half of memory using
    shmget(), we found that a shmall value below 8EiB-8TiB would prevent
    us from using anything more than 8TiB; setting kernel.shmall greater
    than 8EiB-8TiB made the job work.

    The culprit is in the newseg() function, where ns->shm_tot is an int
    that, at 8TiB worth of pages, exceeds INT_MAX.

    ipc/shm.c:
    458 static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
    459 {
        ...
    465         int numpages = (size + PAGE_SIZE -1) >> PAGE_SHIFT;
        ...
    474         if (ns->shm_tot + numpages > ns->shm_ctlall)
    475                 return -ENOSPC;

    [akpm@linux-foundation.org: make ipc/shm.c:newseg()'s numpages size_t, not int]
    Signed-off-by: Robin Holt
    Reported-by: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     
  • The ipc/msg.c code does its list operations by hand, open-coding the
    accesses instead of using list_for_each_entry_[safe].

    Signed-off-by: Nikola Pajkovsky
    Cc: Stanislav Kinsbursky
    Cc: "Eric W. Biederman"
    Cc: Peter Hurley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikola Pajkovsky
     
  • Introduce finer grained locking for semtimedop, to handle the common case
    of a program wanting to manipulate one semaphore from an array with
    multiple semaphores.

    If the call is a semop manipulating just one semaphore in an array with
    multiple semaphores, only take the lock for that semaphore itself.

    If the call needs to manipulate multiple semaphores, or another caller is
    in a transaction that manipulates multiple semaphores, the sem_array lock
    is taken, as well as all the locks for the individual semaphores.

    On a 24 CPU system, performance numbers with the semop-multi test
    with N threads and N semaphores look like this:

    threads    vanilla    Davidlohr's    Davidlohr's +     Davidlohr's +
                          patches        rwlock patches    v3 patches
    10          610652         726325          1783589           2142206
    20          341570         365699          1520453           1977878
    30          288102         307037          1498167           2037995
    40          290714         305955          1612665           2256484
    50          288620         312890          1733453           2650292
    60          289987         306043          1649360           2388008
    70          291298         306347          1723167           2717486
    80          290948         305662          1729545           2763582
    90          290996         306680          1736021           2757524
    100         292243         306700          1773700           3059159

    [davidlohr.bueso@hp.com: do not call sem_lock when bogus sma]
    [davidlohr.bueso@hp.com: make refcounter atomic]
    Signed-off-by: Rik van Riel
    Suggested-by: Linus Torvalds
    Acked-by: Davidlohr Bueso
    Cc: Chegu Vinod
    Cc: Jason Low
    Reviewed-by: Michel Lespinasse
    Cc: Peter Hurley
    Cc: Stanislav Kinsbursky
    Tested-by: Emmanuel Benisty
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Having only one list in struct sem_queue, and only queueing simple
    semaphore operations on the list for the semaphore involved, allows us to
    introduce finer grained locking for semtimedop.

    Signed-off-by: Rik van Riel
    Acked-by: Davidlohr Bueso
    Cc: Chegu Vinod
    Cc: Emmanuel Benisty
    Cc: Jason Low
    Cc: Michel Lespinasse
    Cc: Peter Hurley
    Cc: Stanislav Kinsbursky
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Rename sem_lock() to sem_obtain_lock(), so we can introduce a sem_lock()
    later that only locks the sem_array and does nothing else.

    Open code the locking from ipc_lock() in sem_obtain_lock() so we can
    introduce finer grained locking for the sem_array in the next patch.

    [akpm@linux-foundation.org: propagate the ipc_obtain_object() errno out of sem_obtain_lock()]
    Signed-off-by: Rik van Riel
    Acked-by: Davidlohr Bueso
    Cc: Chegu Vinod
    Cc: Emmanuel Benisty
    Cc: Jason Low
    Cc: Michel Lespinasse
    Cc: Peter Hurley
    Cc: Stanislav Kinsbursky
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Instead of holding the ipc lock for permissions and security checks, among
    others, only acquire it when necessary.

    Some numbers....

    1) With Rik's semop-multi.c microbenchmark we can see the following
    results:

    Baseline (3.9-rc1):
    cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
    total operations: 151452270, ops/sec 5048409

    + 59.40% a.out [kernel.kallsyms] [k] _raw_spin_lock
    + 6.14% a.out [kernel.kallsyms] [k] sys_semtimedop
    + 3.84% a.out [kernel.kallsyms] [k] avc_has_perm_flags
    + 3.64% a.out [kernel.kallsyms] [k] __audit_syscall_exit
    + 2.06% a.out [kernel.kallsyms] [k] copy_user_enhanced_fast_string
    + 1.86% a.out [kernel.kallsyms] [k] ipc_lock

    With this patchset:
    cpus 4, threads: 256, semaphores: 128, test duration: 30 secs
    total operations: 273156400, ops/sec 9105213

    + 18.54% a.out [kernel.kallsyms] [k] _raw_spin_lock
    + 11.72% a.out [kernel.kallsyms] [k] sys_semtimedop
    + 7.70% a.out [kernel.kallsyms] [k] ipc_has_perm.isra.21
    + 6.58% a.out [kernel.kallsyms] [k] avc_has_perm_flags
    + 6.54% a.out [kernel.kallsyms] [k] __audit_syscall_exit
    + 4.71% a.out [kernel.kallsyms] [k] ipc_obtain_object_check

    2) On an Oracle swingbench DSS (data mining) workload the
    improvements are not as exciting as with Rik's benchmark, but we can
    still see some positive numbers. For an 8-socket machine, these are
    the percentages of %sys time spent in the ipc lock:

    Baseline (3.9-rc1):
    100 swingbench users: 8.74%
    400 swingbench users: 21.86%
    800 swingbench users: 84.35%

    With this patchset:
    100 swingbench users: 8.11%
    400 swingbench users: 19.93%
    800 swingbench users: 77.69%

    [riel@redhat.com: fix two locking bugs]
    [sasha.levin@oracle.com: prevent releasing RCU read lock twice in semctl_main]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Rik van Riel
    Reviewed-by: Chegu Vinod
    Acked-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Jason Low
    Cc: Emmanuel Benisty
    Cc: Peter Hurley
    Cc: Stanislav Kinsbursky
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Various forms of ipc use ipcctl_pre_down() to retrieve an ipc object and
    check permissions, mostly for IPC_RMID and IPC_SET commands.

    Introduce ipcctl_pre_down_nolock(), a lockless version of this function.
    The locking version is retained, yet modified to call the nolock version
    without affecting its semantics, thus transparent to all ipc callers.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Rik van Riel
    Suggested-by: Linus Torvalds
    Cc: Chegu Vinod
    Cc: Emmanuel Benisty
    Cc: Jason Low
    Cc: Michel Lespinasse
    Cc: Peter Hurley
    Cc: Stanislav Kinsbursky
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Through ipc_lock() and therefore ipc_lock_check() we currently return the
    locked ipc object. This is not necessary for all situations and can,
    therefore, cause unnecessary ipc lock contention.

    Introduce analogous ipc_obtain_object() and ipc_obtain_object_check()
    functions that only lookup and return the ipc object.

    Both these functions must be called within an RCU read-side critical
    section.

    [akpm@linux-foundation.org: propagate the ipc_obtain_object() errno from ipc_lock()]
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Rik van Riel
    Reviewed-by: Chegu Vinod
    Acked-by: Michel Lespinasse
    Cc: Emmanuel Benisty
    Cc: Jason Low
    Cc: Peter Hurley
    Cc: Stanislav Kinsbursky
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This series makes the sysv semaphore code more scalable, by reducing the
    time the semaphore lock is held, and making the locking more scalable for
    semaphore arrays with multiple semaphores.

    The first four patches were written by Davidlohr Bueso, and reduce
    the hold time of the semaphore lock.

    The last three patches change the sysv semaphore code locking to be more
    fine grained, providing a performance boost when multiple semaphores in a
    semaphore array are being manipulated simultaneously.

    On a 24 CPU system, performance numbers with the semop-multi test
    with N threads and N semaphores look like this:

    threads    vanilla    Davidlohr's    Davidlohr's +     Davidlohr's +
                          patches        rwlock patches    v3 patches
    10          610652         726325          1783589           2142206
    20          341570         365699          1520453           1977878
    30          288102         307037          1498167           2037995
    40          290714         305955          1612665           2256484
    50          288620         312890          1733453           2650292
    60          289987         306043          1649360           2388008
    70          291298         306347          1723167           2717486
    80          290948         305662          1729545           2763582
    90          290996         306680          1736021           2757524
    100         292243         306700          1773700           3059159

    This patch:

    There is no reason to be holding the ipc lock while reading ipcp->seq,
    hence remove misleading comment.

    Also simplify the return value for the function.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Rik van Riel
    Cc: Chegu Vinod
    Cc: Emmanuel Benisty
    Cc: Jason Low
    Cc: Michel Lespinasse
    Cc: Peter Hurley
    Cc: Stanislav Kinsbursky
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Signed-off-by: HoSung Jung
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    HoSung Jung
     
  • [fengguang.wu@intel.com: find_msg can be static]
    Signed-off-by: Peter Hurley
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Hurley
     
  • Signed-off-by: Peter Hurley
    Acked-by: Stanislav Kinsbursky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Hurley
     
  • Teach the helper routines about MSG_COPY so that msgtyp is preserved as
    the message number to copy.

    The security functions affected by this change were audited and no
    additional changes are necessary.

    Signed-off-by: Peter Hurley
    Acked-by: Stanislav Kinsbursky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Hurley
     
  • In preparation for refactoring the queue scan into a separate
    function, relocate msg copying.

    Signed-off-by: Peter Hurley
    Acked-by: Stanislav Kinsbursky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Hurley
     
  • Signed-off-by: Peter Hurley
    Acked-by: Stanislav Kinsbursky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Hurley
     
  • Signed-off-by: Peter Hurley
    Acked-by: Stanislav Kinsbursky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Hurley
     
  • Separating the msg allocation from the userspace copy enables a
    single-block vmalloc allocation instead.

    Signed-off-by: Peter Hurley
    Acked-by: Stanislav Kinsbursky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Hurley
     
  • Signed-off-by: Peter Hurley
    Acked-by: Stanislav Kinsbursky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Hurley
     
  • Pull ext4 updates from Ted Ts'o:
    "Mostly performance and bug fixes, plus some cleanups. The one new
    feature this merge window is a new ioctl EXT4_IOC_SWAP_BOOT which
    allows installation of a hidden inode designed for boot loaders."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (50 commits)
    ext4: fix type-widening bug in inode table readahead code
    ext4: add check for inodes_count overflow in new resize ioctl
    ext4: fix Kconfig documentation for CONFIG_EXT4_DEBUG
    ext4: fix online resizing for ext3-compat file systems
    jbd2: trace when lock_buffer in do_get_write_access takes a long time
    ext4: mark metadata blocks using bh flags
    buffer: add BH_Prio and BH_Meta flags
    ext4: mark all metadata I/O with REQ_META
    ext4: fix readdir error in case inline_data+^dir_index.
    ext4: fix readdir error in the case of inline_data+dir_index
    jbd2: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
    ext4: mext_insert_extents should update extent block checksum
    ext4: move quota initialization out of inode allocation transaction
    ext4: reserve xattr index for Rich ACL support
    jbd2: reduce journal_head size
    ext4: clear buffer_uninit flag when submitting IO
    ext4: use io_end for multiple bios
    ext4: make ext4_bio_write_page() use BH_Async_Write flags
    ext4: Use kstrtoul() instead of parse_strtoul()
    ext4: defragmentation code cleanup
    ...

    Linus Torvalds
     
  • Pull dma-buf updates from Sumit Semwal:
    "Added debugfs support to dma-buf"

    * tag 'tag-for-linus-3.10' of git://git.linaro.org/people/sumitsemwal/linux-dma-buf:
    dma-buf: Add debugfs support
    dma-buf: replace dma_buf_export() with dma_buf_export_named()

    Linus Torvalds
     
  • Pull Hexagon fixes from Richard Kuo:
    "Changes for the Hexagon architecture (and one touching OpenRISC).

    They include various fixes to make use of additional arch features and
    cleanups. The largest functional change is a cleanup of the signal
    and event return paths"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rkuo/linux-hexagon-kernel: (32 commits)
    Hexagon: add v4 CS regs to core copyout macro
    Hexagon: use correct translation for VMALLOC_START
    Hexagon: use correct translations for DMA mappings
    Hexagon: fix return value for notify_resume case in do_work_pending
    Hexagon: fix signal number for user mem faults
    Hexagon: remove two Kconfig entries
    arch: remove CONFIG_GENERIC_FIND_NEXT_BIT again
    Hexagon: update copyright dates
    Hexagon: add translation types for __vmnewmap
    Hexagon: fix signal.c compile error
    Hexagon: break up user fn/arg register setting
    Hexagon: use generic sys_fork, sys_vfork, and sys_clone
    Hexagon: fix psp/sp macro
    Hexagon: fix up int enable/disable at ret_from_fork
    Hexagon: add IOMEM and _relaxed IO macros
    Hexagon: switch to using the device type for IO mappings
    Hexagon: don't print info for offline CPU's
    Hexagon: add support for single-stepping (v4+)
    Hexagon: use correct work mask when checking for more work
    Hexagon: add support for additional exceptions
    ...

    Linus Torvalds
     
  • We first tried to avoid updating atime/mtime entirely (commit
    b0de59b5733d: "TTY: do not update atime/mtime on read/write"), and then
    limited it to only update it occasionally (commit 37b7f3c76595: "TTY:
    fix atime/mtime regression"), but it turns out that this was both
    insufficient and overkill.

    It was insufficient because we let people attach to the shared ptmx node
    to see activity without even reading atime/mtime, and it was overkill
    because the "only once a minute" means that you can't really tell an
    idle person from an active one with 'w'.

    So this tries to fix the problem properly. It marks the shared ptmx
    node as un-notifiable, and it lowers the "only once a minute" to a few
    seconds instead - still long enough that you can't time individual
    keystrokes, but short enough that you can tell whether somebody is
    active or not.

    Reported-by: Simon Kirby
    Acked-by: Jiri Slaby
    Cc: Greg Kroah-Hartman
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull compat cleanup from Al Viro:
    "Mostly about syscall wrappers this time; there will be another pile
    with patches in the same general area from various people, but I'd
    rather push those after both that and vfs.git pile are in."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    syscalls.h: slightly reduce the jungles of macros
    get rid of union semop in sys_semctl(2) arguments
    make do_mremap() static
    sparc: no need to sign-extend in sync_file_range() wrapper
    ppc compat wrappers for add_key(2) and request_key(2) are pointless
    x86: trim sys_ia32.h
    x86: sys32_kill and sys32_mprotect are pointless
    get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
    merge compat sys_ipc instances
    consolidate compat lookup_dcookie()
    convert vmsplice to COMPAT_SYSCALL_DEFINE
    switch getrusage() to COMPAT_SYSCALL_DEFINE
    switch epoll_pwait to COMPAT_SYSCALL_DEFINE
    convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
    switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
    make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
    make HAVE_SYSCALL_WRAPPERS unconditional
    consolidate cond_syscall and SYSCALL_ALIAS declarations
    teach SYSCALL_DEFINE how to deal with long long/unsigned long long
    get rid of duplicate logics in __SC_....[1-6] definitions

    Linus Torvalds
     
  • Add debugfs support to make it easier to print debug information
    about the dma-buf buffers.

    Cc: Dave Airlie
    [minor fixes on init and warning fix]
    Cc: Dan Carpenter
    [remove double unlock in fail case]
    Signed-off-by: Sumit Semwal

    Sumit Semwal
     
  • For debugging purposes, it is useful to have a name-string added
    while exporting buffers. Hence, dma_buf_export() is replaced with
    dma_buf_export_named(), which additionally takes 'exp_name' as a
    parameter.

    For backward compatibility, and for lazy exporters who don't wish to
    name themselves, a #define dma_buf_export() is also made available,
    which adds a __FILE__ instead of 'exp_name'.

    Cc: Daniel Vetter
    [Thanks for the idea!]
    Reviewed-by: Daniel Vetter
    Signed-off-by: Sumit Semwal

    Sumit Semwal
     
  • Signed-off-by: Richard Kuo

    Richard Kuo
     
  • Signed-off-by: Richard Kuo

    Richard Kuo
     
  • With physical offsets, pa<->va translations aren't just based on
    PAGE_OFFSET anymore.

    Signed-off-by: Richard Kuo

    Richard Kuo
     
  • Signed-off-by: Richard Kuo

    Richard Kuo
     
  • Signed-off-by: Richard Kuo

    Richard Kuo
     
  • The Kconfig entries for HEXAGON_VM and HEXAGON_ANGEL_TRAPS were added,
    together with the configuration and makefiles for the Hexagon
    architecture, in v3.2. They have never been used. They can safely be
    removed.

    Signed-off-by: Paul Bolle
    [rkuo@codeaurora.org: adjust for line changes in Kconfig]
    Signed-off-by: Richard Kuo

    Paul Bolle
     
  • CONFIG_GENERIC_FIND_NEXT_BIT was removed in v3.0, but reappeared in two
    architectures. Remove it again.

    Signed-off-by: Paul Bolle
    Acked-by: Jonas Bonn
    Signed-off-by: Richard Kuo

    Paul Bolle
     
  • Signed-off-by: Richard Kuo

    Richard Kuo
     
  • Signed-off-by: Richard Kuo

    Richard Kuo
     
  • Signed-off-by: Richard Kuo

    Richard Kuo
     
  • Signed-off-by: Richard Kuo

    Richard Kuo
     
  • Signed-off-by: Richard Kuo

    Richard Kuo