25 May, 2016

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Fix a number of bugs, most notably a potential stale data exposure
    after a crash and a potential BUG_ON crash if a file has the data
    journalling flag enabled while it has dirty delayed allocation blocks
    that haven't been written yet. Also fix a potential crash in the new
    project quota code and a crash when handling a maliciously corrupted
    file system.

    In addition, fix some DAX-specific bugs, including when there is a
    transient ENOSPC situation and races between writes via direct I/O and
    an mmap'ed segment that could lead to lost I/O.

    Finally the usual set of miscellaneous cleanups"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
    ext4: pre-zero allocated blocks for DAX IO
    ext4: refactor direct IO code
    ext4: fix race in transient ENOSPC detection
    ext4: handle transient ENOSPC properly for DAX
    dax: call get_blocks() with create == 1 for write faults to unwritten extents
    ext4: remove unmeetable inconsistency check from ext4_find_extent()
    jbd2: remove excess descriptions for handle_s
    ext4: remove unnecessary bio get/put
    ext4: silence UBSAN in ext4_mb_init()
    ext4: address UBSAN warning in mb_find_order_for_block()
    ext4: fix oops on corrupted filesystem
    ext4: fix check of dqget() return value in ext4_ioctl_setproject()
    ext4: clean up error handling when orphan list is corrupted
    ext4: fix hang when processing corrupted orphaned inode list
    ext4: remove trailing \n from ext4_warning/ext4_error calls
    ext4: fix races between changing inode journal mode and ext4_writepages
    ext4: handle unwritten or delalloc buffers before enabling data journaling
    ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()
    ext4: do not ask jbd2 to write data for delalloc buffers
    jbd2: add support for avoiding data writes during transaction commits
    ...

    Linus Torvalds
     

17 May, 2016

2 commits

  • Pull scheduler updates from Ingo Molnar:

    - massive CPU hotplug rework (Thomas Gleixner)

    - improve migration fairness (Peter Zijlstra)

    - CPU load calculation updates/cleanups (Yuyang Du)

    - cpufreq updates (Steve Muckle)

    - nohz optimizations (Frederic Weisbecker)

    - switch_mm() micro-optimization on x86 (Andy Lutomirski)

    - ... lots of other enhancements, fixes and cleanups.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (66 commits)
    ARM: Hide finish_arch_post_lock_switch() from modules
    sched/core: Provide a tsk_nr_cpus_allowed() helper
    sched/core: Use tsk_cpus_allowed() instead of accessing ->cpus_allowed
    sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systems
    sched/fair: Correct unit of load_above_capacity
    sched/fair: Clean up scale confusion
    sched/nohz: Fix affine unpinned timers mess
    sched/fair: Fix fairness issue on migration
    sched/core: Kill sched_class::task_waking to clean up the migration logic
    sched/fair: Prepare to fix fairness problems on migration
    sched/fair: Move record_wakee()
    sched/core: Fix comment typo in wake_q_add()
    sched/core: Remove unused variable
    sched: Make hrtick_notifier an explicit call
    sched/fair: Make ilb_notifier an explicit call
    sched/hotplug: Make activate() the last hotplug step
    sched/hotplug: Move migration CPU_DYING to sched_cpu_dying()
    sched/migration: Move CPU_ONLINE into scheduler state
    sched/migration: Move calc_load_migrate() into CPU_DYING
    sched/migration: Move prepare transition to SCHED_STARTING state
    ...

    Linus Torvalds
     
  • Pull support for killable rwsems from Ingo Molnar:
    "This, by Michal Hocko, implements down_write_killable().

    The main usecase will be to update mm_sem usage sites to use this new
    API, to allow the mm-reaper introduced in commit aac453635549 ("mm,
    oom: introduce oom reaper") to tear down oom victim address spaces
    asynchronously with minimum latencies and without deadlock worries"

    [ The vfs will want it too as the inode lock is changed from a mutex to
    a rwsem due to the parallel lookup and readdir updates ]

    * 'locking-rwsem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/rwsem: Fix comment on register clobbering
    locking/rwsem: Fix down_write_killable()
    locking/rwsem, x86: Add frame annotation for call_rwsem_down_write_failed_killable()
    locking/rwsem: Provide down_write_killable()
    locking/rwsem, x86: Provide __down_write_killable()
    locking/rwsem, s390: Provide __down_write_killable()
    locking/rwsem, ia64: Provide __down_write_killable()
    locking/rwsem, alpha: Provide __down_write_killable()
    locking/rwsem: Introduce basis for down_write_killable()
    locking/rwsem, sparc: Drop superfluous arch specific implementation
    locking/rwsem, sh: Drop superfluous arch specific implementation
    locking/rwsem, xtensa: Drop superfluous arch specific implementation
    locking/rwsem: Drop explicit memory barriers
    locking/rwsem: Get rid of __down_write_nested()

    Linus Torvalds
     

16 May, 2016

1 commit

  • Tetsuo Handa reported that the new signal_pending exit path in
    __rwsem_down_write_failed_common() was breaking his kernel.

    Upon inspection it was found that there are two things wrong with it:

    - it forgets to remove WAITING_BIAS if it leaves the list empty, and
    - it forgets to wake further waiters that were blocked on the
    now-removed waiter.

    Especially the first issue causes new lock attempts to block and stall
    indefinitely, as the code assumes that pending waiters mean there is
    an owner that will wake when it releases the lock.

    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Tested-by: Michal Hocko
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Andrew Morton
    Cc: Arnaldo Carvalho de Melo
    Cc: Chris Zankel
    Cc: David S. Miller
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Max Filippov
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vince Weaver
    Cc: Waiman Long
    Link: http://lkml.kernel.org/r/20160512115745.GP3192@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

05 May, 2016

4 commits

  • Specifically around the debugfs file creation calls: I have no idea if
    they could ever possibly fail, but this is core code (debug aside), so
    let's at least check the return value and report anything fishy.
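
    [ For illustration, a minimal sketch of the kind of check described
    above; the "qlockstat" directory name, the qstat_fops structure and
    the error handling are placeholders, not the actual code: ]

        struct dentry *dir, *file;

        dir = debugfs_create_dir("qlockstat", NULL);
        if (!dir)
                return -ENOMEM;         /* report the failure instead of ignoring it */

        file = debugfs_create_file("pv_hash_hops", 0400, dir, NULL, &qstat_fops);
        if (!file)
                pr_warn("Could not create 'pv_hash_hops' debugfs entry\n");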

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Waiman Long
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160420041725.GC3472@linux-uzut.site
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • ... remove the redundant second iteration; this is most likely a
    copy/paste buglet.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dave@stgolabs.net
    Cc: waiman.long@hpe.com
    Link: http://lkml.kernel.org/r/1460961103-24953-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The problem with the existing lock pinning is that each pin is of
    value 1; this means you can simply unpin if you know it's pinned,
    without having any extra information.

    This scheme generates a random (16-bit) cookie for each pin and
    requires this same cookie to unpin. This means you have to keep the
    cookie in context.

    No objsize difference for !LOCKDEP kernels.
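
    [ A sketch of the resulting usage pattern; the lockdep_pin_lock() /
    lockdep_unpin_lock() helper names and the struct pin_cookie type are
    assumed here, and 'lock' is just an example lockdep_map user: ]

        struct pin_cookie cookie;

        cookie = lockdep_pin_lock(&lock->dep_map);      /* returns a random 16-bit cookie */
        /* ... code that must not drop 'lock' ... */
        lockdep_unpin_lock(&lock->dep_map, cookie);     /* only the matching cookie unpins */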

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

28 Apr, 2016

1 commit

  • This commit replaces an #ifdef with IS_ENABLED(), saving five lines.
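
    [ The general shape of such a conversion (a generic sketch, not the
    actual hunk; do_expensive_check() is a made-up helper): ]

        /* Before: preprocessor conditional */
        #ifdef CONFIG_DEBUG_LOCKDEP
                do_expensive_check(lock);
        #endif

        /* After: ordinary C; the compiler still removes the dead branch */
        if (IS_ENABLED(CONFIG_DEBUG_LOCKDEP))
                do_expensive_check(lock);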

    Signed-off-by: Paul E. McKenney
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: corbet@lwn.net
    Cc: dave@stgolabs.net
    Cc: dhowells@redhat.com
    Cc: linux-doc@vger.kernel.org
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/1461691328-5429-4-git-send-email-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

26 Apr, 2016

1 commit

  • In ext4, there is a race condition between changing an inode's journal
    mode and ext4_writepages(). While ext4_writepages() is executing on an
    inode in non-journalled mode, the inode's journal mode could be enabled
    by ioctl(), and pages dirtied after switching the journal mode would
    still be exposed to ext4_writepages() in non-journalled mode. To resolve
    this problem, we use a filesystem-wide per-cpu rw semaphore, as suggested
    by Jan Kara, because we don't want to waste space in ext4_inode_info for
    this rare case.
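
    [ A rough sketch of the locking scheme described above, using the
    kernel's percpu_rw_semaphore API; the s_journal_flag_rwsem field name
    and the exact call sites are simplifications: ]

        /* writeback takes the semaphore for read (cheap, per-cpu) */
        percpu_down_read(&sbi->s_journal_flag_rwsem);
        ret = ext4_writepages(mapping, wbc);
        percpu_up_read(&sbi->s_journal_flag_rwsem);

        /* the rare journal-mode switch takes it for write, draining readers */
        percpu_down_write(&sbi->s_journal_flag_rwsem);
        err = ext4_change_inode_journal_flag(inode, val);
        percpu_up_write(&sbi->s_journal_flag_rwsem);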

    Signed-off-by: Daeho Jeong
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Daeho Jeong
     

23 Apr, 2016

2 commits

  • lock_chain::base is used to store an index into the chain_hlocks[]
    array; however, that array contains more elements than can be indexed
    using a u16.

    Change the lock_chain structure to use a bitfield to encode the data
    it needs, and add BUILD_BUG_ON() assertions to check that the fields
    are wide enough.

    Also, for DEBUG_LOCKDEP, assert that we don't run out of elements of
    that array, as that would wreck the collision detection.
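
    [ Roughly the shape of the change, sketched from the description above;
    the exact field widths are illustrative: ]

        struct lock_chain {
                /* see BUILD_BUG_ON()s in lookup_chain_cache() */
                unsigned int            irq_context :  2,
                                        depth       :  6,
                                        base        : 24;
                struct hlist_node       entry;
                u64                     chain_key;
        };

        BUILD_BUG_ON((1UL << 24) <= ARRAY_SIZE(chain_hlocks));
        BUILD_BUG_ON((1UL << 6)  <= ARRAY_SIZE(curr->held_locks));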

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alfredo Alvarez Fernandez
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Sedat Dilek
    Cc: Theodore Ts'o
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160330093659.GS3408@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • task_irq_context() returns the encoded irq_context of the task; the
    return value is encoded in the same way as ->irq_context of held_lock.

    It always returns 0 if !(CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING).

    Signed-off-by: Boqun Feng
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Josh Triplett
    Cc: Lai Jiangshan
    Cc: Linus Torvalds
    Cc: Mathieu Desnoyers
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: sasha.levin@oracle.com
    Link: http://lkml.kernel.org/r/1455602265-16490-2-git-send-email-boqun.feng@gmail.com
    Signed-off-by: Ingo Molnar

    Boqun Feng
     

22 Apr, 2016

1 commit

  • Now that all the architectures implement the necessary glue code, we can
    introduce down_write_killable(). The only difference with respect to the
    regular down_write() is that the slow path waits in TASK_KILLABLE state
    and interruption by a fatal signal is reported as -EINTR to the caller.
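
    [ The resulting calling convention, as a minimal sketch; mmap_sem is
    just the motivating example mentioned in the merge above, and returning
    -EINTR directly is only one possible caller policy: ]

        if (down_write_killable(&mm->mmap_sem))
                return -EINTR;          /* a fatal signal interrupted the wait */

        /* ... modify the address space ... */

        up_write(&mm->mmap_sem);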

    Signed-off-by: Michal Hocko
    Cc: Andrew Morton
    Cc: Chris Zankel
    Cc: David S. Miller
    Cc: Linus Torvalds
    Cc: Max Filippov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Signed-off-by: Davidlohr Bueso
    Cc: Signed-off-by: Jason Low
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: linux-alpha@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-s390@vger.kernel.org
    Cc: linux-sh@vger.kernel.org
    Cc: linux-xtensa@linux-xtensa.org
    Cc: sparclinux@vger.kernel.org
    Link: http://lkml.kernel.org/r/1460041951-22347-12-git-send-email-mhocko@kernel.org
    Signed-off-by: Ingo Molnar

    Michal Hocko
     

19 Apr, 2016

1 commit

  • While playing with the qstat statistics (in /qlockstat/) I ran into
    the following splat on a VM when opening pv_hash_hops:

    divide error: 0000 [#1] SMP
    ...
    RIP: 0010:[] [] qstat_read+0x12e/0x1e0
    ...
    Call Trace:
    [] ? mem_cgroup_commit_charge+0x6c/0xd0
    [] ? page_add_new_anon_rmap+0x8c/0xd0
    [] ? handle_mm_fault+0x1439/0x1b40
    [] ? do_mmap+0x449/0x550
    [] ? __vfs_read+0x23/0xd0
    [] ? rw_verify_area+0x52/0xd0
    [] ? vfs_read+0x81/0x120
    [] ? SyS_read+0x42/0xa0
    [] ? entry_SYSCALL_64_fastpath+0x1e/0xa8

    Fix this by verifying that qstat_pv_kick_unlock is in fact non-zero,
    similarly to what the qstat_pv_latency_wake case does; if nothing else,
    this can come from resetting the statistics, so having 0 kicks is quite
    valid in this context.
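
    [ The essence of the guard, in sketch form; per_cpu_sum() is a
    hypothetical helper standing in for the per-CPU summation the stat
    code performs: ]

        u64 kicks = per_cpu_sum(qstat_pv_kick_unlock);
        u64 hops  = per_cpu_sum(qstat_pv_hash_hops);
        u64 avg   = 0;

        if (kicks)              /* stats may have just been reset; 0 kicks is valid */
                avg = div64_u64(hops, kicks);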

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Waiman Long
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dave@stgolabs.net
    Cc: waiman.long@hpe.com
    Link: http://lkml.kernel.org/r/1460961103-24953-1-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

13 Apr, 2016

5 commits

  • Introduce a generic implementation necessary for down_write_killable().

    This is a trivial extension of the already existing down_write() call
    which can be interrupted by SIGKILL. This patch doesn't provide
    down_write_killable() yet because architectures first have to provide
    the necessary pieces.

    rwsem_down_write_failed() which is a generic slow path for the
    write lock is extended to take a task state and renamed to
    __rwsem_down_write_failed_common(). The return value is either a valid
    semaphore pointer or ERR_PTR(-EINTR).

    rwsem_down_write_failed_killable() is exported as a new way to wait for
    the lock and be killable.

    For the rwsem-spinlock implementation, the current __down_write() is
    updated in a similar way to __rwsem_down_write_failed_common(), except
    that it doesn't need new exports, just a visible __down_write_killable().

    Architectures which are not using the generic rwsem implementation are
    supposed to provide their __down_write_killable() implementation and
    use rwsem_down_write_failed_killable() for the slow path.
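
    [ The shape of the split described above, as a sketch (the common
    helper's body is elided): ]

        /* generic write-lock slowpath, now parameterised by the sleep state */
        extern struct rw_semaphore *
        __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state);

        struct rw_semaphore * __sched
        rwsem_down_write_failed(struct rw_semaphore *sem)
        {
                return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
        }

        struct rw_semaphore * __sched
        rwsem_down_write_failed_killable(struct rw_semaphore *sem)
        {
                return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
        }
        EXPORT_SYMBOL(rwsem_down_write_failed_killable);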

    Signed-off-by: Michal Hocko
    Cc: Andrew Morton
    Cc: Chris Zankel
    Cc: David S. Miller
    Cc: Linus Torvalds
    Cc: Max Filippov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Signed-off-by: Davidlohr Bueso
    Cc: Signed-off-by: Jason Low
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: linux-alpha@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-s390@vger.kernel.org
    Cc: linux-sh@vger.kernel.org
    Cc: linux-xtensa@linux-xtensa.org
    Cc: sparclinux@vger.kernel.org
    Link: http://lkml.kernel.org/r/1460041951-22347-7-git-send-email-mhocko@kernel.org
    Signed-off-by: Ingo Molnar

    Michal Hocko
     
  • This is no longer used anywhere and all callers (__down_write()) use
    0 as a subclass. Ditch __down_write_nested() to make the code easier
    to follow.

    This shouldn't introduce any functional change.

    Signed-off-by: Michal Hocko
    Acked-by: Davidlohr Bueso
    Cc: Andrew Morton
    Cc: Chris Zankel
    Cc: David S. Miller
    Cc: Linus Torvalds
    Cc: Max Filippov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Signed-off-by: Davidlohr Bueso
    Cc: Signed-off-by: Jason Low
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: linux-alpha@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-s390@vger.kernel.org
    Cc: linux-sh@vger.kernel.org
    Cc: linux-xtensa@linux-xtensa.org
    Cc: sparclinux@vger.kernel.org
    Link: http://lkml.kernel.org/r/1460041951-22347-2-git-send-email-mhocko@kernel.org
    Signed-off-by: Ingo Molnar

    Michal Hocko
     
  • This function compiles to 1328 bytes of machine code. Three callsites.

    Registering a new lock class is definitely not *that* time-critical to inline it.

    Signed-off-by: Denys Vlasenko
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/1460141926-13069-5-git-send-email-dvlasenk@redhat.com
    Signed-off-by: Ingo Molnar

    Denys Vlasenko
     
  • It has been found that paths that invoke cleanups through
    lock_torture_cleanup() can trigger NULL pointer dereferencing
    bugs during the statistics printing phase. This is mainly
    because we should not be calling into statistics before we are
    sure things have been set up correctly.

    Specifically, early checks (and the need for handling this in
    the cleanup call) only include parameter checks and basic
    statistics allocation. Once we start write/read kthreads
    we then consider the test as started. As such, update the function
    in question to check for cxt.lwsa writer stats; if they are not set,
    we either have a bogus parameter or an -ENOMEM situation and
    therefore only need to deal with general torture calls.
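
    [ The gist of the check, sketched against the description above; the
    exact placement and the torture_cleanup_end() call are simplifications
    of the real cleanup path: ]

        static void lock_torture_cleanup(void)
        {
                /*
                 * Early cleanup: the test never really started (bogus
                 * parameters or -ENOMEM), so there are no statistics to
                 * print; only undo the generic torture bookkeeping.
                 */
                if (!cxt.lwsa) {
                        torture_cleanup_end();
                        return;
                }

                /* ... normal teardown and statistics printing ... */
        }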

    Reported-and-tested-by: Kefeng Wang
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Paul E. McKenney
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bobby.prani@gmail.com
    Cc: dhowells@redhat.com
    Cc: dipankar@in.ibm.com
    Cc: dvhart@linux.intel.com
    Cc: edumazet@google.com
    Cc: fweisbec@gmail.com
    Cc: jiangshanlai@gmail.com
    Cc: josh@joshtriplett.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: oleg@redhat.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/1460476038-27060-2-git-send-email-paulmck@linux.vnet.ibm.com
    [ Improved the changelog. ]
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • For the case of rtmutex torturing we will randomly call into the
    boost() handler, including upon module exiting when the tasks are
    deboosted before stopping. In such cases the task may or may not have
    already been boosted, and therefore the NULL being explicitly passed
    can occur anywhere. Currently we simply assume that the task is at a
    higher prio and, in consequence, dereference a NULL pointer.

    This patch fixes the case of a rmmod locktorture exploding while
    pounding on the rtmutex lock (partial trace):

    task: ffff88081026cf80 ti: ffff880816120000 task.ti: ffff880816120000
    RIP: 0010:[] [] torture_random+0x5/0x60 [torture]
    RSP: 0018:ffff880816123eb0 EFLAGS: 00010206
    RAX: ffff88081026cf80 RBX: ffff880816bfa630 RCX: 0000000000160d1b
    RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
    RBP: ffff88081026cf80 R08: 000000000000001f R09: ffff88017c20ca80
    R10: 0000000000000000 R11: 000000000048c316 R12: ffffffffa05d1840
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88203f880000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000008 CR3: 0000000001c0a000 CR4: 00000000000406e0
    Stack:
    ffffffffa05d141d ffff880816bfa630 ffffffffa05d1922 ffff88081e70c2c0
    ffff880816bfa630 ffffffff81095fed 0000000000000000 ffffffff8107bf60
    ffff880816bfa630 ffffffff00000000 ffff880800000000 ffff880816123f08
    Call Trace:
    [] torture_rtmutex_boost+0x1d/0x90 [locktorture]
    [] lock_torture_writer+0xe2/0x170 [locktorture]
    [] kthread+0xbd/0xe0
    [] ret_from_fork+0x3f/0x70

    This patch ensures that if the random state pointer is not NULL and
    current is not boosted, nothing is done.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Paul E. McKenney
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bobby.prani@gmail.com
    Cc: dhowells@redhat.com
    Cc: dipankar@in.ibm.com
    Cc: dvhart@linux.intel.com
    Cc: edumazet@google.com
    Cc: fweisbec@gmail.com
    Cc: jiangshanlai@gmail.com
    Cc: josh@joshtriplett.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: oleg@redhat.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/1460476038-27060-1-git-send-email-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

04 Apr, 2016

1 commit

  • Fix this:

    kernel/locking/lockdep.c:2051:13: warning: ‘print_collision’ defined but not used [-Wunused-function]
    static void print_collision(struct task_struct *curr,
    ^

    Signed-off-by: Borislav Petkov
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1459759327-2880-1-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Borislav Petkov
     

31 Mar, 2016

1 commit

  • A sequence of pairs [class_idx -> corresponding chain_key iteration]
    is printed for both the current held_lock chain and the cached chain.

    That exposes the two different class_idx sequences that led to that
    particular hash value.

    This helps with debugging hash chain collision reports.

    Signed-off-by: Alfredo Alvarez Fernandez
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-fsdevel@vger.kernel.org
    Cc: sedat.dilek@gmail.com
    Cc: tytso@mit.edu
    Link: http://lkml.kernel.org/r/1459357416-19190-1-git-send-email-alfredoalvarezernandez@gmail.com
    Signed-off-by: Ingo Molnar

    Alfredo Alvarez Fernandez
     

23 Mar, 2016

1 commit

  • kcov provides code coverage collection for coverage-guided fuzzing
    (randomized testing). Coverage-guided fuzzing is a testing technique
    that uses coverage feedback to determine new interesting inputs to a
    system. A notable user-space example is AFL
    (http://lcamtuf.coredump.cx/afl/). However, this technique is not
    widely used for kernel testing due to missing compiler and kernel
    support.

    kcov does not aim to collect as much coverage as possible. It aims to
    collect more or less stable coverage that is a function of syscall
    inputs. To achieve this goal it does not collect coverage in soft/hard
    interrupts, and instrumentation of some inherently non-deterministic or
    non-interesting parts of the kernel is disabled (e.g. scheduler, locking).

    Currently there is a single coverage collection mode (tracing), but the
    API anticipates additional collection modes. Initially I also
    implemented a second mode which exposes coverage in a fixed-size hash
    table of counters (what Quentin used in his original patch). I've
    dropped the second mode for simplicity.

    This patch adds the necessary support on the kernel side. The
    complementary compiler support was added in gcc revision 231296.

    We've used this support to build the syzkaller system call fuzzer, which
    has found 90 kernel bugs in just 2 months:

    https://github.com/google/syzkaller/wiki/Found-Bugs

    We've also found 30+ bugs in our internal systems with syzkaller.
    Another (yet unexplored) direction where kcov coverage would greatly
    help is more traditional "blob mutation". For example, mounting a
    random blob as a filesystem, or receiving a random blob over wire.

    Why not gcov? A typical fuzzing loop looks as follows: (1) reset
    coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
    typical coverage can be just a dozen basic blocks (e.g. an invalid
    input). In such a context gcov becomes prohibitively expensive, as the
    reset/collect coverage steps depend on the total number of basic
    blocks/edges in the program (in the case of the kernel it is about 2M).
    The cost of kcov depends only on the number of executed basic
    blocks/edges. On top of that, the kernel requires per-thread coverage
    because there are always background threads and unrelated processes
    that also produce coverage. With inlined gcov instrumentation,
    per-thread coverage is not possible.

    kcov exposes kernel PCs and control flow to user-space which is
    insecure. But debugfs should not be mapped as user accessible.
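
    [ For reference, a condensed user-space usage sketch in the style of the
    kcov documentation; the ioctl numbers and the debugfs path follow that
    documentation and should be treated as illustrative: ]

        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/ioctl.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
        #define KCOV_ENABLE     _IO('c', 100)
        #define KCOV_DISABLE    _IO('c', 101)
        #define COVER_SIZE      (64 << 10)

        int main(void)
        {
                unsigned long *cover, n, i;
                int fd = open("/sys/kernel/debug/kcov", O_RDWR);

                if (fd == -1 || ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE))
                        exit(1);
                cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
                             PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
                if (cover == MAP_FAILED || ioctl(fd, KCOV_ENABLE, 0))
                        exit(1);

                cover[0] = 0;                   /* reset coverage */
                read(-1, NULL, 0);              /* the code under test */
                n = cover[0];                   /* number of collected PCs */

                for (i = 0; i < n; i++)
                        printf("0x%lx\n", cover[i + 1]);

                ioctl(fd, KCOV_DISABLE, 0);
                close(fd);
                return 0;
        }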

    Based on a patch by Quentin Casasnovas.

    [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
    [akpm@linux-foundation.org: unbreak allmodconfig]
    [akpm@linux-foundation.org: follow x86 Makefile layout standards]
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Kees Cook
    Cc: syzkaller
    Cc: Vegard Nossum
    Cc: Catalin Marinas
    Cc: Tavis Ormandy
    Cc: Will Deacon
    Cc: Quentin Casasnovas
    Cc: Kostya Serebryany
    Cc: Eric Dumazet
    Cc: Alexander Potapenko
    Cc: Kees Cook
    Cc: Bjorn Helgaas
    Cc: Sasha Levin
    Cc: David Drysdale
    Cc: Ard Biesheuvel
    Cc: Andrey Ryabinin
    Cc: Kirill A. Shutemov
    Cc: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     

16 Mar, 2016

1 commit

  • $ make tags
    GEN tags
    ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
    ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
    ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
    ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
    ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
    ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
    ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
    ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
    ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"

    Which are all the result of the DEFINE_PER_CPU pattern:

    scripts/tags.sh:200: '/\
    Acked-by: David S. Miller
    Acked-by: Rafael J. Wysocki
    Cc: Tejun Heo
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

29 Feb, 2016

7 commits

  • Add detection for chain_key collision under CONFIG_DEBUG_LOCKDEP.
    When a collision is detected the problem is reported and all lock
    debugging is turned off.

    Tested using liblockdep and the added tests before and after
    applying the fix, confirming both that the code added for the
    detection correctly reports the problem and that the fix actually
    fixes it.

    Tested tweaking lockdep to generate false collisions and
    verified that the problem is reported and that lock debugging is
    turned off.

    Also tested with lockdep's test suite after applying the patch:

    [ 0.000000] Good, all 253 testcases passed! |

    Signed-off-by: Alfredo Alvarez Fernandez
    Cc: Alfredo Alvarez Fernandez
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: sasha.levin@oracle.com
    Link: http://lkml.kernel.org/r/1455864533-7536-4-git-send-email-alfredoalvarezernandez@gmail.com
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The chain_key hashing macro iterate_chain_key(key1, key2) does not
    generate a new different value if both key1 and key2 are 0. In that
    case the generated value is again 0. This can lead to collisions which
    can result in lockdep not detecting deadlocks or circular
    dependencies.

    Avoid the problem by using class_idx (1-based) instead of class id
    (0-based) as an input for the hashing macro 'key2' in
    iterate_chain_key(key1, key2).

    The use of class id created collisions in cases like the following:

    1.- Consider an initial state in which no class has been acquired yet.
    Under these circumstances an AA deadlock will not be detected by
    lockdep:

    lock [key1,key2]->new key (key1=old chain_key, key2=id)
    --------------------------
    A [0,0]->0
    A [0,0]->0 (collision)

    The newly generated chain_key collides with the one used before and as
    a result the check for a deadlock is skipped

    A simple test using liblockdep and a pthread mutex confirms the
    problem: (omitting stack traces)

    new class 0xe15038: 0x7ffc64950f20
    acquire class [0xe15038] 0x7ffc64950f20
    acquire class [0xe15038] 0x7ffc64950f20
    hash chain already cached, key: 0000000000000000 tail class:
    [0xe15038] 0x7ffc64950f20

    2.- Consider an ABBA in 2 different tasks and no class yet acquired.

    T1 [key1,key2]->new key          T2 [key1,key2]->new key
    --                               --
    A  [0,0]->0

                                     B [0,1]->1

    B  [0,1]->1 (collision)

                                     A

    In this case the collision prevents lockdep from creating the new
    dependency A->B. This in turn results in lockdep not detecting the
    circular dependency when T2 acquires A.

    Signed-off-by: Alfredo Alvarez Fernandez
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: sasha.levin@oracle.com
    Link: http://lkml.kernel.org/r/1455147212-2389-4-git-send-email-alfredoalvarezernandez@gmail.com
    Signed-off-by: Ingo Molnar

    Alfredo Alvarez Fernandez
     
  • Make use of wake-queues and enable the wakeup to occur after releasing
    the wait_lock. This is similar to what we do with the rtmutex top
    waiter, slightly shortening the critical region and allowing other
    waiters to acquire the wait_lock sooner. In low contention cases it can
    also help the recently woken waiter to find the wait_lock available
    (fastpath) when it continues execution.
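
    [ A sketch of the pattern, assuming the wake_q helpers as they existed
    at the time (WAKE_Q(), wake_q_add(), wake_up_q()); the surrounding
    mutex internals are elided: ]

        WAKE_Q(wake_q);

        raw_spin_lock(&lock->wait_lock);
        /* ... pick the waiter to be woken ... */
        wake_q_add(&wake_q, waiter->task);
        raw_spin_unlock(&lock->wait_lock);

        wake_up_q(&wake_q);     /* the wakeup happens after wait_lock is dropped */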

    Reviewed-by: Waiman Long
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Ding Tianhong
    Cc: Jason Low
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Waiman Long
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20160125022343.GA3322@linux-uzut.site
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • This patch enables the tracking of the number of slowpath locking
    operations performed. This can be used to compare against the number
    of lock stealing operations to see what percentage of locks are stolen
    versus acquired via the regular slowpath.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Douglas Hatch
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1449778666-13593-2-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • The newly introduced smp_cond_acquire() was used to replace the
    slowpath lock acquisition loop. Similarly, the new function can also
    be applied to the pending bit locking loop. This patch uses the new
    function in that loop.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Douglas Hatch
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1449778666-13593-1-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • This patch moves the lock stealing count tracking code into
    pv_queued_spin_steal_lock() instead of via a jacket function,
    simplifying the code.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Douglas Hatch
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1449778666-13593-3-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • Similar to commit b4b29f94856a ("locking/osq: Fix ordering of node
    initialisation in osq_lock"), the use of xchg_acquire() is
    fundamentally broken with MCS-like constructs.

    Furthermore, it turns out we rely on the global transitivity of this
    operation because the unlock path observes the pointer with a
    READ_ONCE(), not an smp_load_acquire().

    This is non-critical because the MCS code isn't actually used and
    mostly serves as documentation, a stepping stone to the more complex
    things we've built on top of the idea.

    Reported-by: Andrea Parri
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Fixes: 3552a07a9c4a ("locking/mcs: Use acquire/release semantics")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 Feb, 2016

3 commits

  • Lockdep is initialized at compile time now. Get rid of lockdep_init().

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Krinkin
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Cc: mm-commits@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Andrey Ryabinin
     
  • Mike said:

    : CONFIG_UBSAN_ALIGNMENT breaks x86-64 kernel with lockdep enabled, i.e.
    : kernel with CONFIG_UBSAN_ALIGNMENT=y fails to load without even any error
    : message.
    :
    : The problem is that ubsan callbacks use spinlocks and might be called
    : before lockdep is initialized. Particularly this line in the
    : reserve_ebda_region function causes problem:
    :
    : lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
    :
    : If i put lockdep_init() before reserve_ebda_region call in
    : x86_64_start_reservations kernel loads well.

    Fix this ordering issue permanently: change lockdep so that it uses hlists
    for the hash tables. Unlike a list_head, an hlist_head is in its
    initialized state when it is all-zeroes, so lockdep is ready for operation
    immediately upon boot - lockdep_init() need not have run.

    The patch will also save some memory.

    Probably lockdep_init() and lockdep_initialized can be done away with now.
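
    [ The property being relied on, in sketch form; classhash_table and
    CLASSHASH_SIZE are lockdep's names, while old_table is just a stand-in
    for the previous list_head-based table: ]

        /*
         * A zeroed list_head is NOT an empty list: both pointers must point
         * back at the head, which requires LIST_HEAD_INIT or runtime init.
         */
        static struct list_head  old_table[CLASSHASH_SIZE];    /* unusable while all-zero */

        /*
         * A zeroed hlist_head (first == NULL) IS a valid empty list, so a
         * static table in BSS works before any initialisation code runs.
         */
        static struct hlist_head classhash_table[CLASSHASH_SIZE];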

    Suggested-by: Mike Krinkin
    Reported-by: Mike Krinkin
    Signed-off-by: Andrew Morton
    Cc: Andrey Ryabinin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Cc: mm-commits@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Andrew Morton
     
  • check_prev_add() caches the saved stack trace in a static trace variable
    to avoid duplicate save_trace() calls in dependencies involving trylocks.
    But that caching logic contains a bug. We may not save the trace on the
    first iteration due to an early return from check_prev_add(). Then on
    the second iteration, when we actually need the trace, we don't save it
    because we think that we've already saved it.

    Let check_prev_add() itself control when the stack is saved.

    There is another bug: the trace variable is protected by the graph lock,
    but we can temporarily release the graph lock during printing.

    Fix this by invalidating the cached stack trace when we release the
    graph lock.

    Signed-off-by: Dmitry Vyukov
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: glider@google.com
    Cc: kcc@google.com
    Cc: peter@hurleysoftware.com
    Cc: sasha.levin@oracle.com
    Link: http://lkml.kernel.org/r/1454593240-121647-1-git-send-email-dvyukov@google.com
    Signed-off-by: Ingo Molnar

    Dmitry Vyukov
     

26 Jan, 2016

1 commit

  • Sasha reported a lockdep splat about a potential deadlock between RCU boosting
    rtmutex and the posix timer it_lock.

    CPU0                                    CPU1

    rtmutex_lock(&rcu->rt_mutex)
      spin_lock(&rcu->rt_mutex.wait_lock)
                                            local_irq_disable()
                                            spin_lock(&timer->it_lock)
                                            spin_lock(&rcu->mutex.wait_lock)
    --> Interrupt
        spin_lock(&timer->it_lock)

    This is caused by the following code sequence on CPU1

    rcu_read_lock();
    x = lookup();
    if (x)
            spin_lock_irqsave(&x->it_lock);
    rcu_read_unlock();
    return x;

    We could fix that in the posix timer code by keeping rcu read locked across
    the spinlocked and irq disabled section, but the above sequence is common and
    there is no reason not to support it.

    Making rt_mutex.wait_lock irq safe prevents the deadlock.

    Reported-by: Sasha Levin
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney

    Thomas Gleixner
     

12 Jan, 2016

1 commit

  • Pull locking updates from Ingo Molnar:
    "So we have a laundry list of locking subsystem changes:

    - continuing barrier API and code improvements

    - futex enhancements

    - atomics API improvements

    - pvqspinlock enhancements: in particular lock stealing and adaptive
    spinning

    - qspinlock micro-enhancements"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op
    futex: Cleanup the goto confusion in requeue_pi()
    futex: Remove pointless put_pi_state calls in requeue()
    futex: Document pi_state refcounting in requeue code
    futex: Rename free_pi_state() to put_pi_state()
    futex: Drop refcount if requeue_pi() acquired the rtmutex
    locking/barriers, arch: Remove ambiguous statement in the smp_store_mb() documentation
    locking/barriers, arch: Use smp barriers in smp_store_release()
    locking/cmpxchg, arch: Remove tas() definitions
    locking/pvqspinlock: Queue node adaptive spinning
    locking/pvqspinlock: Allow limited lock stealing
    locking/pvqspinlock: Collect slowpath lock statistics
    sched/core, locking: Document Program-Order guarantees
    locking, sched: Introduce smp_cond_acquire() and use it
    locking/pvqspinlock, x86: Optimize the PV unlock code path
    locking/qspinlock: Avoid redundant read of next pointer
    locking/qspinlock: Prefetch the next node cacheline
    locking/qspinlock: Use _acquire/_release() versions of cmpxchg() & xchg()
    atomics: Add test for atomic operations with _relaxed variants

    Linus Torvalds
     

18 Dec, 2015

1 commit

  • The Cavium guys reported a soft lockup on their arm64 machine, caused by
    commit c55a6ffa6285 ("locking/osq: Relax atomic semantics"):

    mutex_optimistic_spin+0x9c/0x1d0
    __mutex_lock_slowpath+0x44/0x158
    mutex_lock+0x54/0x58
    kernfs_iop_permission+0x38/0x70
    __inode_permission+0x88/0xd8
    inode_permission+0x30/0x6c
    link_path_walk+0x68/0x4d4
    path_openat+0xb4/0x2bc
    do_filp_open+0x74/0xd0
    do_sys_open+0x14c/0x228
    SyS_openat+0x3c/0x48
    el0_svc_naked+0x24/0x28

    This is because in osq_lock we initialise the node for the current CPU:

    node->locked = 0;
    node->next = NULL;
    node->cpu = curr;

    and then publish the current CPU in the lock tail:

    old = atomic_xchg_acquire(&lock->tail, curr);

    Once the update to lock->tail is visible to another CPU, the node is
    then live and can be both read and updated by concurrent lockers.

    Unfortunately, the ACQUIRE semantics of the xchg operation mean that
    there is no guarantee the contents of the node will be visible before
    lock tail is updated. This can lead to lock corruption when, for
    example, a concurrent locker races to set the next field.
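
    [ In essence, the change replaces the relaxed exchange with a fully
    ordered one; a sketch rather than the literal diff: ]

        /* before (broken): ACQUIRE does not order the node stores above it */
        old = atomic_xchg_acquire(&lock->tail, curr);

        /* after: the full barrier in xchg() publishes the initialised node */
        old = atomic_xchg(&lock->tail, curr);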

    Fixes: c55a6ffa6285 ("locking/osq: Relax atomic semantics")
    Reported-by: David Daney
    Reported-by: Andrew Pinski
    Tested-by: Andrew Pinski
    Acked-by: Davidlohr Bueso
    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/1449856001-21177-1-git-send-email-will.deacon@arm.com
    Signed-off-by: Linus Torvalds

    Will Deacon
     

04 Dec, 2015

2 commits

  • In an overcommitted guest where some vCPUs have to be halted to make
    forward progress in other areas, it is highly likely that a vCPU later
    in the spinlock queue will be spinning while the ones earlier in the
    queue would have been halted. The spinning in the later vCPUs is then
    just a waste of precious CPU cycles because they are not going to
    get the lock soon, as the earlier ones have to be woken up and take
    their turn to get the lock.

    This patch implements an adaptive spinning mechanism where the vCPU
    will call pv_wait() if the previous vCPU is not running.

    Linux kernel builds were run in KVM guest on an 8-socket, 4
    cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
    Haswell-EX system. Both systems are configured to have 32 physical
    CPUs. The kernel build times before and after the patch were:

                     Westmere                   Haswell
    Patch            32 vCPUs   48 vCPUs        32 vCPUs   48 vCPUs
    -----            --------   --------        --------   --------
    Before patch     3m02.3s    5m00.2s         1m43.7s    3m03.5s
    After patch      3m03.0s    4m37.5s         1m43.0s    2m47.2s

    For 32 vCPUs, this patch doesn't cause any noticeable change in
    performance. For 48 vCPUs (over-committed), there is about 8%
    performance improvement.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-8-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • This patch allows one attempt for the lock waiter to steal the lock
    when entering the PV slowpath. To prevent lock starvation, the pending
    bit will be set by the queue head vCPU when it is in the active lock
    spinning loop to disable any lock stealing attempt. This helps to
    reduce the performance penalty caused by lock waiter preemption while
    not having much of the downsides of a real unfair lock.

    The pv_wait_head() function was renamed as pv_wait_head_or_lock()
    as it was modified to acquire the lock before returning. This is
    necessary because of possible lock stealing attempts from other tasks.

    Linux kernel builds were run in KVM guest on an 8-socket, 4
    cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
    Haswell-EX system. Both systems are configured to have 32 physical
    CPUs. The kernel build times before and after the patch were:

                     Westmere                   Haswell
    Patch            32 vCPUs   48 vCPUs        32 vCPUs   48 vCPUs
    -----            --------   --------        --------   --------
    Before patch     3m15.6s    10m56.1s        1m44.1s    5m29.1s
    After patch      3m02.3s    5m00.2s         1m43.7s    3m03.5s

    For the overcommitted case (48 vCPUs), this patch is able to reduce
    kernel build time by more than 54% for Westmere and 44% for Haswell.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447190336-53317-1-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long