23 Feb, 2015

2 commits

  • Pull more vfs updates from Al Viro:
    "Assorted stuff from this cycle. The big ones here are multilayer
    overlayfs from Miklos and beginning of sorting ->d_inode accesses out
    from David"

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (51 commits)
    autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation
    procfs: fix race between symlink removals and traversals
    debugfs: leave freeing a symlink body until inode eviction
    Documentation/filesystems/Locking: ->get_sb() is long gone
    trylock_super(): replacement for grab_super_passive()
    fanotify: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
    Cachefiles: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
    VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)
    SELinux: Use d_is_positive() rather than testing dentry->d_inode
    Smack: Use d_is_positive() rather than testing dentry->d_inode
    TOMOYO: Use d_is_dir() rather than d_inode and S_ISDIR()
    Apparmor: Use d_is_positive/negative() rather than testing dentry->d_inode
    Apparmor: mediated_filesystem() should use dentry->d_sb not inode->i_sb
    VFS: Split DCACHE_FILE_TYPE into regular and special types
    VFS: Add a fallthrough flag for marking virtual dentries
    VFS: Add a whiteout dentry type
    VFS: Introduce inode-getting helpers for layered/unioned fs environments
    Infiniband: Fix potential NULL d_inode dereference
    posix_acl: fix reference leaks in posix_acl_create
    autofs4: Wrong format for printing dentry
    ...

    Linus Torvalds
     
  • Convert the following where appropriate:

    (1) S_ISLNK(dentry->d_inode->i_mode) to d_is_symlink(dentry).

    (2) S_ISREG(dentry->d_inode->i_mode) to d_is_reg(dentry).

    (3) S_ISDIR(dentry->d_inode->i_mode) to d_is_dir(dentry). This is actually more
    complicated than it appears as some calls should be converted to
    d_can_lookup() instead. The difference is whether the directory in
    question is a real dir with a ->lookup op or whether it's a fake dir with
    a ->d_automount op.

    In some circumstances, we can subsume checks for dentry->d_inode not being
    NULL into this, provided the code isn't in a filesystem that expects
    d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
    use d_inode() rather than d_backing_inode() to get the inode pointer).

    Note that the dentry type field may be set to something other than
    DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
    manages the fall-through from a negative dentry to a lower layer. In such a
    case, the dentry type of the negative union dentry is set to the same as the
    type of the lower dentry.

    However, if you know d_inode is not NULL at the call site, then you can use
    the d_is_xxx() functions even in a filesystem.

    There is one further complication: a 0,0 chardev dentry may be labelled
    DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE. Strictly, this was
    intended for special directory entry types that don't have attached inodes.
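
    To illustrate the shape of the conversions, a hedged sketch (not taken
    from the patch itself; do_something() is a stand-in):

        /* Before: the mode test requires dereferencing d_inode */
        if (dentry->d_inode && S_ISDIR(dentry->d_inode->i_mode))
                do_something();

        /* After: d_is_dir() tests the dentry type field instead, which
         * also behaves sensibly for automount points and layered fs */
        if (d_is_dir(dentry))
                do_something();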

    The following perl+coccinelle script was used:

    use strict;

    my $fd;            # declaration needed under 'use strict'
    my @callers;
    open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
        die "Can't grep for S_ISDIR and co. callers";
    @callers = <$fd>;
    close($fd);
    unless (@callers) {
        print "No matches\n";
        exit(0);
    }

    my @cocci = (
        '@@',
        'expression E;',
        '@@',
        '',
        '- S_ISLNK(E->d_inode->i_mode)',
        '+ d_is_symlink(E)',
        '',
        '@@',
        'expression E;',
        '@@',
        '',
        '- S_ISDIR(E->d_inode->i_mode)',
        '+ d_is_dir(E)',
        '',
        '@@',
        'expression E;',
        '@@',
        '',
        '- S_ISREG(E->d_inode->i_mode)',
        '+ d_is_reg(E)' );

    my $coccifile = "tmp.sp.cocci";
    open($fd, ">$coccifile") || die $coccifile;
    print($fd "$_\n") || die $coccifile foreach (@cocci);
    close($fd);

    foreach my $file (@callers) {
        chomp $file;
        print "Processing ", $file, "\n";
        system("spatch", "--sp-file", $coccifile, $file,
               "--in-place", "--no-show-diff") == 0 ||
            die "spatch failed";
    }

    [AV: overlayfs parts skipped]

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

20 Feb, 2015

1 commit

  • Pull kconfig updates from Michal Marek:
    "Yann E Morin was supposed to take over kconfig maintainership, but
    this hasn't happened. So I'm sending a few kconfig patches that I
    collected:

    - Fix for missing va_end in kconfig
    - merge_config.sh displays usage if given too few arguments
    - s/boolean/bool/ in Kconfig files for consistency, with the plan to
    only support bool in the future"

    * 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
    kconfig: use va_end to match corresponding va_start
    merge_config.sh: Display usage if given too few arguments
    kconfig: use bool instead of boolean for type definition attributes

    Linus Torvalds
     

17 Feb, 2015

7 commits

  • All callers of get_xip_mem() are now gone. Remove checks for it,
    initialisers of it, documentation of it and the only implementation of it.
    Also remove mm/filemap_xip.c as it is now empty. Also remove
    documentation of the long-gone get_xip_page().

    Signed-off-by: Matthew Wilcox
    Cc: Andreas Dilger
    Cc: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Kirill A. Shutemov
    Cc: Mathieu Desnoyers
    Cc: Randy Dunlap
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
    It takes a get_block parameter just like nobh_truncate_page() and
    block_truncate_page().

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Mathieu Desnoyers
    Cc: Andreas Dilger
    Cc: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Kirill A. Shutemov
    Cc: Randy Dunlap
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Instead of calling aops->get_xip_mem from the fault handler, the
    filesystem passes a get_block_t that is used to find the appropriate
    blocks.

    This requires that all architectures implement copy_user_page(). At the
    time of writing, mips and arm do not. Patches exist and are in progress.
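
    A minimal sketch of the wiring this implies for ext2 (assuming the
    dax_fault() helper this series describes; ext2_get_block is ext2's
    existing block mapper):

        static int ext2_dax_fault(struct vm_area_struct *vma,
                                  struct vm_fault *vmf)
        {
                return dax_fault(vma, vmf, ext2_get_block);
        }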

    [akpm@linux-foundation.org: remap_file_pages went away]
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Jan Kara
    Cc: Andreas Dilger
    Cc: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Kirill A. Shutemov
    Cc: Mathieu Desnoyers
    Cc: Randy Dunlap
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Cc: Russell King
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Use the generic AIO infrastructure instead of custom read and write
    methods. In addition to giving us support for AIO, this adds the missing
    locking between read() and truncate().

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Andreas Dilger
    Cc: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Kirill A. Shutemov
    Cc: Mathieu Desnoyers
    Cc: Randy Dunlap
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Use an inode flag to tag inodes which should avoid using the page cache.
    Convert ext2 to use it instead of mapping_is_xip(). Prevent I/Os to files
    tagged with the DAX flag from falling back to buffered I/O.
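
    A hedged sketch of the flag test, assuming the S_DAX inode flag the
    series adds (the tree spells this IS_DAX()):

        static inline bool inode_uses_dax(struct inode *inode)
        {
                return inode->i_flags & S_DAX;
        }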

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Jan Kara
    Reviewed-by: Mathieu Desnoyers
    Cc: Andreas Dilger
    Cc: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Kirill A. Shutemov
    Cc: Randy Dunlap
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Currently COW of an XIP file is done by first bringing in a read-only
    mapping, then retrying the fault and copying the page. It is much more
    efficient to tell the fault handler that a COW is being attempted (by
    passing in the pre-allocated page in the vm_fault structure), and allow
    the handler to perform the COW operation itself.

    The handler cannot insert the page itself if there is already a read-only
    mapping at that address, so allow the handler to return VM_FAULT_LOCKED
    and set the fault_page to be NULL. This indicates to the MM code that the
    i_mmap_lock is held instead of the page lock.
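
    A rough sketch of the handler-side COW path described above (a
    fragment under stated assumptions: vmf->cow_page carries the
    pre-allocated page; block_is_written and copy_from_storage() are
    hypothetical stand-ins for the filesystem's own logic):

        if (vmf->cow_page) {
                if (block_is_written)
                        copy_from_storage(vmf->cow_page, vmf->virtual_address);
                else
                        clear_user_highpage(vmf->cow_page,
                                            (unsigned long)vmf->virtual_address);
                vmf->page = NULL;        /* no page lock to hand back */
                return VM_FAULT_LOCKED;  /* i_mmap_lock held instead */
        }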

    Signed-off-by: Matthew Wilcox
    Acked-by: Kirill A. Shutemov
    Cc: Andreas Dilger
    Cc: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Jens Axboe
    Cc: Mathieu Desnoyers
    Cc: Randy Dunlap
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • DAX is a replacement for the variation of XIP currently supported by the
    ext2 filesystem. We have three different things in the tree called 'XIP',
    and the new focus is on access to data rather than executables, so a name
    change was in order. DAX stands for Direct Access. The X is for
    eXciting.

    The new focus on data access has resulted in more careful attention to
    races that exist in the current XIP code, but are not hit by the use-case
    that it was designed for. XIP's architecture worked fine for ext2, but
    DAX is architected to work with modern filesystems such as ext4 and XFS.
    DAX is not intended for use with btrfs; the value that btrfs adds relies
    on manipulating data and writing data to different locations, while DAX's
    value is for write-in-place and keeping the kernel from touching the data.

    DAX was developed in order to support NV-DIMMs, but it's become clear that
    its usefulness extends beyond NV-DIMMs and there are several potential
    customers including the tracing machinery. Other people want to place the
    kernel log in an area of memory, as long as they have a BIOS that does not
    clear DRAM on reboot.

    Patch 1 is a bug fix, probably worth including in 3.18.

    Patches 2 & 3 are infrastructure for DAX.

    Patches 4-8 replace the XIP code with its DAX equivalents, transforming
    ext2 to use the DAX code as we go. Note that patch 10 is the
    Documentation patch.

    Patches 9-15 clean up after the XIP code, removing the infrastructure
    that is no longer needed and renaming various XIP things to DAX.
    Most of these patches were added after Jan found things he didn't
    like in an earlier version of the ext4 patch ... that had been copied
    from ext2. So ext2 is being transformed to do things the same way that
    ext4 will later. The ability to mount ext2 filesystems with the 'xip'
    option is retained, although the 'dax' option is now preferred.

    Patch 16 adds some DAX infrastructure to support ext4.

    Patch 17 adds DAX support to ext4. It is broadly similar to ext2's DAX
    support, but it is more efficient than ext2's due to its support for
    unwritten extents.

    Patch 18 is another cleanup patch renaming XIP to DAX.

    My thanks to Mathieu Desnoyers for his reviews of the v11 patchset. Most
    of the changes below were based on his feedback.

    This patch (of 18):

    Pagecache faults recheck i_size after taking the page lock to ensure that
    the fault didn't race against a truncate. We don't have a page to lock in
    the XIP case, so use i_mmap_lock_read() instead. It is locked in the
    truncate path in unmap_mapping_range() after updating i_size. So while we
    hold it in the fault path, we are guaranteed that either i_size has
    already been updated in the truncate path, or that the truncate will
    subsequently call zap_page_range_single() and so remove the mapping we
    have just inserted.

    There is a window of time in which i_size has been reduced and the thread
    has a mapping to a page which will be removed from the file, but this is
    harmless as the page will not be allocated to a different purpose before
    the thread's access to it is revoked.
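
    A condensed sketch of the recheck pattern described above (a fragment;
    error handling and surrounding declarations elided):

        i_mmap_lock_read(mapping);
        size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
        if (vmf->pgoff >= size) {
                i_mmap_unlock_read(mapping);
                return VM_FAULT_SIGBUS;   /* lost the race with truncate */
        }
        /* ... insert the mapping while truncation is excluded ... */
        i_mmap_unlock_read(mapping);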

    [akpm@linux-foundation.org: switch to i_mmap_lock_read(), add comment in unmap_single_vma()]
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Mathieu Desnoyers
    Cc: Andreas Dilger
    Cc: Boaz Harrosh
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: Randy Dunlap
    Cc: Ross Zwisler
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

15 Feb, 2015

1 commit

  • Pull ACCESS_ONCE() rule tightening from Christian Borntraeger:
    "Tighten rules for ACCESS_ONCE

    This series tightens the rules for ACCESS_ONCE to only work on scalar
    types. It also contains the necessary fixups as indicated by build
    bots of linux-next. Now everything is in place to prevent new
    non-scalar users of ACCESS_ONCE and we can continue to convert code to
    READ_ONCE/WRITE_ONCE"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
    kernel: Fix sparse warning for ACCESS_ONCE
    next: sh: Fix compile error
    kernel: tighten rules for ACCESS ONCE
    mm/gup: Replace ACCESS_ONCE with READ_ONCE
    x86/spinlock: Leftover conversion ACCESS_ONCE->READ_ONCE
    x86/xen/p2m: Replace ACCESS_ONCE with READ_ONCE
    ppc/hugetlbfs: Replace ACCESS_ONCE with READ_ONCE
    ppc/kvm: Replace ACCESS_ONCE with READ_ONCE
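
    An illustrative before/after of the rule change (a sketch, not taken
    from the series):

        /* old: ACCESS_ONCE() now fails to build on non-scalar types
         * such as pte_t on some configs */
        pte_t pte = ACCESS_ONCE(*ptep);

        /* new: READ_ONCE()/WRITE_ONCE() handle any type */
        pte_t pte = READ_ONCE(*ptep);
        WRITE_ONCE(p->counter, 1);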

    Linus Torvalds
     

14 Feb, 2015

17 commits

    This feature lets us detect out-of-bounds accesses to global variables.
    It works for globals in the kernel image as well as for globals in
    modules. Currently it won't work for symbols in user-specified sections
    (e.g. __init, __read_mostly, ...).

    The idea of this is simple. The compiler grows each global variable by
    the redzone size and adds constructors invoking the
    __asan_register_globals() function. Information about each global
    variable (address, size, size with redzone, ...) is passed to
    __asan_register_globals() so we can poison the variable's redzone.

    This patch also forces module_alloc() to return addresses aligned to
    8*PAGE_SIZE, making shadow memory handling
    (kasan_module_alloc()/kasan_module_free()) simpler. Such alignment
    guarantees that each shadow page backing the modules' address space
    corresponds to only one module_alloc() allocation.

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
    For instrumenting global variables, KASan will shadow the memory backing
    modules. So on module load we will need to allocate shadow memory and
    map it at the shadow address corresponding to the address returned by
    module_alloc().

    __vmalloc_node_range() could be used for this purpose, except that it
    puts a guard hole after the allocated area. A guard hole in shadow
    memory would be a problem, because at some future point we might need
    shadow memory at the address occupied by the guard hole; we could then
    fail to allocate shadow for module_alloc().

    Now we have the VM_NO_GUARD flag disabling the guard page, so we need to
    pass it into __vmalloc_node_range(). Add a new parameter 'vm_flags' to
    the __vmalloc_node_range() function.
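
    The resulting prototype, as far as the description implies (a sketch;
    parameter order per the patch):

        void *__vmalloc_node_range(unsigned long size, unsigned long align,
                        unsigned long start, unsigned long end, gfp_t gfp_mask,
                        pgprot_t prot, unsigned long vm_flags, int node,
                        const void *caller);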

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
    For instrumenting global variables, KASan will shadow the memory backing
    modules. So on module load we will need to allocate shadow memory and
    map it at the shadow address corresponding to the address returned by
    module_alloc().

    __vmalloc_node_range() could be used for this purpose, except that it
    puts a guard hole after the allocated area. A guard hole in shadow
    memory would be a problem, because at some future point we might need
    shadow memory at the address occupied by the guard hole; we could then
    fail to allocate shadow for module_alloc().

    Add a new vm_struct flag 'VM_NO_GUARD' indicating that the vm area
    doesn't have a guard hole.
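
    A sketch of the flag's intended effect on vmalloc's size accounting
    (modelled on get_vm_area_size(); treat as an approximation):

        static inline unsigned long get_vm_area_size(const struct vm_struct *area)
        {
                if (!(area->flags & VM_NO_GUARD))
                        /* return actual size without guard page */
                        return area->size - PAGE_SIZE;
                return area->size;
        }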

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
    Stack instrumentation allows detecting out-of-bounds memory accesses
    for variables allocated on the stack. The compiler adds redzones around
    every variable on the stack and poisons the redzones in the function's
    prologue.

    This approach significantly increases stack usage, so all in-kernel
    stack sizes were doubled.
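
    On x86_64 the doubling looks roughly like this (a sketch of the arch's
    thread-size definition under KASAN):

        #ifdef CONFIG_KASAN
        #define KASAN_STACK_ORDER 1
        #else
        #define KASAN_STACK_ORDER 0
        #endif
        #define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)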

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
    GCC 5.0 recently removed instrumentation of calls to builtin functions.
    To check the memory accessed by such functions, userspace asan always
    uses interceptors for them.

    So now we should do this as well. This patch declares
    memset/memmove/memcpy as weak symbols. In mm/kasan/kasan.c we have our
    own implementations of those functions which check memory before
    accessing it.

    The default memset/memmove/memcpy now always have aliases with a '__'
    prefix. For files built without kasan instrumentation (e.g. mm/slub.c)
    the original mem* calls are replaced (via #define) with the prefixed
    variants, because we don't want to check memory accesses there.
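
    A sketch of one such interceptor (modelled on mm/kasan/kasan.c;
    check_memory_region() is kasan's access checker):

        #undef memset
        void *memset(void *addr, int c, size_t len)
        {
                check_memory_region((unsigned long)addr, len, true);
                return __memset(addr, c, len);
        }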

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
    kmalloc internally rounds up the allocation size, and kmemleak uses the
    rounded-up size as the object's size. This makes kasan complain while
    kmemleak scans memory or calculates an object's checksum. The simplest
    solution here is to disable kasan.

    Signed-off-by: Andrey Ryabinin
    Acked-by: Catalin Marinas
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
    With this patch kasan will be able to catch bugs in memory allocated by
    slub. Initially, all objects in a newly allocated slab page are marked
    as redzone. Later, when a slub object is allocated, the number of bytes
    requested by the caller is marked as accessible, and the rest of the
    object (including slub's metadata) is marked as redzone (inaccessible).

    We also mark an object as accessible if ksize was called for it. There
    are some places in the kernel where the ksize function is called to
    inquire the size of the really allocated area. Such callers can validly
    access the whole allocated memory, so it should be marked as
    accessible.

    Code in the slub.c and slab_common.c files can validly access an
    object's metadata, so instrumentation for these files is disabled.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Dmitry Chernenkov
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
    It's ok for slub to access memory that is marked by kasan as
    inaccessible (object metadata). Kasan shouldn't print a report in that
    case because these accesses are valid. Disabling instrumentation of the
    slub.c code is not enough to achieve this, because slub passes pointers
    to object metadata into external functions like memchr_inv().

    We don't want to disable instrumentation for memchr_inv() because it is
    a quite generic function, and we don't want to miss bugs there.

    metadata_access_enable/metadata_access_disable are used to tell KASan
    where accesses to metadata start and end, so we can temporarily disable
    KASan reports.
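
    A sketch of the helper pair (as the description suggests, they bracket
    metadata accesses by toggling kasan's per-task reporting):

        static inline void metadata_access_enable(void)
        {
                kasan_disable_current();
        }

        static inline void metadata_access_disable(void)
        {
                kasan_enable_current();
        }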

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
    Remove static and add function declarations to linux/slub_def.h so they
    can be used by the kernel address sanitizer.

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Add kernel address sanitizer hooks to mark allocated page's addresses as
    accessible in corresponding shadow region. Mark freed pages as
    inaccessible.
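
    A sketch of the hook pair (an approximation; highmem pages have no
    shadow, hence the PageHighMem() check):

        void kasan_alloc_pages(struct page *page, unsigned int order)
        {
                if (likely(!PageHighMem(page)))
                        kasan_unpoison_shadow(page_address(page),
                                              PAGE_SIZE << order);
        }

        void kasan_free_pages(struct page *page, unsigned int order)
        {
                if (likely(!PageHighMem(page)))
                        kasan_poison_shadow(page_address(page),
                                            PAGE_SIZE << order,
                                            KASAN_FREE_PAGE);
        }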

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
    Currently memory hotplug won't work with KASan. As we don't have shadow
    for hotplugged memory, the kernel will crash on the first access to it.
    To make this work we will need to allocate shadow for the new memory.

    At some future point proper memory hotplug support will be implemented.
    Until then, print a warning at startup and disable memory hot-add.

    Signed-off-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrey Konovalov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
    Kernel Address sanitizer (KASan) is a dynamic memory error detector. It
    provides a fast and comprehensive solution for finding use-after-free
    and out-of-bounds bugs.

    KASAN uses compile-time instrumentation to check every memory access;
    therefore GCC > v4.9.2 is required. v4.9.2 almost works, but has issues
    with putting symbol aliases into the wrong section, which breaks kasan
    instrumentation of globals.

    This patch only adds the infrastructure for the kernel address
    sanitizer. It's not available for use yet. The idea and some code were
    borrowed from [1].

    Basic idea:

    The main idea of KASAN is to use shadow memory to record whether each
    byte of memory is safe to access or not, and to use the compiler's
    instrumentation to check the shadow memory on each memory access.

    Address sanitizer uses 1/8 of the memory addressable in kernel for shadow
    memory and uses direct mapping with a scale and offset to translate a
    memory address to its corresponding shadow address.

    Here is the function that translates an address to its corresponding
    shadow address:

        unsigned long kasan_mem_to_shadow(unsigned long addr)
        {
                return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
        }

    where KASAN_SHADOW_SCALE_SHIFT = 3.
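
    As a worked example (assuming x86_64's shadow offset of
    0xdffffc0000000000): the shadow byte for the start of the direct
    mapping at 0xffff880000000000 lives at
    (0xffff880000000000 >> 3) + 0xdffffc0000000000 = 0xffffed0000000000.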

    So for every 8 bytes there is one corresponding byte of shadow memory.
    The following encoding is used for each shadow byte: 0 means that all 8
    bytes of the corresponding memory region are valid for access; k (1 <=
    k <= 7) means that the first k bytes are valid for access and the other
    (8 - k) bytes are not; any negative value indicates that the entire
    8-byte word is inaccessible, with different negative values used to
    distinguish between kinds of inaccessible memory such as redzones and
    freed memory (see mm/kasan/kasan.h).

    Signed-off-by: Andrey Ryabinin
    Acked-by: Michal Marek
    Signed-off-by: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Cc: Yuri Gribov
    Cc: Konstantin Khlebnikov
    Cc: Sasha Levin
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • printk and friends can now format bitmaps using '%*pb[l]'. cpumask
    and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
    respectively which can be used to generate the two printf arguments
    necessary to format the specified cpu/nodemask.
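
    An illustrative use (not from the commit itself):

        printk("allowed CPUs: %*pbl\n", cpumask_pr_args(mask));
        printk("mems:         %*pb\n",  nodemask_pr_args(&nmask));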

    Signed-off-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • printk and friends can now format bitmaps using '%*pb[l]'. cpumask
    and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
    respectively which can be used to generate the two printf arguments
    necessary to format the specified cpu/nodemask.

    * This is an equivalent conversion, but the whole function should be
    converted to use the scnprintf family of functions rather than
    performing custom output length predictions in multiple places.

    Signed-off-by: Tejun Heo
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • printk and friends can now format bitmaps using '%*pb[l]'. cpumask
    and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
    respectively which can be used to generate the two printf arguments
    necessary to format the specified cpu/nodemask.

    Signed-off-by: Tejun Heo
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
    slab frequently duplicates strings located in the read-only memory
    section. Replacing kstrdup with kstrdup_const allows such operations to
    be avoided.

    [akpm@linux-foundation.org: make the handling of kmem_cache.name const-correct]
    Signed-off-by: Andrzej Hajda
    Cc: Marek Szyprowski
    Cc: Kyungmin Park
    Cc: Mike Turquette
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Cc: Greg KH
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrzej Hajda
     
    kstrdup() is often used to duplicate strings where neither the source
    nor the destination will ever be modified. In such cases we can just
    reuse the source instead of duplicating it. The problem is that we must
    be sure that the source is non-modifiable and that its lifetime is long
    enough.

    I suspect good candidates for such strings are those located in the
    kernel's .rodata section: they cannot be modified because the section
    is read-only, and their lifetime is equal to the kernel's lifetime.

    This small patchset proposes an alternative version of kstrdup -
    kstrdup_const - which returns the source string if it is located in
    .rodata and otherwise falls back to kstrdup. To verify whether the
    source is in .rodata, the function checks if the address is between the
    sentinels __start_rodata and __end_rodata. I guess it should work with
    all architectures.

    The main patch is accompanied by four patches constifying kstrdup for
    cases where the situation described above happens frequently.

    I have tested the patchset on a mobile platform (exynos4210-trats) and
    it saves 3272 string allocations. Since the minimal allocation is 32 or
    64 bytes depending on Kconfig options, the patchset saves about 100KB
    or 200KB of memory respectively.

    Stats from tested platform show that the main offender is sysfs:

    By caller:
    2260 __kernfs_new_node
    631 clk_register+0xc8/0x1b8
    318 clk_register+0x34/0x1b8
    51 kmem_cache_create
    12 alloc_vfsmnt

    By string (with count >= 5):
    883 power
    876 subsystem
    135 parameters
    132 device
    61 iommu_group
    ...

    This patch (of 5):

    Add an alternative version of kstrdup which returns a pointer to a
    constant char array. The function checks if the input string is in a
    persistent, read-only memory section; if yes it returns the input
    string, otherwise it falls back to kstrdup.

    kstrdup_const is accompanied by kfree_const performing conditional memory
    deallocation of the string.
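
    A sketch of the pair (is_kernel_rodata() being the sentinel check
    described above):

        const char *kstrdup_const(const char *s, gfp_t gfp)
        {
                if (is_kernel_rodata((unsigned long)s))
                        return s;       /* reuse the .rodata string */
                return kstrdup(s, gfp);
        }

        void kfree_const(const void *x)
        {
                if (!is_kernel_rodata((unsigned long)x))
                        kfree(x);       /* only free real duplicates */
        }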

    Signed-off-by: Andrzej Hajda
    Cc: Marek Szyprowski
    Cc: Kyungmin Park
    Cc: Mike Turquette
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Cc: Greg KH
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrzej Hajda
     

13 Feb, 2015

10 commits

  • Merge third set of updates from Andrew Morton:

    - the rest of MM

    [ This includes getting rid of the numa hinting bits, in favor of
    just generic protnone logic. Yay. - Linus ]

    - core kernel

    - procfs

    - some of lib/ (lots of lib/ material this time)

    * emailed patches from Andrew Morton: (104 commits)
    lib/lcm.c: replace include
    lib/percpu_ida.c: remove redundant includes
    lib/strncpy_from_user.c: replace module.h include
    lib/stmp_device.c: replace module.h include
    lib/sort.c: move include inside #if 0
    lib/show_mem.c: remove redundant include
    lib/radix-tree.c: change to simpler include
    lib/plist.c: remove redundant include
    lib/nlattr.c: remove redundant include
    lib/kobject_uevent.c: remove redundant include
    lib/llist.c: remove redundant include
    lib/md5.c: simplify include
    lib/list_sort.c: rearrange includes
    lib/genalloc.c: remove redundant include
    lib/idr.c: remove redundant include
    lib/halfmd4.c: simplify includes
    lib/dynamic_queue_limits.c: simplify includes
    lib/sort.c: use simpler includes
    lib/interval_tree.c: simplify includes
    hexdump: make it return number of bytes placed in buffer
    ...

    Linus Torvalds
     
    Keeping zsmalloc fragmentation at a low level is our target. But now we
    still need to add debug code to zsmalloc to get quantitative data.

    This patch adds a new configuration option CONFIG_ZSMALLOC_STAT to
    enable statistics collection for developers. Currently only the object
    statistics in each class are collected. Users can get the information
    via debugfs.

    cat /sys/kernel/debug/zsmalloc/zram0/...

    For example:

    After I copied "jdk-8u25-linux-x64.tar.gz" to zram with ext4 filesystem:
    class size obj_allocated obj_used pages_used
    0 32 0 0 0
    1 48 256 12 3
    2 64 64 14 1
    3 80 51 7 1
    4 96 128 5 3
    5 112 73 5 2
    6 128 32 4 1
    7 144 0 0 0
    8 160 0 0 0
    9 176 0 0 0
    10 192 0 0 0
    11 208 0 0 0
    12 224 0 0 0
    13 240 0 0 0
    14 256 16 1 1
    15 272 15 9 1
    16 288 0 0 0
    17 304 0 0 0
    18 320 0 0 0
    19 336 0 0 0
    20 352 0 0 0
    21 368 0 0 0
    22 384 0 0 0
    23 400 0 0 0
    24 416 0 0 0
    25 432 0 0 0
    26 448 0 0 0
    27 464 0 0 0
    28 480 0 0 0
    29 496 33 1 4
    30 512 0 0 0
    31 528 0 0 0
    32 544 0 0 0
    33 560 0 0 0
    34 576 0 0 0
    35 592 0 0 0
    36 608 0 0 0
    37 624 0 0 0
    38 640 0 0 0
    40 672 0 0 0
    42 704 0 0 0
    43 720 17 1 3
    44 736 0 0 0
    46 768 0 0 0
    49 816 0 0 0
    51 848 0 0 0
    52 864 14 1 3
    54 896 0 0 0
    57 944 13 1 3
    58 960 0 0 0
    62 1024 4 1 1
    66 1088 15 2 4
    67 1104 0 0 0
    71 1168 0 0 0
    74 1216 0 0 0
    76 1248 0 0 0
    83 1360 3 1 1
    91 1488 11 1 4
    94 1536 0 0 0
    100 1632 5 1 2
    107 1744 0 0 0
    111 1808 9 1 4
    126 2048 4 4 2
    144 2336 7 3 4
    151 2448 0 0 0
    168 2720 15 15 10
    190 3072 28 27 21
    202 3264 0 0 0
    254 4096 36209 36209 36209

    Total 37022 36326 36288

    We can calculate the overall fragmentation from the last line:
    Total 37022 36326 36288
    (37022 - 36326) / 37022 = 1.87%

    Also, by analysing the objects allocated in every class we know why we
    got such low fragmentation: most of the allocated objects are in class
    254, and a zspage in class 254 consists of only 1 page. So no
    fragmentation will be introduced by allocating objects in class 254.

    And in future, we can collect other zsmalloc statistics as we need and
    analyse them.

    Signed-off-by: Ganesh Mahendran
    Suggested-by: Minchan Kim
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Seth Jennings
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
    Currently the underlying implementations of zpool - zsmalloc/zbud - do
    not know who creates them. There is no method to let zsmalloc/zbud find
    out which caller they belong to.

    Now we want to add statistics collection in zsmalloc, and we need to
    name the debugfs dir for each pool created. The way suggested by
    Minchan Kim is to use a name passed by the caller (such as zram) to
    create the zsmalloc pool:

    /sys/kernel/debug/zsmalloc/zram0

    This patch adds an argument `name' to zs_create_pool() and other related
    functions.
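
    An illustrative call with the new argument (the flags value is only an
    assumption about the caller):

        struct zs_pool *pool = zs_create_pool("zram0", GFP_NOIO | __GFP_HIGHMEM);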

    Signed-off-by: Ganesh Mahendran
    Acked-by: Minchan Kim
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • The vmstat interfaces are good at hiding negative counts (at least when
    CONFIG_SMP); but if you peer behind the curtain, you find that
    nr_isolated_anon and nr_isolated_file soon go negative, and grow ever
    more negative: so they can absorb larger and larger numbers of isolated
    pages, yet still appear to be zero.

    I'm happy to avoid a congestion_wait() when too_many_isolated() myself;
    but I guess it's there for a good reason, in which case we ought to get
    too_many_isolated() working again.

    The imbalance comes from isolate_migratepages()'s ISOLATE_ABORT case:
    putback_movable_pages() decrements the NR_ISOLATED counts, but we forgot
    to call acct_isolated() to increment them.

    It is possible that the bug which this patch fixes could cause OOM
    kills when the system still has a lot of reclaimable page cache.

    Fixes: edc2ca612496 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
    Signed-off-by: Hugh Dickins
    Acked-by: Vlastimil Babka
    Acked-by: Joonsoo Kim
    Cc: [3.18+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    A race condition starts to be visible in recent mmotm, where a
    PG_hwpoison flag is set on a migration source page *before* it's back
    in the buddy page pool.

    This is problematic because no page flag is supposed to be set when
    freeing (see __free_one_page().) So the user-visible effect of this
    race is that it could trigger the BUG_ON() when soft-offlining is
    called.

    The root cause is that we call lru_add_drain_all() to make sure that
    the page is in buddy, but that doesn't work because this function just
    schedules a work item and doesn't wait for its completion.
    drain_all_pages() does the draining directly, so simply dropping
    lru_add_drain_all() solves this problem.

    Fixes: f15bdfa802bf ("mm/memory-failure.c: fix memory leak in successful soft offlining")
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Chen Gong
    Cc: [3.11+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Add a necessary 'leave'.

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
    For whatever reason, generic_access_phys() only remaps one page, but
    actually allows access of arbitrary size. It's quite easy to trigger
    large reads, like printing out a large structure with gdb, which leads
    to a crash. Fix it by remapping the correct size.

    Fixes: 28b2ee20c7cb ("access_process_vm device memory infrastructure")
    Signed-off-by: Grazvydas Ignotas
    Cc: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Grazvydas Ignotas
     
  • mminit_loglevel is only referenced from __init and __meminit functions, so
    we can mark it __meminitdata.

    Signed-off-by: Rasmus Villemoes
    Cc: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Vishnu Pratap Singh
    Cc: Pintu Kumar
    Cc: Michal Nazarewicz
    Cc: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Tim Chen
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • The only caller of mminit_verify_zonelist is build_all_zonelists_init,
    which is annotated with __init, so it should be safe to also mark the
    former as __init, saving ~400 bytes of .text.

    Signed-off-by: Rasmus Villemoes
    Cc: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Vishnu Pratap Singh
    Cc: Pintu Kumar
    Cc: Michal Nazarewicz
    Cc: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Tim Chen
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
    Pulling the code protected by if (system_state == SYSTEM_BOOTING) into
    its own helper allows us to shrink .text a little. This relies on
    build_all_zonelists already having a __ref annotation. Add a comment
    explaining why, so one doesn't have to track it down through git log.

    The real saving comes in 3/5 ("mm/mm_init.c: mark mminit_verify_zonelist
    as __init"), where we save about 400 bytes.
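
    A sketch of the extracted helper (names per the patch description;
    treat the body as an approximation):

        static noinline void __init build_all_zonelists_init(void)
        {
                __build_all_zonelists(NULL);
                mminit_verify_zonelist();
                cpuset_init_current_mems_allowed();
        }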

    Signed-off-by: Rasmus Villemoes
    Cc: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Vishnu Pratap Singh
    Cc: Pintu Kumar
    Cc: Michal Nazarewicz
    Cc: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Tim Chen
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes