15 Mar, 2017

1 commit

  • Pull percpu fixes from Tejun Heo:

    - the allocation path was updating pcpu_nr_empty_pop_pages without the
    required locking, which can lead to incorrect handling of empty chunks
    (e.g. keeping too many around). This is a bug, but it shouldn't lead
    to critical failures. Fixed by adding the locking.

    - a trivial patch to drop an unused param from pcpu_get_pages()

    * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu: remove unused chunk_alloc parameter from pcpu_get_pages()
    percpu: acquire pcpu_lock when updating pcpu_nr_empty_pop_pages
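
The locking fix follows a common pattern: a counter shared between several paths is only modified while the lock that protects it is held. A minimal userspace sketch of that pattern, assuming a pthread mutex as a stand-in for the kernel's pcpu_lock (a spinlock) and a hypothetical helper name; it is not the actual patch:

```c
#include <pthread.h>

/* Userspace stand-ins for pcpu_lock and pcpu_nr_empty_pop_pages. */
static pthread_mutex_t pcpu_lock = PTHREAD_MUTEX_INITIALIZER;
static int nr_empty_pop_pages;

/*
 * Hypothetical helper: apply a delta to the shared counter with the
 * lock held, and return the value observed under the lock.
 */
int empty_pop_pages_add(int delta)
{
    int now;

    pthread_mutex_lock(&pcpu_lock);
    nr_empty_pop_pages += delta;
    now = nr_empty_pop_pages;
    pthread_mutex_unlock(&pcpu_lock);
    return now;
}
```

Every path that touches the counter would go through the same lock, so concurrent updates cannot be lost.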

    Linus Torvalds
     

13 Mar, 2017

1 commit

  • gup_p4d_range() should call gup_pud_range(), not itself.

    [ This was not noticed on x86: this is the HAVE_GENERIC_RCU_GUP code
    used by arm[64] and powerpc - Linus ]
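
The class of bug is easy to see in a toy walker: each level of a page table walk has to hand its entries to the next lower level. The sketch below uses invented two-level tables and hypothetical function names; it only illustrates the call structure, not the kernel's gup code:

```c
#include <stddef.h>

#define ENTRIES 4

/* Toy tables: the upper level holds pointers to lower-level tables. */
struct pud_table { int present[ENTRIES]; };
struct p4d_table { struct pud_table *pud[ENTRIES]; };

/* Lowest level in this sketch: count the present entries. */
static int walk_pud_range(const struct pud_table *pud)
{
    int i, n = 0;

    for (i = 0; i < ENTRIES; i++)
        n += pud->present[i] ? 1 : 0;
    return n;
}

/* The upper level must descend into walk_pud_range(), never into itself. */
int walk_p4d_range(const struct p4d_table *p4d)
{
    int i, n = 0;

    for (i = 0; i < ENTRIES; i++)
        if (p4d->pud[i] != NULL)
            n += walk_pud_range(p4d->pud[i]);
    return n;
}
```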

    Fixes: c2febafc6773 ("mm: convert generic code to 5-level paging")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Chris Packham
    Reported-by: Anton Blanchard
    Acked-by: Michal Hocko
    Acked-by: Mark Rutland
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

11 Mar, 2017

2 commits

  • Merge 5-level page table prep from Kirill Shutemov:
    "Here's the relatively low-risk part of the 5-level paging patchset.
    Merging it now will make enabling x86 5-level paging in v4.12 easier.

    The first patch is actually x86-specific: detect 5-level paging
    support. It boils down to a single define.

    The rest of the patchset converts the Linux MMU abstraction from 4- to
    5-level paging.

    Enabling the new abstraction in most cases requires adding a single
    line of code in arch-specific code. The rest is taken care of by
    asm-generic/.

    Changes to mm/ code are mostly mechanical: add support for new page
    table level -- p4d_t -- where we deal with pud_t now.

    v2:
    - fix build on microblaze (Michal);
    - comment for __ARCH_HAS_5LEVEL_HACK in kasan_populate_zero_shadow();
    - acks from Michal"

    * emailed patches from Kirill A Shutemov:
    mm: introduce __p4d_alloc()
    mm: convert generic code to 5-level paging
    asm-generic: introduce
    arch, mm: convert all architectures to use 5level-fixup.h
    asm-generic: introduce __ARCH_USE_5LEVEL_HACK
    asm-generic: introduce 5level-fixup.h
    x86/cpufeature: Add 5-level paging detection

    Linus Torvalds
     
  • Merge fixes from Andrew Morton:
    "26 fixes"

    * emailed patches from Andrew Morton: (26 commits)
    userfaultfd: remove wrong comment from userfaultfd_ctx_get()
    fat: fix using uninitialized fields of fat_inode/fsinfo_inode
    sh: cayman: IDE support fix
    kasan: fix races in quarantine_remove_cache()
    kasan: resched in quarantine_remove_cache()
    mm: do not call mem_cgroup_free() from within mem_cgroup_alloc()
    thp: fix another corner case of munlock() vs. THPs
    rmap: fix NULL-pointer dereference on THP munlocking
    mm/memblock.c: fix memblock_next_valid_pfn()
    userfaultfd: selftest: vm: allow to build in vm/ directory
    userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED
    userfaultfd: non-cooperative: fix fork fctx->new memleak
    mm/cgroup: avoid panic when init with low memory
    drivers/md/bcache/util.h: remove duplicate inclusion of blkdev.h
    mm/vmstats: add thp_split_pud event for clarity
    include/linux/fs.h: fix unsigned enum warning with gcc-4.2
    userfaultfd: non-cooperative: release all ctx in dup_userfaultfd_complete
    userfaultfd: non-cooperative: robustness check
    userfaultfd: non-cooperative: rollback userfaultfd_exit
    x86, mm: unify exit paths in gup_pte_range()
    ...

    Linus Torvalds
     

10 Mar, 2017

12 commits

  • quarantine_remove_cache() frees all pending objects that belong to the
    cache before we destroy the cache itself. However, there are currently
    two ways it can fail to do so.

    First, another thread can hold some of the objects from the cache in
    a temporary list in quarantine_put(). quarantine_put() has a window of
    enabled interrupts, and on_each_cpu() in quarantine_remove_cache() can
    finish right in that window. These objects will later be freed into the
    destroyed cache.

    Second, quarantine_reduce() has the same problem. It grabs a batch of
    objects from the global quarantine, then unlocks quarantine_lock and
    then frees the batch. quarantine_remove_cache() can finish while some
    objects from the cache are still in the local to_free list in
    quarantine_reduce().

    Fix the race with quarantine_put() by disabling interrupts for the whole
    duration of quarantine_put(). In combination with on_each_cpu() in
    quarantine_remove_cache(), this ensures that quarantine_remove_cache()
    sees the objects either in the per-cpu list or in the global list.

    Fix the race with quarantine_reduce() by protecting quarantine_reduce()
    with an SRCU read-side critical section and then calling
    synchronize_srcu() at the end of quarantine_remove_cache().

    I've assessed how well synchronize_srcu() works in this case. On a
    4-CPU VM it blocks waiting for pending read-side critical sections in
    about 2-3% of cases, which looks good to me.

    I suspect that these races are the root cause of some GPFs that I
    episodically hit. Previously I did not have any explanation for them.

    BUG: unable to handle kernel NULL pointer dereference at 00000000000000c8
    IP: qlist_free_all+0x2e/0xc0 mm/kasan/quarantine.c:155
    PGD 6aeea067
    PUD 60ed7067
    PMD 0
    Oops: 0000 [#1] SMP KASAN
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 0 PID: 13667 Comm: syz-executor2 Not tainted 4.10.0+ #60
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    task: ffff88005f948040 task.stack: ffff880069818000
    RIP: 0010:qlist_free_all+0x2e/0xc0 mm/kasan/quarantine.c:155
    RSP: 0018:ffff88006981f298 EFLAGS: 00010246
    RAX: ffffea0000ffff00 RBX: 0000000000000000 RCX: ffffea0000ffff1f
    RDX: 0000000000000000 RSI: ffff88003fffc3e0 RDI: 0000000000000000
    RBP: ffff88006981f2c0 R08: ffff88002fed7bd8 R09: 00000001001f000d
    R10: 00000000001f000d R11: ffff88006981f000 R12: ffff88003fffc3e0
    R13: ffff88006981f2d0 R14: ffffffff81877fae R15: 0000000080000000
    FS: 00007fb911a2d700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000000000c8 CR3: 0000000060ed6000 CR4: 00000000000006f0
    Call Trace:
    quarantine_reduce+0x10e/0x120 mm/kasan/quarantine.c:239
    kasan_kmalloc+0xca/0xe0 mm/kasan/kasan.c:590
    kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:544
    slab_post_alloc_hook mm/slab.h:456 [inline]
    slab_alloc_node mm/slub.c:2718 [inline]
    kmem_cache_alloc_node+0x1d3/0x280 mm/slub.c:2754
    __alloc_skb+0x10f/0x770 net/core/skbuff.c:219
    alloc_skb include/linux/skbuff.h:932 [inline]
    _sctp_make_chunk+0x3b/0x260 net/sctp/sm_make_chunk.c:1388
    sctp_make_data net/sctp/sm_make_chunk.c:1420 [inline]
    sctp_make_datafrag_empty+0x208/0x360 net/sctp/sm_make_chunk.c:746
    sctp_datamsg_from_user+0x7e8/0x11d0 net/sctp/chunk.c:266
    sctp_sendmsg+0x2611/0x3970 net/sctp/socket.c:1962
    inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:761
    sock_sendmsg_nosec net/socket.c:633 [inline]
    sock_sendmsg+0xca/0x110 net/socket.c:643
    SYSC_sendto+0x660/0x810 net/socket.c:1685
    SyS_sendto+0x40/0x50 net/socket.c:1653

    I am not sure about backporting. The bug is quite hard to trigger; I've
    seen it a few times during our massive continuous testing (however, it
    could be the cause of some other episodic stray crashes, as it leads to
    memory corruption...). If it is triggered, the consequences are very
    bad -- almost certainly serious memory corruption. The fix is
    non-trivial and has a chance of introducing new bugs. I am also not
    sure how actively people use KASAN on older releases.

    [dvyukov@google.com: - sorted includes]
    Link: http://lkml.kernel.org/r/20170309094028.51088-1-dvyukov@google.com
    Link: http://lkml.kernel.org/r/20170308151532.5070-1-dvyukov@google.com
    Signed-off-by: Dmitry Vyukov
    Acked-by: Andrey Ryabinin
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • We see reported stalls/lockups in quarantine_remove_cache() on machines
    with large amounts of RAM. quarantine_remove_cache() needs to scan the
    whole quarantine in order to take out all objects belonging to the
    cache. The quarantine is currently 1/32 of RAM, e.g. on a machine with
    256GB of memory that will be 8GB. Moreover, quarantine scanning is a
    walk over an uncached linked list, which is slow.

    Add cond_resched() after scanning each non-empty batch of objects.
    Batches are deliberately kept to a reasonable size for quarantine_put().
    On a machine with 256GB of RAM we should have ~512 non-empty batches,
    each with 16MB of objects.
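
The sizes quoted above can be checked with a few lines of arithmetic. This is only a sketch of the calculation; the function names are hypothetical and it assumes every batch is completely full:

```c
#include <stdint.h>

#define QUARANTINE_FRACTION 32ULL   /* quarantine is 1/32 of RAM */

/* Total quarantine size in bytes for a given amount of RAM. */
uint64_t quarantine_bytes(uint64_t ram_bytes)
{
    return ram_bytes / QUARANTINE_FRACTION;
}

/* Number of batches, assuming each one holds batch_bytes of objects. */
uint64_t quarantine_batches(uint64_t ram_bytes, uint64_t batch_bytes)
{
    return quarantine_bytes(ram_bytes) / batch_bytes;
}
```

With 256GB of RAM and 16MB batches this gives an 8GB quarantine split into 512 batches, i.e. roughly 512 points at which the scan can reschedule.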

    Link: http://lkml.kernel.org/r/20170308154239.25440-1-dvyukov@google.com
    Signed-off-by: Dmitry Vyukov
    Acked-by: Andrey Ryabinin
    Cc: Greg Thelen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • mem_cgroup_free() indirectly calls wb_domain_exit() which is not
    prepared to deal with a struct wb_domain object that hasn't executed
    wb_domain_init(). For instance, the following warning message is
    printed by lockdep if alloc_percpu() fails in mem_cgroup_alloc():

    INFO: trying to register non-static key.
    the code is fine but needs lockdep annotation.
    turning off the locking correctness validator.
    CPU: 1 PID: 1950 Comm: mkdir Not tainted 4.10.0+ #151
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x67/0x99
    register_lock_class+0x36d/0x540
    __lock_acquire+0x7f/0x1a30
    lock_acquire+0xcc/0x200
    del_timer_sync+0x3c/0xc0
    wb_domain_exit+0x14/0x20
    mem_cgroup_free+0x14/0x40
    mem_cgroup_css_alloc+0x3f9/0x620
    cgroup_apply_control_enable+0x190/0x390
    cgroup_mkdir+0x290/0x3d0
    kernfs_iop_mkdir+0x58/0x80
    vfs_mkdir+0x10e/0x1a0
    SyS_mkdirat+0xa8/0xd0
    SyS_mkdir+0x14/0x20
    entry_SYSCALL_64_fastpath+0x18/0xad

    Add __mem_cgroup_free(), which skips wb_domain_exit(). It is used by
    both mem_cgroup_free() and the mem_cgroup_alloc() cleanup path.
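
The shape of the fix can be sketched in userspace: one helper only releases memory and is safe on a half-built object, while the full free routine first tears down the parts that were initialised. All names below are hypothetical stand-ins, not the memcg code:

```c
#include <stdlib.h>

struct domain { int initialized; };
struct group  { struct domain dom; int *stats; };

static int domain_exit_calls;   /* counts teardown calls, for illustration */

static void domain_init(struct domain *d) { d->initialized = 1; }
static void domain_exit(struct domain *d) { d->initialized = 0; domain_exit_calls++; }

/* Releases memory only; safe when the domain was never initialised. */
static void group_free_raw(struct group *g)
{
    free(g->stats);
    free(g);
}

/* Full teardown for a completely constructed object. */
void group_free(struct group *g)
{
    domain_exit(&g->dom);
    group_free_raw(g);
}

/* fail_stats simulates an allocation failure part-way through setup. */
struct group *group_alloc(int fail_stats)
{
    struct group *g = calloc(1, sizeof(*g));

    if (!g)
        return NULL;
    g->stats = fail_stats ? NULL : calloc(16, sizeof(int));
    if (!g->stats) {
        group_free_raw(g);      /* domain not initialised: skip domain_exit() */
        return NULL;
    }
    domain_init(&g->dom);
    return g;
}

int domain_exit_count(void) { return domain_exit_calls; }
```

The error path never calls the teardown routine for state that was never set up, which is the property the lockdep warning above was pointing at.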

    Fixes: 0b8f73e104285 ("mm: memcontrol: clean up alloc, online, offline, free functions")
    Link: http://lkml.kernel.org/r/20170306192122.24262-1-tahsin@google.com
    Signed-off-by: Tahsin Erdogan
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tahsin Erdogan
     
  • The following test case triggers BUG() in munlock_vma_pages_range():

    int main(int argc, char *argv[])
    {
        int fd;

        system("mount -t tmpfs -o huge=always none /mnt");
        fd = open("/mnt/test", O_CREAT | O_RDWR);
        ftruncate(fd, 4UL << 20);
        mmap(NULL, 4UL << 20, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED | MAP_LOCKED, fd, 0);
        mmap(NULL, 4096, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_LOCKED, fd, 0);
        munlockall();
        return 0;
    }

    The second mmap() creates a PTE mapping of the first huge page in the
    file. This makes the kernel munlock the page, as we never keep a
    PTE-mapped page mlocked.

    On munlockall(), when we handle the vma created by the first mmap(),
    munlock_vma_page() returns page_mask == 0, as the page is not mlocked
    anymore. On the next iteration follow_page_mask() returns a tail page,
    but page_mask is HPAGE_NR_PAGES - 1. This makes us skip to the first
    tail page of the next huge page and step on
    VM_BUG_ON_PAGE(PageMlocked(page)).

    The fix is to not use the page_mask from follow_page_mask() at all. It
    has no use for us.

    Link: http://lkml.kernel.org/r/20170302150252.34120-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The following test case triggers a NULL-pointer dereference in
    try_to_unmap_one():

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(int argc, char *argv[])
    {
        int fd;

        system("mount -t tmpfs -o huge=always none /mnt");
        fd = open("/mnt/test", O_CREAT | O_RDWR);
        ftruncate(fd, 2UL << 20);
        mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED | MAP_LOCKED, fd, 0);
        mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_LOCKED, fd, 0);
        munlockall();
        return 0;
    }

    Apparently, there's a case when we call try_to_unmap() on huge PMDs:
    it's TTU_MUNLOCK.

    Let's handle this case correctly.

    Fixes: c7ab0d2fdc84 ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()")
    Link: http://lkml.kernel.org/r/20170302151159.30592-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Obviously, we should not access memblock.memory.regions[right] if
    'right' is outside of [0, memblock.memory.cnt).
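
A minimal sketch of the bounds check, assuming a sorted array of non-overlapping regions expressed directly in pfns; the struct and function names are hypothetical and the real code works on memblock.memory with physical addresses:

```c
#include <stddef.h>

struct region { unsigned long start, end; };   /* valid pfns: [start, end) */

static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/* Return the next valid pfn after 'pfn', capped at max_pfn. */
unsigned long next_valid_pfn(const struct region *regs, size_t cnt,
                             unsigned long pfn, unsigned long max_pfn)
{
    size_t left = 0, right = cnt;
    unsigned long next = pfn + 1;

    while (left < right) {
        size_t mid = left + (right - left) / 2;

        if (next < regs[mid].start)
            right = mid;
        else if (next >= regs[mid].end)
            left = mid + 1;
        else
            return min_ul(next, max_pfn);   /* pfn + 1 lies inside a region */
    }

    /* right == cnt: the search ran past the last region, nothing to index. */
    if (right == cnt)
        return max_pfn;
    return min_ul(regs[right].start, max_pfn);
}
```

Without the `right == cnt` check, a pfn beyond the last region would read one element past the end of the array.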

    Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible")
    Link: http://lkml.kernel.org/r/20170303023745.9104-1-takahiro.akashi@linaro.org
    Signed-off-by: AKASHI Takahiro
    Cc: Paul Burton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    AKASHI Takahiro
     
  • userfaultfd_remove() has to be executed before zapping the pagetables,
    or UFFDIO_COPY could keep filling pages after zap_page_range() has
    returned, which would result in non-zero data after a MADV_DONTNEED.

    However userfaultfd_remove() may have to release the mmap_sem. This was
    handled correctly in MADV_REMOVE, but MADV_DONTNEED accessed a
    potentially stale vma (the very vma passed to zap_page_range(vma, ...)).

    The fix consists of revalidating the vma in case userfaultfd_remove()
    had to release the mmap_sem.

    This also optimizes away an unnecessary down_read/up_read in the
    MADV_REMOVE case if UFFD_EVENT_FORK had to be delivered.

    It all remains zero runtime cost in case CONFIG_USERFAULTFD=n as
    userfaultfd_remove() will be defined as "true" at build time.

    Link: http://lkml.kernel.org/r/20170302173738.18994-3-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The system may panic during initialisation when almost all of the
    memory is assigned to huge pages using the kernel command line
    parameter hugepages=xxxx. The panic may look like this:

    Unable to handle kernel paging request for data at address 0x00000000
    Faulting instruction address: 0xc000000000302b88
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 [ 0.082424] NUMA
    pSeries
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-15-generic #16-Ubuntu
    task: c00000021ed01600 task.stack: c00000010d108000
    NIP: c000000000302b88 LR: c000000000270e04 CTR: c00000000016cfd0
    REGS: c00000010d10b2c0 TRAP: 0300 Not tainted (4.9.0-15-generic)
    MSR: 8000000002009033 [ 0.082770] CR: 28424422 XER: 00000000
    CFAR: c0000000003d28b8 DAR: 0000000000000000 DSISR: 40000000 SOFTE: 1
    GPR00: c000000000270e04 c00000010d10b540 c00000000141a300 c00000010fff6300
    GPR04: 0000000000000000 00000000026012c0 c00000010d10b630 0000000487ab0000
    GPR08: 000000010ee90000 c000000001454fd8 0000000000000000 0000000000000000
    GPR12: 0000000000004400 c00000000fb80000 00000000026012c0 00000000026012c0
    GPR16: 00000000026012c0 0000000000000000 0000000000000000 0000000000000002
    GPR20: 000000000000000c 0000000000000000 0000000000000000 00000000024200c0
    GPR24: c0000000016eef48 0000000000000000 c00000010fff7d00 00000000026012c0
    GPR28: 0000000000000000 c00000010fff7d00 c00000010fff6300 c00000010d10b6d0
    NIP mem_cgroup_soft_limit_reclaim+0xf8/0x4f0
    LR do_try_to_free_pages+0x1b4/0x450
    Call Trace:
    do_try_to_free_pages+0x1b4/0x450
    try_to_free_pages+0xf8/0x270
    __alloc_pages_nodemask+0x7a8/0xff0
    new_slab+0x104/0x8e0
    ___slab_alloc+0x620/0x700
    __slab_alloc+0x34/0x60
    kmem_cache_alloc_node_trace+0xdc/0x310
    mem_cgroup_init+0x158/0x1c8
    do_one_initcall+0x68/0x1d0
    kernel_init_freeable+0x278/0x360
    kernel_init+0x24/0x170
    ret_from_kernel_thread+0x5c/0x74
    Instruction dump:
    eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 3d230001 e9499a42 3d220004
    3929acd8 794a1f24 7d295214 eac90100 2fa90000 419eff74 3b200000
    ---[ end trace 342f5208b00d01b6 ]---

    This is a chicken-and-egg issue: the kernel tries to free memory
    when allocating per-node data in mem_cgroup_init(), but that path
    calls mem_cgroup_soft_limit_reclaim(), which assumes that this data
    has already been allocated.

    As mem_cgroup_soft_limit_reclaim() is best effort, it should return
    when this data is not yet allocated.

    This patch also fixes potential null pointer access in
    mem_cgroup_remove_from_trees() and mem_cgroup_update_tree().

    Link: http://lkml.kernel.org/r/1487856999-16581-2-git-send-email-ldufour@linux.vnet.ibm.com
    Signed-off-by: Laurent Dufour
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Balbir Singh
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • We added support for PUD-sized transparent hugepages, but we count
    "thp split pud" events in the thp_split_pmd event counter.

    To separate the count of PUD splits from PMD splits, add a new event
    named thp_split_pud.

    Link: http://lkml.kernel.org/r/1488282380-5076-1-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Sebastian Siewior
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Cc: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: David Rientjes
    Cc: Hanjun Guo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     
  • Pull block fixes from Jens Axboe:
    "Sending this a bit sooner than I otherwise would have, as a fix in the
    merge window had some unfortunate issues and side effects for some
    folks.

    This contains:

    - Fixes from Jan for the bdi registration/unregistration. These have
    been tested by the various parties reporting issues, and should be
    solid at this point.

    - Also from Jan, fix for axonram gendisk registration.

    - A stable fix for zram from Johannes.

    - A small series from Ming, fixing up some long standing issues with
    blk-mq hardware queue kobject initialization and registration.

    - A fix for sed opal from Jon, fixing a nonsensical range check and
    some set-but-not-used variables.

    - A fix from Neil for a long standing deadlock issue for stacking
    device drivers. With this in place, dm/md don't have to work around
    the issue anymore, and can be properly fixed up"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    axonram: Fix gendisk handling
    blk: improve order of bio handling in generic_make_request()
    Revert "scsi, block: fix duplicate bdi name registration crashes"
    block: Make del_gendisk() safer for disks without queues
    bdi: Fix use-after-free in wb_congested_put()
    block: Allow bdi re-registration
    block/sed: Fix opal user range check and unused variables
    zram: set physical queue limits to avoid array out of bounds accesses
    blk-mq: free hctx->cpumask in release handler of hctx's kobject
    blk-mq: make lifetime consistent between hctx and its kobject
    blk-mq: make lifetime consitent between q/ctx and its kobject
    blk-mq: initialize mq kobjects in blk_mq_init_allocated_queue()

    Linus Torvalds
     
  • For full 5-level paging we need a helper to allocate a p4d page table.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Convert all non-architecture-specific code to 5-level paging.

    It's mostly mechanical: add handling for one more page table level in
    places where we deal with pud_t.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

09 Mar, 2017

3 commits

  • Commit 13ad59df67f1 ("mm, page_alloc: avoid page_to_pfn() when merging
    buddies") moved the check for memory holes out of page_is_buddy() and
    had the callers do the check.

    But this wasn't done correctly in one place which caused ia64 to crash
    very early in boot.

    Update to fix that and make ia64 boot again.

    [ v2: Vlastimil pointed out we don't need to call page_to_pfn()
    since we already have the result of that in "buddy_pfn" ]

    Fixes: 13ad59df67f1 ("avoid page_to_pfn() when merging buddies")
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Andrew Morton
    Signed-off-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Tony Luck
     
  • bdi_writeback_congested structures get created for each blkcg and bdi
    regardless of whether the bdi is registered or not. When they are
    created in an unregistered bdi and the request queue (and thus the bdi)
    is then destroyed while a blkg still holds a reference to the
    bdi_writeback_congested structure, this structure will be referencing a
    freed bdi, and the last wb_congested_put() will try to remove the
    structure from the already freed bdi.

    With commit 165a5e22fafb "block: Move bdi_unregister() to
    del_gendisk()", SCSI started to destroy bdis without calling
    bdi_unregister() first (previously it was calling bdi_unregister() even
    for unregistered bdis) and thus the code detaching
    bdi_writeback_congested in cgwb_bdi_destroy() was not triggered and we
    started hitting this use-after-free bug. It is enough to boot a KVM
    instance with virtio-scsi device to trigger this behavior.

    Fix the problem by detaching bdi_writeback_congested structures in
    bdi_exit() instead of bdi_unregister(). This is also more logical, as
    they can get attached to a bdi regardless of whether it ever got
    registered or not.

    Fixes: 165a5e22fafb127ecb5914e12e8c32a1f0d3f820
    Signed-off-by: Jan Kara
    Tested-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • SCSI can call device_add_disk() several times for one request queue when
    a device is unbound and bound, creating a new gendisk each time. This
    will lead to the bdi being repeatedly registered and unregistered. This
    was not a big problem until commit 165a5e22fafb "block: Move
    bdi_unregister() to del_gendisk()", since the bdi was only registered
    repeatedly (bdi_register() handles repeated calls fine, only we ended
    up leaking a reference to the gendisk due to overwriting bdi->owner)
    but unregistered only in blk_cleanup_queue(), which didn't get called
    repeatedly. After 165a5e22fafb we were doing correct bdi_register() -
    bdi_unregister() cycles; however, bdi_unregister() is not prepared for
    that. So make sure bdi_unregister() cleans up the bdi in such a way
    that it is prepared for a possible following bdi_register() call.

    An easy way to provoke this behavior is to enable
    CONFIG_DEBUG_TEST_DRIVER_REMOVE and use scsi_debug driver to create a
    scsi disk which immediately hangs without this fix.

    Fixes: 165a5e22fafb127ecb5914e12e8c32a1f0d3f820
    Signed-off-by: Jan Kara
    Tested-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jan Kara
     

07 Mar, 2017

2 commits


04 Mar, 2017

2 commits

  • Pull vfs 'statx()' update from Al Viro.

    This adds the new extended stat() interface that internally subsumes our
    previous stat interfaces, and allows user mode to specify in more detail
    what kind of information it wants.

    It also allows for some explicit synchronization information to be
    passed to the filesystem, which can be relevant for network filesystems:
    is the cached value ok, or do you need open/close consistency, or what?

    From David Howells.

    Andreas Dilger points out that the first version of the extended statx
    interface was posted June 29, 2010:

    https://www.spinics.net/lists/linux-fsdevel/msg33831.html

    * 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    statx: Add a system call to make enhanced file info available

    Linus Torvalds
     
  • Pull sched.h split-up from Ingo Molnar:
    "The point of these changes is to significantly reduce the
    header footprint, to speed up the kernel build and to
    have a cleaner header structure.

    After these changes the new <linux/sched.h>'s typical preprocessed
    size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
    lines), which is around 40% faster to build on typical configs.

    Not much changed from the last version (-v2) posted three weeks ago: I
    eliminated quirks, backmerged fixes plus I rebased it to an upstream
    SHA1 from yesterday that includes most changes queued up in -next plus
    all sched.h changes that were pending from Andrew.

    I've re-tested the series both on x86 and on cross-arch defconfigs,
    and did a bisectability test at a number of random points.

    I tried to test as many build configurations as possible, but some
    build breakage is probably still left - but it should be mostly
    limited to architectures that have no cross-compiler binaries
    available on kernel.org, and non-default configurations"

    * 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
    sched/headers: Clean up
    sched/headers: Remove #ifdefs from
    sched/headers: Remove the include from
    sched/headers, hrtimer: Remove the include from
    sched/headers, x86/apic: Remove the header inclusion from
    sched/headers, timers: Remove the include from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/core: Remove unused prefetch_stack()
    sched/headers: Remove from
    sched/headers: Remove the 'init_pid_ns' prototype from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove the runqueue_is_locked() prototype
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove from
    sched/headers: Remove the include from
    sched/headers: Remove from
    ...

    Linus Torvalds
     

03 Mar, 2017

3 commits

  • Add a system call to make extended file information available, including
    file creation and some attribute flags where available through the
    underlying filesystem.

    The getattr inode operation is altered to take two additional arguments: a
    u32 request_mask and an unsigned int flags that indicate the
    synchronisation mode. This change is propagated to the vfs_getattr*()
    function.

    Functions like vfs_stat() are now inline wrappers around new functions
    vfs_statx() and vfs_statx_fd() to reduce stack usage.

    ========
    OVERVIEW
    ========

    The idea was initially proposed as a set of xattrs that could be retrieved
    with getxattr(), but the general preference proved to be for a new syscall
    with an extended stat structure.

    A number of requests were gathered for features to be included. The
    following have been included:

    (1) Make the fields a consistent size on all arches and make them large.

    (2) Spare space, request flags and information flags are provided for
    future expansion.

    (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
    __s64).

    (4) Creation time: The SMB protocol carries the creation time, which could
    be exported by Samba, which will in turn help CIFS make use of
    FS-Cache as that can be used for coherency data (stx_btime).

    This is also specified in NFSv4 as a recommended attribute and could
    be exported by NFSD [Steve French].

    (5) Lightweight stat: Ask for just those details of interest, and allow a
    netfs (such as NFS) to approximate anything not of interest, possibly
    without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
    Dilger] (AT_STATX_DONT_SYNC).

    (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
    its cached attributes are up to date [Trond Myklebust]
    (AT_STATX_FORCE_SYNC).

    And the following have been left out for future extension:

    (7) Data version number: Could be used by userspace NFS servers [Aneesh
    Kumar].

    Can also be used to modify fill_post_wcc() in NFSD which retrieves
    i_version directly, but has just called vfs_getattr(). It could get
    it from the kstat struct if it used vfs_xgetattr() instead.

    (There's disagreement on the exact semantics of a single field, since
    not all filesystems do this the same way).

    (8) BSD stat compatibility: Including more fields from the BSD stat such
    as creation time (st_btime) and inode generation number (st_gen)
    [Jeremy Allison, Bernd Schubert].

    (9) Inode generation number: Useful for FUSE and userspace NFS servers
    [Bernd Schubert].

    (This was asked for but later deemed unnecessary with the
    open-by-handle capability available and caused disagreement as to
    whether it's a security hole or not).

    (10) Extra coherency data may be useful in making backups [Andreas Dilger].

    (No particular data were offered, but things like last backup
    timestamp, the data version number and the DOS archive bit would come
    into this category).

    (11) Allow the filesystem to indicate what it can/cannot provide: A
    filesystem can now say it doesn't support a standard stat feature if
    that isn't available, so if, for instance, inode numbers or UIDs don't
    exist or are fabricated locally...

    (This requires a separate system call - I have an fsinfo() call idea
    for this).

    (12) Store a 16-byte volume ID in the superblock that can be returned in
    struct xstat [Steve French].

    (Deferred to fsinfo).

    (13) Include granularity fields in the time data to indicate the
    granularity of each of the times (NFSv4 time_delta) [Steve French].

    (Deferred to fsinfo).

    (14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
    Note that the Linux IOC flags are a mess and filesystems such as Ext4
    define flags that aren't in linux/fs.h, so translation in the kernel
    may be a necessity (or, possibly, we provide the filesystem type too).

    (Some attributes are made available in stx_attributes, but the general
    feeling was that the IOC flags were too ext[234]-specific and shouldn't
    be exposed through statx this way).

    (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
    Michael Kerrisk].

    (Deferred, probably to fsinfo. Finding out if there's an ACL or
    seclabel might require extra filesystem operations).

    (16) Femtosecond-resolution timestamps [Dave Chinner].

    (A __reserved field has been left in the statx_timestamp struct for
    this - if there proves to be a need).

    (17) A set multiple attributes syscall to go with this.

    ===============
    NEW SYSTEM CALL
    ===============

    The new system call is:

    int ret = statx(int dfd,
                    const char *filename,
                    unsigned int flags,
                    unsigned int mask,
                    struct statx *buffer);

    The dfd, filename and flags parameters indicate the file to query, in a
    similar way to fstatat(). There is no equivalent of lstat() as that can be
    emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
    also no equivalent of fstat() as that can be emulated by passing a NULL
    filename to statx() with the fd of interest in dfd.

    Whether or not statx() synchronises the attributes with the backing store
    can be controlled by OR'ing a value into the flags argument (this typically
    only affects network filesystems):

    (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
    respect.

    (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
    its attributes with the server - which might require data writeback to
    occur to get the timestamps correct.

    (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
    network filesystem. The resulting values should be considered
    approximate.

    mask is a bitmask indicating the fields in struct statx that are of
    interest to the caller. The user should set this to STATX_BASIC_STATS to
    get the basic set returned by stat(). It should be noted that asking for
    more information may entail extra I/O operations.

    buffer points to the destination for the data. This must be 256 bytes in
    size.

    ======================
    MAIN ATTRIBUTES RECORD
    ======================

    The following structures are defined in which to return the main attribute
    set:

    struct statx_timestamp {
            __s64   tv_sec;
            __s32   tv_nsec;
            __s32   __reserved;
    };

    struct statx {
            __u32   stx_mask;
            __u32   stx_blksize;
            __u64   stx_attributes;
            __u32   stx_nlink;
            __u32   stx_uid;
            __u32   stx_gid;
            __u16   stx_mode;
            __u16   __spare0[1];
            __u64   stx_ino;
            __u64   stx_size;
            __u64   stx_blocks;
            __u64   __spare1[1];
            struct statx_timestamp  stx_atime;
            struct statx_timestamp  stx_btime;
            struct statx_timestamp  stx_ctime;
            struct statx_timestamp  stx_mtime;
            __u32   stx_rdev_major;
            __u32   stx_rdev_minor;
            __u32   stx_dev_major;
            __u32   stx_dev_minor;
            __u64   __spare2[14];
    };

    The defined bits in request_mask and stx_mask are:

    STATX_TYPE          Want/got stx_mode & S_IFMT
    STATX_MODE          Want/got stx_mode & ~S_IFMT
    STATX_NLINK         Want/got stx_nlink
    STATX_UID           Want/got stx_uid
    STATX_GID           Want/got stx_gid
    STATX_ATIME         Want/got stx_atime{,_ns}
    STATX_MTIME         Want/got stx_mtime{,_ns}
    STATX_CTIME         Want/got stx_ctime{,_ns}
    STATX_INO           Want/got stx_ino
    STATX_SIZE          Want/got stx_size
    STATX_BLOCKS        Want/got stx_blocks
    STATX_BASIC_STATS   [The stuff in the normal stat struct]
    STATX_BTIME         Want/got stx_btime{,_ns}
    STATX_ALL           [All currently available stuff]

    stx_btime is the file creation time, stx_mask is a bitmask indicating the
    data provided, and the __spare*[] fields are where as-yet undefined fields
    can be placed.

    Time fields are structures with separate seconds and nanoseconds fields
    plus a reserved field in case we want to add even finer resolution. Note
    that times will be negative if before 1970; in such a case, the nanosecond
    fields will also be negative if not zero.
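
    Because the seconds and nanoseconds fields carry the same sign, a
    timestamp can be folded into a single signed nanosecond count by plain
    addition. The sketch below shows this with a local stand-in type,
    struct ex_timestamp, which mirrors the layout given above; the type and
    the function ex_timestamp_to_ns() are hypothetical names used only for
    illustration.

```c
#include <stdint.h>

/* Local stand-in mirroring the statx_timestamp layout shown above. */
struct ex_timestamp {
    int64_t tv_sec;    /* seconds since 1970, negative if before */
    int32_t tv_nsec;   /* nanoseconds, same sign as tv_sec if non-zero */
    int32_t reserved;  /* room for finer resolution */
};

/*
 * Fold a timestamp into one signed nanosecond count. Since both fields
 * share a sign, no borrow or carry handling is needed.
 * Note: the multiplication overflows for times far outside roughly
 * +/- 292 years from 1970; this sketch does not guard against that.
 */
static int64_t ex_timestamp_to_ns(struct ex_timestamp ts)
{
    return ts.tv_sec * INT64_C(1000000000) + ts.tv_nsec;
}
```

    For example, one and a half seconds before the epoch is represented as
    tv_sec = -1 and tv_nsec = -500000000, which this helper turns into
    -1500000000.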

    The bits defined in the stx_attributes field convey information about a
    file, how it is accessed, where it is and what it does. The following
    attributes map to FS_*_FL flags and are the same numerical value:

    STATX_ATTR_COMPRESSED   File is compressed by the fs
    STATX_ATTR_IMMUTABLE    File is marked immutable
    STATX_ATTR_APPEND       File is append-only
    STATX_ATTR_NODUMP       File is not to be dumped
    STATX_ATTR_ENCRYPTED    File requires key to decrypt in fs

    Within the kernel, the supported flags are listed by:

    KSTAT_ATTR_FS_IOC_FLAGS

    [Are any other IOC flags of sufficient general interest to be exposed
    through this interface?]

    New flags include:

    STATX_ATTR_AUTOMOUNT Object is an automount trigger

    These are for the use of GUI tools that might want to mark files specially,
    depending on what they are.
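
    A tool of that kind might test the attribute bits roughly as follows.
    This is only a sketch: the helper names are hypothetical, and it
    assumes the STATX_ATTR_* constants and struct statx come from the
    userspace copy of <linux/stat.h>.

```c
#include <linux/stat.h>  /* struct statx, STATX_ATTR_* (assumes recent uapi headers) */

/* Hypothetical helper: non-zero if the object is an automount trigger. */
static int ex_is_automount_trigger(const struct statx *stx)
{
    return (stx->stx_attributes & STATX_ATTR_AUTOMOUNT) != 0;
}

/* Hypothetical helper: non-zero if the file is marked immutable. */
static int ex_is_immutable(const struct statx *stx)
{
    return (stx->stx_attributes & STATX_ATTR_IMMUTABLE) != 0;
}
```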

    Fields in struct statx come in a number of classes:

    (0) stx_dev_*, stx_blksize.

    These are local system information and are always available.

    (1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
    stx_size, stx_blocks.

    These will be returned whether the caller asks for them or not. The
    corresponding bits in stx_mask will be set to indicate whether they
    actually have valid values.

    If the caller didn't ask for them, then they may be approximated. For
    example, NFS won't waste any time updating them from the server,
    unless as a byproduct of updating something requested.

    If the values don't actually exist for the underlying object (such as
    UID or GID on a DOS file), then the bit won't be set in the stx_mask,
    even if the caller asked for the value. In such a case, the returned
    value will be a fabrication.

    Note that there are instances where the type might not be valid, for
    instance Windows reparse points.

    (2) stx_rdev_*.

    This will be set only if stx_mode indicates we're looking at a
    blockdev or a chardev; otherwise it will be 0.

    (3) stx_btime.

    Similar to (1), except this will be set to 0 if it doesn't exist.
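
    In practice the classes above mean that a caller should consult
    stx_mask before trusting a field, since a value may be present but
    fabricated or zeroed. A minimal sketch for the creation time, using a
    hypothetical helper name and assuming struct statx and STATX_BTIME from
    the userspace copy of <linux/stat.h>:

```c
#include <linux/stat.h>  /* struct statx, STATX_BTIME (assumes recent uapi headers) */

/*
 * Hypothetical helper: store the creation time (seconds) in *sec_out and
 * return 1 only if the filesystem reported it as valid in stx_mask.
 * Returns 0 and leaves *sec_out untouched otherwise.
 */
static int ex_get_btime_sec(const struct statx *stx, long long *sec_out)
{
    if (!(stx->stx_mask & STATX_BTIME))
        return 0;
    *sec_out = (long long)stx->stx_btime.tv_sec;
    return 1;
}
```

    The same pattern applies to the class (1) fields, for instance checking
    STATX_UID before relying on stx_uid.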

    =======
    TESTING
    =======

    The following test program can be used to test the statx system call:

    samples/statx/test-statx.c

    Just compile and run, passing it paths to the files you want to examine.
    The file is built automatically if CONFIG_SAMPLES is enabled.

    Here's some example output. Firstly, an NFS directory that crosses to
    another FSID. Note that the AUTOMOUNT attribute is set because transiting
    this directory will cause d_automount to be invoked by the VFS.

    [root@andromeda ~]# /tmp/test-statx -A /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:26 Inode: 1703937 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000
    Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

    Secondly, the result of automounting on that directory.

    [root@andromeda ~]# /tmp/test-statx /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:27 Inode: 2 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     
  • …sors into <linux/sched/signal.h>

    task_struct::signal and task_struct::sighand are pointers, which would normally make it
    straightforward to not define those types in sched.h.

    That is not so, because the types are accompanied by a myriad of APIs (macros and inline
    functions) that dereference them.

    Split the types and the APIs out of sched.h and move them into a new header, <linux/sched/signal.h>.

    With this change sched.h no longer knows about 'struct signal_struct' and
    'struct sighand_struct'; trying to put accessors into sched.h as a test fails in the following way:

    ./include/linux/sched.h: In function ‘test_signal_types’:
    ./include/linux/sched.h:2461:18: error: dereferencing pointer to incomplete type ‘struct signal_struct’
    ^

    This reduces the size and complexity of sched.h significantly.

    Update all headers and .c code that relied on getting the signal handling
    functionality from <linux/sched.h> to include <linux/sched/signal.h>.

    The list of affected files in the preparatory patch was partly generated by
    grepping for the APIs, and partly by doing coverage build testing, both
    all[yes|mod|def|no]config builds on 64-bit and 32-bit x86, and an array of
    cross-architecture builds.

    Nevertheless some (trivial) build breakage is still expected related to rare
    Kconfig combinations and in-flight patches to various kernel code, but most
    of it should be handled by this patch.

    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • Pull vfs pile two from Al Viro:

    - orangefs fix

    - series of fs/namei.c cleanups from me

    - VFS stuff coming from overlayfs tree

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    orangefs: Use RCU for destroy_inode
    vfs: use helper for calling f_op->fsync()
    mm: use helper for calling f_op->mmap()
    vfs: use helpers for calling f_op->{read,write}_iter()
    vfs: pass type instead of fn to do_{loop,iter}_readv_writev()
    vfs: extract common parts of {compat_,}do_readv_writev()
    vfs: wrap write f_ops with file_{start,end}_write()
    vfs: deny copy_file_range() for non regular files
    vfs: deny fallocate() on directory
    vfs: create vfs helper vfs_tmpfile()
    namei.c: split unlazy_walk()
    namei.c: fold the check for DCACHE_OP_REVALIDATE into d_revalidate()
    lookup_fast(): clean up the logics around the fallback to non-rcu mode
    namei: fold unlazy_link() into its sole caller

    Linus Torvalds
     

02 Mar, 2017

14 commits