03 Apr, 2017

14 commits

  • statx has the ability to report inode creation times and inode flags, so
    hook up di_crtime and di_flags to that functionality.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    Darrick J. Wong
     
  • Return enhanced file attributes from the Ext4 filesystem. This includes
    the following:

    (1) The inode creation time (i_crtime) as stx_btime, setting STATX_BTIME.

    (2) Certain FS_xxx_FL flags are mapped to stx_attributes flags.

    This requires that all ext4 inodes have a getattr call, not just some of
    them, so to this end, split the ext4_getattr() function and only call part
    of it where appropriate.

    Example output:

    [root@andromeda ~]# touch foo
    [root@andromeda ~]# chattr +ai foo
    [root@andromeda ~]# /tmp/test-statx foo
    statx(foo) = 0
    results=fff
    Size: 0 Blocks: 0 IO Block: 4096 regular file
    Device: 08:12 Inode: 2101950 Links: 1
    Access: (0644/-rw-r--r--) Uid: 0 Gid: 0
    Access: 2016-02-11 17:08:29.031795451+0000
    Modify: 2016-02-11 17:08:29.031795451+0000
    Change: 2016-02-11 17:11:11.987790114+0000
    Birth: 2016-02-11 17:08:29.031795451+0000
    Attributes: 0000000000000030 (-------- -------- -------- -------- -------- -------- -------- --ai----)

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
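
The Attributes word in the example output above can be decoded with a few lines of Python. This is a minimal sketch, not part of the patch: the helper name decode_statx_attributes is hypothetical, and the bit values are assumed to match the STATX_ATTR_* definitions in the statx uapi header.

```python
# Illustrative decoder for the stx_attributes word shown in the example output.
# Bit values are ASSUMED to match the STATX_ATTR_* uapi definitions.
STATX_ATTR_FLAGS = {
    0x00000004: "compressed",
    0x00000010: "immutable",
    0x00000020: "append",
    0x00000040: "nodump",
    0x00000800: "encrypted",
}


def decode_statx_attributes(attrs):
    """Return the names of the known attribute bits set in attrs, lowest bit first."""
    return [name for bit, name in sorted(STATX_ATTR_FLAGS.items()) if attrs & bit]


if __name__ == "__main__":
    # 0x30 is the value printed for the file marked with "chattr +ai".
    print(decode_statx_attributes(0x30))
```

For the value 0x30 from the example, both the immutable and append bits are set, which agrees with the "--ai----" rendering.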
     
  • I found that statx() was significantly slower than stat(). As a
    microbenchmark, I compared 10,000,000 invocations of fstat() on a tmpfs
    file to the same with statx() passed a NULL path:

    $ time ./stat_benchmark

    real 0m1.464s
    user 0m0.275s
    sys 0m1.187s

    $ time ./statx_benchmark

    real 0m5.530s
    user 0m0.281s
    sys 0m5.247s

    statx is expected to be a little slower than stat because struct statx
    is larger than struct stat, but not by *that* much. It turns out that
    most of the overhead was in copying struct statx to userspace, mostly in
    all the stac/clac instructions that got generated for each __put_user()
    call. (This was on x86_64, but some other architectures, e.g. arm64,
    have something similar now too.)

    stat() instead initializes its struct on the stack and copies it to
    userspace with a single call to copy_to_user(). This turns out to be
    much faster, and changing statx to do this makes it almost as fast as
    stat:

    $ time ./statx_benchmark

    real 0m1.624s
    user 0m0.270s
    sys 0m1.354s

    For zeroing the reserved fields, start by zeroing the full struct with
    memset. This makes it clear that every byte copied to userspace is
    initialized, even implicit padding bytes (though there are none
    currently). In the scenarios I tested, it also performed the same as a
    designated initializer. Manually initializing each field was still
    slightly faster, but would have been more error-prone and less
    verifiable.

    Also rename statx_set_result() to cp_statx() for consistency with
    cp_old_stat() et al., and make it noinline so that struct statx doesn't
    add to the stack usage during the main portion of the syscall execution.

    Signed-off-by: Eric Biggers
    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    Eric Biggers
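
The zero, fill, then copy-once pattern described above can be sketched in user space with ctypes. This is only an analogy under stated assumptions: StatxLike is a hypothetical, much smaller stand-in for struct statx, cp_statx_like is a hypothetical name, and bytes() stands in for the single copy_to_user() call.

```python
import ctypes
import struct


class StatxLike(ctypes.Structure):
    # Hypothetical stand-in for struct statx, far smaller than the real one.
    _fields_ = [
        ("mask", ctypes.c_uint32),
        ("blksize", ctypes.c_uint32),
        ("size", ctypes.c_uint64),
        ("reserved", ctypes.c_uint64 * 2),  # spare fields that must read as zero
    ]


def cp_statx_like(mask, blksize, size):
    tmp = StatxLike()
    # Zero every byte first, including reserved fields and any padding.
    # (ctypes already zero-initializes; the explicit memset mirrors the text.)
    ctypes.memset(ctypes.addressof(tmp), 0, ctypes.sizeof(tmp))
    tmp.mask = mask
    tmp.blksize = blksize
    tmp.size = size
    # One bulk copy of the whole struct instead of one store per field.
    return bytes(tmp)


if __name__ == "__main__":
    out = cp_statx_like(0xFFF, 4096, 0)
    print(len(out), struct.unpack("=IIQ", out[:16]))
```

Because the struct is zeroed before any field is set, the reserved bytes in the copied buffer are zero without being assigned one by one.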
     
  • request_mask and query_flags are function arguments, not passed in
    struct kstat. So remove the part of the comment which claims otherwise.
    This was apparently left over from an earlier version of the statx
    patch.

    Signed-off-by: Eric Biggers
    Signed-off-by: David Howells
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Eric Biggers
     
  • The statx() system call currently accepts unknown flags when called with
    a NULL path to operate on a file descriptor. Left unchanged, this could
    make it hard to introduce new query flags in the future, since
    applications may not be able to tell whether a given flag is supported.

    Fix this by failing the system call with EINVAL if any flags other than
    KSTAT_QUERY_FLAGS are specified in combination with a NULL path.

    Arguably, we could still permit known lookup-related flags such as
    AT_SYMLINK_NOFOLLOW. However, that would be inconsistent with how
    sys_utimensat() behaves when passed a NULL path, which seems to be the
    closest precedent. And given that the NULL path case is (I believe)
    mainly intended to be used to implement a wrapper function like fstatx()
    that doesn't have a path argument, I think rejecting lookup-related
    flags too is probably the best choice.

    Signed-off-by: Eric Biggers
    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    Eric Biggers
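
The NULL-path check described above amounts to a single mask test. The sketch below models it in Python; check_null_path_flags is a hypothetical name, and the flag values are assumed to match the Linux uapi definitions (with KSTAT_QUERY_FLAGS taken to be the two sync-type bits).

```python
import errno

# Values ASSUMED from the Linux uapi fcntl.h definitions.
AT_SYMLINK_NOFOLLOW = 0x100
AT_STATX_FORCE_SYNC = 0x2000
AT_STATX_DONT_SYNC = 0x4000
# Assumed to be the sync-type mask (both sync bits).
KSTAT_QUERY_FLAGS = AT_STATX_FORCE_SYNC | AT_STATX_DONT_SYNC


def check_null_path_flags(flags):
    """Model of the NULL-path case: 0 if the flags are acceptable, else -EINVAL."""
    if flags & ~KSTAT_QUERY_FLAGS:
        return -errno.EINVAL
    return 0


if __name__ == "__main__":
    print(check_null_path_flags(AT_STATX_DONT_SYNC))
    print(check_null_path_flags(AT_SYMLINK_NOFOLLOW))
```

Under this model a lookup-related flag such as AT_SYMLINK_NOFOLLOW is rejected in the same way as a completely unknown bit.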
     
  • Following the recent merge of statx, correct the documented prototype
    for the ->getattr() inode operation, and add an entry to the porting
    file.

    Signed-off-by: Eric Biggers
    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    Eric Biggers
     
  • Linus Torvalds
     
  • Pull dmaengine fixes from Vinod Koul:
    "A couple of minor fixes for 4.11:

    - array bound fix for __get_unmap_pool()

    - cyclic period splitting for bcm2835"

    * tag 'dmaengine-fix-4.11-rc5' of git://git.infradead.org/users/vkoul/slave-dma:
    dmaengine: Fix array index out of bounds warning in __get_unmap_pool()
    dmaengine: bcm2835: Fix cyclic DMA period splitting

    Linus Torvalds
     
  • Pull x86 fixes from Thomas Gleixner:
    "This update provides:

    - prevent KASLR from randomizing EFI regions

    - restrict the usage of -maccumulate-outgoing-args and document when
    and why it is required.

    - make the Global Physical Address calculation for UV4 systems work
    correctly.

    - address a copy->paste->forgot-edit problem in the MCE exception
    table entries.

    - assign a name to AMD MCA bank 3, so the sysfs file registration
    works.

    - add a missing include in the boot code"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/boot: Include missing header file
    x86/mce/AMD: Give a name to MCA bank 3 when accessed with legacy MSRs
    x86/build: Mostly disable '-maccumulate-outgoing-args'
    x86/mm/KASLR: Exclude EFI region from KASLR VA space randomization
    x86/mce: Fix copy/paste error in exception table entries
    x86/platform/uv: Fix calculation of Global Physical Address

    Linus Torvalds
     
  • Pull scheduler fixes from Thomas Gleixner:
    "This update provides:

    - make the scheduler clock switch to unstable mode smooth so the
    timestamps stay at microseconds granularity instead of switching to
    tick granularity.

    - unbreak perf test tsc by taking the new offset into account which
    was added in order to provide better sched clock continuity

    - switching sched clock to unstable mode runs all clock related
    computations which affect the sched clock output itself from a work
    queue. In case of preemption sched clock uses half updated data and
    provides wrong timestamps. Keep the math in the protected context
    and delegate only the static key switch to workqueue context.

    - remove a duplicate header include"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/headers: Remove duplicate #include line
    sched/clock: Fix broken stable to unstable transfer
    sched/clock, x86/perf: Fix "perf test tsc"
    sched/clock: Fix clear_sched_clock_stable() preempt wobbly

    Linus Torvalds
     
  • Pull EFI fix from Thomas Gleixner:
    "Downgrade the missing ESRT header printk to warning level and remove a
    useless error printk which just generates noise for no value"

    * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    efi/esrt: Cleanup bad memory map log messages

    Linus Torvalds
     
  • Pull timer fixes from Thomas Gleixner:
    "Two small fixes for the new CLKEVT_OF infrastructure"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    vmlinux.lds: Add __clkevt_of_table to kernel
    clockevents: Fix syntax error in clkevt-of macro

    Linus Torvalds
     
  • Pull irq fixes from Thomas Gleixner:
    "Two small fixlets:

    - select a required Kconfig to make the MVEBU driver compile

    - add the missing MIPS local GIC interrupts, whose absence prevents
    drivers from probing successfully"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip/mips-gic: Fix Local compare interrupt
    irqchip/mvebu-odmi: Select GENERIC_MSI_IRQ_DOMAIN

    Linus Torvalds
     
  • Pull core fix from Thomas Gleixner:
    "Prevent leaking kernel memory via /proc/$pid/syscall when the queried
    task is not in a syscall"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    lib/syscall: Clear return values when no stack

    Linus Torvalds
     

02 Apr, 2017

10 commits

  • Pull parisc fixes from Helge Deller:
    "Al Viro reported that - in case of read faults - our copy_from_user()
    implementation may claim to have copied more bytes than it actually
    did. In order to fix this bug and because of the way gcc optimizes
    register usage for inline assembly in C code, we had to replace our
    pa_memcpy() function with a pure assembler implementation.

    While fixing the memcpy bug we noticed some other issues with our
    get_user() and put_user() functions, e.g. nested faults may return
    wrong data. This is now fixed by a common fixup handler for
    get_user/put_user in the exception handler which additionally makes
    generated code smaller and faster.

    The third patch is a trivial one-line fix for a patch which went in
    during 4.11-rc and which avoids stalled CPU warnings after power
    shutdown (for parisc machines which can't plug power off themselves).

    Due to the rewrite of pa_memcpy() into assembly this patch got bigger
    than what I wanted to have sent at this stage.

    Those patches have been running in production during the last few days
    on our debian build servers without any further issues"

    * 'parisc-4.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
    parisc: Avoid stalled CPU warnings after system shutdown
    parisc: Clean up fixup routines for get_user()/put_user()
    parisc: Fix access fault handling in pa_memcpy()

    Linus Torvalds
     
  • Pull SCSI fixes from James Bottomley:
    "Thirteen small fixes: The hopefully final effort to get the lpfc nvme
    kconfig problems sorted, there's one important sg fix (user can induce
    read after end of buffer) and one minor enhancement (adding an extra
    PCI ID to qedi). The rest are a set of minor fixes, which mostly occur
    as user visible in error legs or on specific devices"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    scsi: ufs: remove the duplicated checking for supporting clkscaling
    scsi: lpfc: fix building without debugfs support
    scsi: lpfc: Fix PT2PT PRLI reject
    scsi: hpsa: fix volume offline state
    scsi: libsas: fix ata xfer length
    scsi: scsi_dh_alua: Warn if the first argument of alua_rtpg_queue() is NULL
    scsi: scsi_dh_alua: Ensure that alua_activate() calls the completion function
    scsi: scsi_dh_alua: Check scsi_device_get() return value
    scsi: sg: check length passed to SG_NEXT_CMD_LEN
    scsi: ufshcd-platform: remove the useless cast in ERR_PTR/IS_ERR
    scsi: qedi: Add PCI device-ID for QL41xxx adapters.
    scsi: aacraid: Fix potential null access
    scsi: qla2xxx: Fix crash in qla2xxx_eh_abort on bad ptr

    Linus Torvalds
     
  • Merge misc fixes from Andrew Morton:
    "11 fixes"

    * emailed patches from Andrew Morton:
    kasan: do not sanitize kexec purgatory
    drivers/rapidio/devices/tsi721.c: make module parameter variable name unique
    mm/hugetlb.c: don't call region_abort if region_chg fails
    kasan: report only the first error by default
    hugetlbfs: initialize shared policy as part of inode allocation
    mm: fix section name for .data..ro_after_init
    mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd()
    mm: workingset: fix premature shadow node shrinking with cgroups
    mm: rmap: fix huge file mmap accounting in the memcg stats
    mm: move mm_percpu_wq initialization earlier
    mm: migrate: fix remove_migration_pte() for ksm pages

    Linus Torvalds
     
  • Pull USB fixes from Greg KH:
    "Here are some small USB fixes for 4.11-rc5.

    The usual xhci fixes are here, as well as a fix for yet-another-bug-
    found-by-KASAN, those developers are doing great stuff here.

    And there's a phy build warning fix that showed up in 4.11-rc1.

    All of these have been in linux-next with no reported issues"

    * tag 'usb-4.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
    usb: phy: isp1301: Fix build warning when CONFIG_OF is disabled
    xhci: Manually give back cancelled URB if we can't queue it for cancel
    xhci: Set URB actual length for stopped control transfers
    xhci: plat: Register shutdown for xhci_plat
    USB: fix linked-list corruption in rh_call_control()

    Linus Torvalds
     
  • Pull tty/serial fixes from Greg KH:
    "Here are some small fixes for some serial drivers and Kconfig help
    text for 4.11-rc5. Nothing major here at all, a few things resolving
    reported bugs in some random serial drivers.

    I don't think these made the last linux-next, since I only got to them
    yesterday, but I am not sure; they might have snuck in. The patches
    only affect drivers whose maintainers sent them to me, so we should be
    safe here :)"

    * tag 'tty-4.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
    tty: pl011: fix earlycon work-around for QDF2400 erratum 44
    serial: 8250_EXAR: fix duplicate Kconfig text and add missing help text
    tty/serial: atmel: fix TX path in atmel_console_write()
    tty/serial: atmel: fix race condition (TX+DMA)
    serial: mxs-auart: Fix baudrate calculation

    Linus Torvalds
     
  • Pull ACPI fixes from Rafael Wysocki:
    "These fix two issues related to IOAPIC hotplug, an overzealous build
    optimization that prevents the function graph tracer from working with
    the ACPI subsystem correctly and an RCU synchronization issue in the
    ACPI APEI code.

    Specifics:

    - drop the unconditional setting of the '-Os' gcc flag from the ACPI
    Makefile to make the function graph tracer work correctly with the
    ACPI subsystem (Josh Poimboeuf).

    - add missing synchronize_rcu() to ghes_remove() which removes an
    element from an RCU-protected list, but fails to synchronize it
    properly afterward (James Morse).

    - fix two problems related to IOAPIC hotplug, a local variable
    initialization in setup_res() and the creation of platform device
    objects for IO(x)APICs which are (a) unused and (b) leaked on
    hot-removal (Joerg Roedel)"

    * tag 'acpi-4.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    ACPI: Fix incompatibility with mcount-based function graph tracing
    ACPI / APEI: Add missing synchronize_rcu() on NOTIFY_SCI removal
    ACPI: Do not create a platform_device for IOAPIC/IOxAPIC
    ACPI: ioapic: Clear on-stack resource before using it

    Linus Torvalds
     
  • Pull power management fixes from Rafael Wysocki:
    "These fix a cpufreq core issue with the initialization of the cpufreq
    sysfs interface and a cpuidle powernv driver initialization issue.

    Specifics:

    - symbolic links from CPU directories to the corresponding cpufreq
    policy directories in sysfs are not created during initialization
    in some cases which confuses user space, so prevent that from
    happening (Rafael Wysocki).

    - the powernv cpuidle driver fails to pass a correct cpumask to the
    cpuidle core in some cases which causes subsequent failures to
    occur, so fix it (Vaidyanathan Srinivasan)"

    * tag 'pm-4.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    cpuidle: powernv: Pass correct drv->cpumask for registration
    cpufreq: Fix creation of symbolic links to policy directories

    Linus Torvalds
     
  • Pull i2c fixes from Wolfram Sang:
    "Two bugfixes from I2C, specifically the I2C mux section. Thanks to
    peda for collecting them"

    * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
    i2c: mux: pca954x: Add missing pca9546 definition to chip_desc
    Revert "i2c: mux: pca954x: Add ACPI support for pca954x"

    Linus Torvalds
     
  • Pull ARC fixes from Vineet Gupta:
    "Accumulated fixes for ARC which I've been sitting on for a while:

    - reading clk from driver vs device tree [Vlad]

    - fix support for UIO in VDK platform [Alexey]

    - SLC busy bit reading workaround

    - build warning with kprobes header reorg"

    * tag 'arc-4.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
    ARC: fix build warnings with !CONFIG_KPROBES
    ARCv2: SLC: Make sure busy bit is set properly on SLC flushing
    ARC: vdk: Fix support of UIO
    ARCv2: make unimplemented vectors as no-ops rather than halt core
    ARC: get rate from clk driver instead of reading device tree
    ARC: [dts] add cpu nodes to ARCHS SMP device tree
    ARC: [dts] add input clocks for cpu nodes

    Linus Torvalds
     
  • Pull nfsd fixes from Bruce Fields:
    "The restriction of NFSv4 to TCP went overboard and also broke the
    backchannel; fix.

    Also some minor refinements to the nfsd version-setting interface that
    we'd like to get fixed before release"

    * tag 'nfsd-4.11-1' of git://linux-nfs.org/~bfields/linux:
    svcrdma: set XPT_CONG_CTRL flag for bc xprt
    NFSD: fix nfsd_reset_versions for NFSv4.
    NFSD: fix nfsd_minorversion(.., NFSD_AVAIL)
    NFSD: further refinement of content of /proc/fs/nfsd/versions
    nfsd: map the ENOKEY to nfserr_perm for avoiding warning
    SUNRPC/backchanel: set XPT_CONG_CTRL flag for bc xprt

    Linus Torvalds
     

01 Apr, 2017

16 commits

  • The work-around for the Qualcomm Datacenter Technologies QDF2400
    erratum 44 sets the "qdf2400_e44_present" global variable if the
    work-around is needed. However, this check does not happen until after
    earlycon is initialized, which means the work-around is not
    used, and the console hangs as soon as it displays one character.

    Fixes: d8a4995bcea1 ("tty: pl011: Work around QDF2400 E44 stuck BUSY bit")
    Signed-off-by: Timur Tabi
    Signed-off-by: Greg Kroah-Hartman

    Timur Tabi
     
  • Pull btrfs fixes from Chris Mason:
    "We have three small fixes queued up in my for-linus-4.11 branch"

    * 'for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: fix an integer overflow check
    btrfs: Change qgroup_meta_rsv to 64bit
    Btrfs: bring back repair during read

    Linus Torvalds
     
  • Fixes this:

    kexec: Undefined symbol: __asan_load8_noabort
    kexec-bzImage64: Loading purgatory failed

    Link: http://lkml.kernel.org/r/1489672155.4458.7.camel@gmx.de
    Signed-off-by: Mike Galbraith
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Galbraith
     
  • kbuild test robot reported a non-static variable name collision between
    a staging driver and a RapidIO driver, with a generic variable name of
    'dbg_level'.

    Both drivers should be changed so that they don't use this generic
    public variable name. This patch fixes the RapidIO driver but does not
    change the user interface (name) for the module parameter.

    drivers/staging/built-in.o:(.bss+0x109d0): multiple definition of `dbg_level'
    drivers/rapidio/built-in.o:(.bss+0x16c): first defined here

    Link: http://lkml.kernel.org/r/ab527fc5-aa3c-4b07-5d48-eef5de703192@infradead.org
    Signed-off-by: Randy Dunlap
    Reported-by: kbuild test robot
    Cc: Greg Kroah-Hartman
    Cc: Matt Porter
    Cc: Alexandre Bounine
    Cc: Jérémy Lefaure
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
    Changing a hugetlbfs reservation map is a two-step process. The first
    step is a call to region_chg to determine what needs to be changed and
    to prepare that change. This should be followed by a call to
    region_add to commit the change, or region_abort to abort the change.

    The error path in hugetlb_reserve_pages called region_abort after a
    failed call to region_chg. As a result, the adds_in_progress counter in
    the reservation map is off by 1. This is caught by a VM_BUG_ON in
    resv_map_release when the reservation map is freed.

    The syzkaller fuzzer (using an injected kmalloc failure) found this
    bug, which resulted in the following:

    kernel BUG at mm/hugetlb.c:742!
    Call Trace:
    hugetlbfs_evict_inode+0x7b/0xa0 fs/hugetlbfs/inode.c:493
    evict+0x481/0x920 fs/inode.c:553
    iput_final fs/inode.c:1515 [inline]
    iput+0x62b/0xa20 fs/inode.c:1542
    hugetlb_file_setup+0x593/0x9f0 fs/hugetlbfs/inode.c:1306
    newseg+0x422/0xd30 ipc/shm.c:575
    ipcget_new ipc/util.c:285 [inline]
    ipcget+0x21e/0x580 ipc/util.c:639
    SYSC_shmget ipc/shm.c:673 [inline]
    SyS_shmget+0x158/0x230 ipc/shm.c:657
    entry_SYSCALL_64_fastpath+0x1f/0xc2
    RIP: resv_map_release+0x265/0x330 mm/hugetlb.c:742

    Link: http://lkml.kernel.org/r/1490821682-23228-1-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Dmitry Vyukov
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
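
The off-by-one described above can be reproduced with a toy model of the counter. This is a sketch under assumptions, not the kernel code: the ResvMap class, the fail parameter, and the reserve helper are hypothetical, and the model only tracks adds_in_progress.

```python
class ResvMap:
    """Toy model of a reservation map that only tracks adds_in_progress."""

    def __init__(self):
        self.adds_in_progress = 0

    def region_chg(self, fail=False):
        # Step 1: prepare a change. A failed call leaves nothing in progress.
        self.adds_in_progress += 1
        if fail:  # e.g. an injected allocation failure
            self.adds_in_progress -= 1
            return -1
        return 0

    def region_add(self):
        # Step 2a: commit the prepared change.
        self.adds_in_progress -= 1

    def region_abort(self):
        # Step 2b: abandon the prepared change.
        self.adds_in_progress -= 1


def reserve(resv, fail_chg, abort_after_failed_chg):
    """Run one reservation attempt and return the counter value left behind."""
    if resv.region_chg(fail=fail_chg) < 0:
        if abort_after_failed_chg:  # the old error path
            resv.region_abort()
        return resv.adds_in_progress
    resv.region_add()
    return resv.adds_in_progress


if __name__ == "__main__":
    print(reserve(ResvMap(), fail_chg=True, abort_after_failed_chg=True))
    print(reserve(ResvMap(), fail_chg=True, abort_after_failed_chg=False))
```

In this model, aborting after a failed region_chg leaves the counter at -1, while skipping the abort leaves it balanced at 0.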
     
  • Disable kasan after the first report. There are several reasons for
    this:

    - A single bug quite often causes multiple invalid memory accesses,
    producing a storm of reports in dmesg.

    - An out-of-bounds write might corrupt metadata, so the next report
    will print bogus alloc/free stack traces.

    - Reports after the first may well not be bugs in themselves, just
    side effects of the first one.

    Given that multiple reports usually only do harm, it makes sense to
    disable kasan after the first one. If the user wants to see all the
    reports, the boot-time parameter kasan_multi_shot must be used.

    [aryabinin@virtuozzo.com: wrote changelog and doc, added missing include]
    Link: http://lkml.kernel.org/r/20170323154416.30257-1-aryabinin@virtuozzo.com
    Signed-off-by: Mark Rutland
    Signed-off-by: Andrey Ryabinin
    Cc: Andrey Konovalov
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Rutland
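
The report-once behaviour described above can be sketched as a latch with an override. This is a user-space model only: the Reporter class and its attribute names are hypothetical, and multi_shot merely stands in for the kasan_multi_shot boot parameter.

```python
class Reporter:
    """Emit only the first report unless multi_shot is enabled."""

    def __init__(self, multi_shot=False):
        self.multi_shot = multi_shot  # stands in for the boot-time parameter
        self.reported = False         # set once the first report has been emitted
        self.log = []

    def report(self, message):
        """Record the message if reporting is still enabled; return whether it was."""
        if self.reported and not self.multi_shot:
            return False
        self.reported = True
        self.log.append(message)
        return True


if __name__ == "__main__":
    quiet = Reporter()
    print([quiet.report(m) for m in ("first", "second", "third")])
```

With the default settings only the first report is logged, and later ones are suppressed as likely side effects.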
     
  • Any time after inode allocation, destroy_inode can be called. The
    hugetlbfs inode contains a shared_policy structure, and
    mpol_free_shared_policy is unconditionally called as part of
    hugetlbfs_destroy_inode. Initialize the policy as part of inode
    allocation so that any quick (error path) calls to destroy_inode will be
    handed an initialized policy.

    The syzkaller fuzzer found this bug, which resulted in the following:

    BUG: KASAN: user-memory-access in atomic_inc
    include/asm-generic/atomic-instrumented.h:87 [inline] at addr
    000000131730bd7a
    BUG: KASAN: user-memory-access in __lock_acquire+0x21a/0x3a80
    kernel/locking/lockdep.c:3239 at addr 000000131730bd7a
    Write of size 4 by task syz-executor6/14086
    CPU: 3 PID: 14086 Comm: syz-executor6 Not tainted 4.11.0-rc3+ #364
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    atomic_inc include/asm-generic/atomic-instrumented.h:87 [inline]
    __lock_acquire+0x21a/0x3a80 kernel/locking/lockdep.c:3239
    lock_acquire+0x1ee/0x590 kernel/locking/lockdep.c:3762
    __raw_write_lock include/linux/rwlock_api_smp.h:210 [inline]
    _raw_write_lock+0x33/0x50 kernel/locking/spinlock.c:295
    mpol_free_shared_policy+0x43/0xb0 mm/mempolicy.c:2536
    hugetlbfs_destroy_inode+0xca/0x120 fs/hugetlbfs/inode.c:952
    alloc_inode+0x10d/0x180 fs/inode.c:216
    new_inode_pseudo+0x69/0x190 fs/inode.c:889
    new_inode+0x1c/0x40 fs/inode.c:918
    hugetlbfs_get_inode+0x40/0x420 fs/hugetlbfs/inode.c:734
    hugetlb_file_setup+0x329/0x9f0 fs/hugetlbfs/inode.c:1282
    newseg+0x422/0xd30 ipc/shm.c:575
    ipcget_new ipc/util.c:285 [inline]
    ipcget+0x21e/0x580 ipc/util.c:639
    SYSC_shmget ipc/shm.c:673 [inline]
    SyS_shmget+0x158/0x230 ipc/shm.c:657
    entry_SYSCALL_64_fastpath+0x1f/0xc2

    Analysis provided by Tetsuo Handa

    Link: http://lkml.kernel.org/r/1490477850-7944-1-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Dmitry Vyukov
    Acked-by: Hillf Danton
    Cc: Tetsuo Handa
    Cc: Michal Hocko
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • A section name for .data..ro_after_init was added by both:

    commit d07a980c1b8d ("s390: add proper __ro_after_init support")

    and

    commit d7c19b066dcf ("mm: kmemleak: scan .data.ro_after_init")

    The latter adds incorrect wrapping around the existing s390 section, and
    came later. I'd prefer the s390 naming, so this moves the s390-specific
    name up to the asm-generic/sections.h and renames the section as used by
    kmemleak (and in the future, kernel/extable.c).

    Link: http://lkml.kernel.org/r/20170327192213.GA129375@beast
    Signed-off-by: Kees Cook
    Acked-by: Heiko Carstens [s390 parts]
    Acked-by: Jakub Kicinski
    Cc: Eddie Kovsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
    I found a race condition that triggers the following bug when
    move_pages() and soft offline are called on a single hugetlb page
    concurrently.

    Soft offlining page 0x119400 at 0x700000000000
    BUG: unable to handle kernel paging request at ffffea0011943820
    IP: follow_huge_pmd+0x143/0x190
    PGD 7ffd2067
    PUD 7ffd1067
    PMD 0
    [61163.582052] Oops: 0000 [#1] SMP
    Modules linked in: binfmt_misc ppdev virtio_balloon parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq ip_tables xfs libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel ata_piix serio_raw libata virtio_pci 8139cp virtio_ring virtio mii floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cap_check]
    CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P OE 4.11.0-rc2-mm1+ #2
    Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
    RIP: 0010:follow_huge_pmd+0x143/0x190
    RSP: 0018:ffffc90004bdbcd0 EFLAGS: 00010202
    RAX: 0000000465003e80 RBX: ffffea0004e34d30 RCX: 00003ffffffff000
    RDX: 0000000011943800 RSI: 0000000000080001 RDI: 0000000465003e80
    RBP: ffffc90004bdbd18 R08: 0000000000000000 R09: ffff880138d34000
    R10: ffffea0004650000 R11: 0000000000c363b0 R12: ffffea0011943800
    R13: ffff8801b8d34000 R14: ffffea0000000000 R15: 000077ff80000000
    FS: 00007fc977710740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffea0011943820 CR3: 000000007a746000 CR4: 00000000001406f0
    Call Trace:
    follow_page_mask+0x270/0x550
    SYSC_move_pages+0x4ea/0x8f0
    SyS_move_pages+0xe/0x10
    do_syscall_64+0x67/0x180
    entry_SYSCALL64_slow_path+0x25/0x25
    RIP: 0033:0x7fc976e03949
    RSP: 002b:00007ffe72221d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000117
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc976e03949
    RDX: 0000000000c22390 RSI: 0000000000001400 RDI: 0000000000005827
    RBP: 00007ffe72221e00 R08: 0000000000c2c3a0 R09: 0000000000000004
    R10: 0000000000c363b0 R11: 0000000000000246 R12: 0000000000400650
    R13: 00007ffe72221ee0 R14: 0000000000000000 R15: 0000000000000000
    Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 0c 4c 01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49 01 d4 f6 45 bc 04 74 90 8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a
    RIP: follow_huge_pmd+0x143/0x190 RSP: ffffc90004bdbcd0
    CR2: ffffea0011943820
    ---[ end trace e4f81353a2d23232 ]---
    Kernel panic - not syncing: Fatal exception
    Kernel Offset: disabled

    This bug is triggered when pmd_present() returns true for a non-present
    hugetlb entry, so fixing the present check in follow_huge_pmd() prevents
    it. Using pmd_present() to determine whether a hugetlb entry is present
    is not correct, because pmd_present() checks multiple bits (not only
    _PAGE_PRESENT) for historical reasons and can misjudge hugetlb state.

    Fixes: e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()")
    Link: http://lkml.kernel.org/r/1490149898-20231-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Mike Kravetz
    Cc: Christian Borntraeger
    Cc: Gerald Schaefer
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
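
The difference between the two present checks can be illustrated with plain bit masks. This is a sketch under assumptions: the bit values and the masks each helper tests are assumed from the x86 page table definitions of that era, and the entry value 0x80 is a hypothetical non-present entry chosen only to show the mismatch.

```python
# Bit values ASSUMED from the x86 page table flag definitions.
_PAGE_PRESENT = 0x001
_PAGE_PSE = 0x080
_PAGE_PROTNONE = 0x100


def pte_present(entry):
    """Model of the pte-level check: present or PROT_NONE."""
    return bool(entry & (_PAGE_PRESENT | _PAGE_PROTNONE))


def pmd_present(entry):
    """Model of the pmd-level check, which also accepts the PSE bit."""
    return bool(entry & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE))


if __name__ == "__main__":
    entry = 0x80  # hypothetical non-present entry that has the PSE bit set
    print(pmd_present(entry), pte_present(entry))
```

For such an entry the pmd-level model reports "present" while the pte-level model does not, which is the kind of misjudgment the message describes.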
     
  • Commit 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg
    aware") enabled cgroup-awareness in the shadow node shrinker, but forgot
    to also enable cgroup-awareness in the list_lru the shadow nodes sit on.

    Consequently, all shadow nodes are sitting on a global (per-NUMA node)
    list, while the shrinker applies the limits according to the amount of
    cache in the cgroup it's shrinking. The result is excessive pressure on
    the shadow nodes from cgroups that have very little cache.

    Enable memcg-mode on the shadow node LRUs, such that per-cgroup limits
    are applied to per-cgroup lists.

    Fixes: 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg aware")
    Link: http://lkml.kernel.org/r/20170322005320.8165-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Huge pages are accounted as single units in the memcg's "file_mapped"
    counter. Account the correct number of base pages, like we do in the
    corresponding node counter.

    Link: http://lkml.kernel.org/r/20170322005111.3156-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
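
The accounting change described above can be sketched with two counters. This is a model under assumptions: the page sizes (4 KiB base pages, 2 MiB huge pages) and the names pages_to_account, map_file_page, and the counter keys are hypothetical choices for illustration.

```python
BASE_PAGE_SIZE = 4096             # assumed 4 KiB base pages
HUGE_PAGE_SIZE = 2 * 1024 * 1024  # assumed 2 MiB huge pages


def pages_to_account(is_huge):
    """Number of base pages that one mapping should add to the counters."""
    return HUGE_PAGE_SIZE // BASE_PAGE_SIZE if is_huge else 1


def map_file_page(counters, is_huge):
    nr = pages_to_account(is_huge)
    counters["node_file_mapped"] += nr
    # The memcg counter now uses the same count (a huge page used to add only 1).
    counters["memcg_file_mapped"] += nr
    return counters


if __name__ == "__main__":
    counters = {"node_file_mapped": 0, "memcg_file_mapped": 0}
    map_file_page(counters, is_huge=True)
    map_file_page(counters, is_huge=False)
    print(counters)
```

With these assumed sizes, one huge page plus one base page yields 513 in both counters, so the memcg and node statistics stay in step.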
     
    Yang Li has reported that drain_all_pages triggers a WARN_ON, which
    means that this function is called before mm_percpu_wq is
    initialized on arm64 with CMA configured:

    WARNING: CPU: 2 PID: 1 at mm/page_alloc.c:2423 drain_all_pages+0x244/0x25c
    Modules linked in:
    CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310-00027-g64dfbc5 #127
    Hardware name: Freescale Layerscape 2088A RDB Board (DT)
    task: ffffffc07c4a6d00 task.stack: ffffffc07c4a8000
    PC is at drain_all_pages+0x244/0x25c
    LR is at start_isolate_page_range+0x14c/0x1f0
    [...]
    drain_all_pages+0x244/0x25c
    start_isolate_page_range+0x14c/0x1f0
    alloc_contig_range+0xec/0x354
    cma_alloc+0x100/0x1fc
    dma_alloc_from_contiguous+0x3c/0x44
    atomic_pool_init+0x7c/0x208
    arm64_dma_init+0x44/0x4c
    do_one_initcall+0x38/0x128
    kernel_init_freeable+0x1a0/0x240
    kernel_init+0x10/0xfc
    ret_from_fork+0x10/0x20

    Fix this by moving the whole of setup_vmstat, which is currently an
    initcall, to init_mm_internals, which will be called right after the WQ
    subsystem is initialized.

    Link: http://lkml.kernel.org/r/20170315164021.28532-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Yang Li
    Tested-by: Yang Li
    Tested-by: Xiaolong Ye
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • I found that calling page migration for ksm pages causes the following
    bug:

    page:ffffea0004d51180 count:2 mapcount:2 mapping:ffff88013c785141 index:0x913
    flags: 0x57ffffc0040068(uptodate|lru|active|swapbacked)
    raw: 0057ffffc0040068 ffff88013c785141 0000000000000913 0000000200000001
    raw: ffffea0004d5f9e0 ffffea0004d53f60 0000000000000000 ffff88007d81b800
    page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
    page->mem_cgroup:ffff88007d81b800
    ------------[ cut here ]------------
    kernel BUG at /src/linux-dev/mm/rmap.c:1086!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: ppdev parport_pc virtio_balloon i2c_piix4 pcspkr parport i2c_core acpi_cpufreq ip_tables xfs libcrc32c ata_generic pata_acpi ata_piix 8139too libata virtio_blk 8139cp crc32c_intel mii virtio_pci virtio_ring serio_raw virtio floppy dm_mirror dm_region_hash dm_log dm_mod
    CPU: 0 PID: 3162 Comm: bash Not tainted 4.11.0-rc2-mm1+ #1
    Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
    RIP: 0010:do_page_add_anon_rmap+0x1ba/0x260
    RSP: 0018:ffffc90002473b30 EFLAGS: 00010282
    RAX: 0000000000000021 RBX: ffffea0004d51180 RCX: 0000000000000006
    RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff88007dc0dfe0
    RBP: ffffc90002473b58 R08: 00000000fffffffe R09: 00000000000001c1
    R10: 0000000000000005 R11: 00000000000001c0 R12: ffff880139ab3d80
    R13: 0000000000000000 R14: 0000700000000200 R15: 0000160000000000
    FS: 00007f5195f50740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fd450287000 CR3: 000000007a08e000 CR4: 00000000001406f0
    Call Trace:
    page_add_anon_rmap+0x18/0x20
    remove_migration_pte+0x220/0x2c0
    rmap_walk_ksm+0x143/0x220
    rmap_walk+0x55/0x60
    remove_migration_ptes+0x53/0x80
    migrate_pages+0x8ed/0xb60
    soft_offline_page+0x309/0x8d0
    store_soft_offline_page+0xaf/0xf0
    dev_attr_store+0x18/0x30
    sysfs_kf_write+0x3a/0x50
    kernfs_fop_write+0xff/0x180
    __vfs_write+0x37/0x160
    vfs_write+0xb2/0x1b0
    SyS_write+0x55/0xc0
    do_syscall_64+0x67/0x180
    entry_SYSCALL64_slow_path+0x25/0x25
    RIP: 0033:0x7f51956339e0
    RSP: 002b:00007ffcfa0dffc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f51956339e0
    RDX: 000000000000000c RSI: 00007f5195f53000 RDI: 0000000000000001
    RBP: 00007f5195f53000 R08: 000000000000000a R09: 00007f5195f50740
    R10: 000000000000000b R11: 0000000000000246 R12: 00007f5195907400
    R13: 000000000000000c R14: 0000000000000001 R15: 0000000000000000
    Code: fe ff ff 48 81 c2 00 02 00 00 48 89 55 d8 e8 2e c3 fd ff 48 8b 55 d8 e9 42 ff ff ff 48 c7 c6 e0 52 a1 81 48 89 df e8 46 ad fe ff 0b 48 83 e8 01 e9 7f fe ff ff 48 83 e8 01 e9 96 fe ff ff 48
    RIP: do_page_add_anon_rmap+0x1ba/0x260 RSP: ffffc90002473b30
    ---[ end trace a679d00f4af2df48 ]---
    Kernel panic - not syncing: Fatal exception
    Kernel Offset: disabled
    ---[ end Kernel panic - not syncing: Fatal exception

    The problem is in the following lines:

    new = page - pvmw.page->index +
            linear_page_index(vma, pvmw.address);

    'new' is calculated from 'page', which the caller passes in as the
    destination page, plus an offset adjustment for thp. But this doesn't
    work properly for ksm pages, because pvmw.page->index doesn't change for
    each address while linear_page_index() does, which means that 'new'
    points to a different page for each address backed by the ksm page. As
    a result, we try to set totally unrelated pages as destination pages,
    and that causes a kernel crash.

    This patch fixes the miscalculation and makes ksm page migration work
    fine.
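
    The miscalculation can be reproduced with plain pointer arithmetic.
    The sketch below is a toy model with hypothetical names (toy_page,
    toy_new_offset, toy_pick_new); the index values are made up, and the
    ksm special case in toy_pick_new is only an assumption about how such
    a fix could look, since the message above does not show the change
    itself.

```c
#include <stddef.h>

/* Toy stand-in for struct page; only used as an array element here. */
struct toy_page {
    int unused;
};

/*
 * Offset that the quoted expression adds to the destination page:
 * linear index of the mapped address minus the old page's index.
 */
ptrdiff_t toy_new_offset(unsigned long old_index, unsigned long linear_index)
{
    return (ptrdiff_t)linear_index - (ptrdiff_t)old_index;
}

/*
 * Assumed shape of a fix: a ksm page always maps to the destination page
 * itself, while other pages keep the thp subpage adjustment.
 */
struct toy_page *toy_pick_new(struct toy_page *dest, int is_ksm,
                              unsigned long old_index,
                              unsigned long linear_index)
{
    if (is_ksm)
        return dest;
    return dest + toy_new_offset(old_index, linear_index);
}
```

    For a thp whose head index is 0x200, an address at linear index 0x203
    yields offset 3, which selects the matching subpage. For a ksm page
    with a fixed index of 0x913, an address at linear index 0x2a0 in some
    other mapping yields a nonzero offset, so the unguarded expression
    would point away from the destination page.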

    Fixes: 3fe87967c536 ("mm: convert remove_migration_pte() to use page_vma_mapped_walk()")
    Link: http://lkml.kernel.org/r/1489717683-29905-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • * pm-cpufreq-fixes:
    cpufreq: Fix creation of symbolic links to policy directories

    * pm-cpuidle-fixes:
    cpuidle: powernv: Pass correct drv->cpumask for registration

    Rafael J. Wysocki
     
  • * acpi-hotplug-fixes:
    ACPI: Do not create a platform_device for IOAPIC/IOxAPIC
    ACPI: ioapic: Clear on-stack resource before using it

    * acpi-build-fixes:
    ACPI: Fix incompatibility with mcount-based function graph tracing

    * acpi-apei-fixes:
    ACPI / APEI: Add missing synchronize_rcu() on NOTIFY_SCI removal

    Rafael J. Wysocki
     
  • Pull NFS client fixes from Anna Schumaker:
    "Here are a few more bugfixes that came in over the last couple of
    weeks. Most of these fix various hangs and loops that people found,
    but we also had a few error handling fixes.

    Stable Bugfixes:
    - fix infinite loop on BAD_STATEID error

    Other Bugfixes:
    - fix old dentry rehash after move
    - fix pnfs GETDEVINFO hangs
    - fix pnfs fallback to MDS on commit errors
    - fix flexfiles kernel oops"

    * tag 'nfs-for-4.11-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
    nfs: flexfiles: fix kernel OOPS if MDS returns unsupported DS type
    NFSv4.1 fix infinite loop on IO BAD_STATEID error
    PNFS fix fallback to MDS if got error on commit to DS
    NFS filelayout:call GETDEVICEINFO after pnfs_layout_process completes
    NFS store nfs4_deviceid in struct nfs4_filelayout_segment
    NFS cleanup struct nfs4_filelayout_segment
    NFS: Fix old dentry rehash after move

    Linus Torvalds