18 Dec, 2018

1 commit

  • recvmmsg() takes two arguments that are pointers to structures whose
    layout differs between 32-bit and 64-bit architectures: mmsghdr and
    timespec.

    For y2038 compatibility, we are changing the native system call from
    timespec to __kernel_timespec with a 64-bit time_t (in another patch),
    and use the existing compat system call on both 32-bit and 64-bit
    architectures for compatibility with traditional 32-bit user space.

    As we now have two variants of recvmmsg() for 32-bit tasks that are both
    different from the variant that we use on 64-bit tasks, this means we
    also require two compat system calls!

    The solution I picked is to flip things around: The existing
    compat_sys_recvmmsg() call gets moved from net/compat.c into net/socket.c
    and now handles the case for old user space on all architectures that
    have set CONFIG_COMPAT_32BIT_TIME. A new compat_sys_recvmmsg_time64()
    call gets added in the old place for 64-bit architectures only, this
    one handles the case of a compat mmsghdr structure combined with
    __kernel_timespec.

    In the indirect sys_socketcall(), we now need to call either
    do_sys_recvmmsg() or __compat_sys_recvmmsg(), depending on what kind of
    architecture we are on. For compat_sys_socketcall(), no such change is
    needed, we always call __compat_sys_recvmmsg().

    I decided to not add a new SYS_RECVMMSG_TIME64 socketcall: Any libc
    implementation for 64-bit time_t will need significant changes including
    an updated asm/unistd.h, and it seems better to consistently use the
    separate syscalls for that configuration, leaving the socketcall only for
    backward compatibility with 32-bit time_t based libc.

    The naming is asymmetric for the moment, so both existing syscall
    entry points keep their names, while the new ones are recvmmsg_time32
    and compat_recvmmsg_time64 respectively. I expect that we will rename
    the compat syscalls later as we start using generated syscall tables
    everywhere and add these entry points.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
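
The layout difference behind the recvmmsg() change above can be sketched in plain user-space C. This is only an illustration under stated assumptions: the struct and function names are hypothetical stand-ins (suffixed _sketch), not the kernel's definitions, and the 64-bit nanosecond field is an assumption made to keep the example simple.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the timeout layout used by traditional
 * 32-bit user space: both fields are 32 bits wide. */
struct old_timespec32_sketch {
	int32_t tv_sec;		/* 32-bit seconds run out in January 2038 */
	int32_t tv_nsec;
};

/* Hypothetical stand-in for a y2038-safe layout with 64-bit seconds. */
struct kernel_timespec_sketch {
	int64_t tv_sec;
	int64_t tv_nsec;	/* assumed 64 bits wide for this sketch */
};

/* The two layouts cannot be passed through the same pointer argument. */
static_assert(sizeof(struct old_timespec32_sketch) == 8, "two 32-bit fields");
static_assert(sizeof(struct kernel_timespec_sketch) == 16, "two 64-bit fields");

/* Widen a legacy timeout; this is always lossless. */
struct kernel_timespec_sketch widen_timespec32(struct old_timespec32_sketch in)
{
	struct kernel_timespec_sketch out = {
		.tv_sec = in.tv_sec,	/* sign-extended to 64 bits */
		.tv_nsec = in.tv_nsec,
	};
	return out;
}

/* Narrow a 64-bit timeout for legacy user space.
 * Returns 0 on success, -1 if the seconds do not fit in 32 bits. */
int narrow_timespec64(struct kernel_timespec_sketch in,
		      struct old_timespec32_sketch *out)
{
	if (in.tv_sec > INT32_MAX || in.tv_sec < INT32_MIN)
		return -1;
	out->tv_sec = (int32_t)in.tv_sec;
	out->tv_nsec = (int32_t)in.tv_nsec;
	return 0;
}
```

An entry point for old user space would widen the incoming timeout once and then share the 64-bit code path; narrowing is only needed when a value is copied back out.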
     

08 Dec, 2018

2 commits

  • This prepares sys_futex for y2038 safe calling: the native
    syscall is changed to receive a __kernel_timespec argument, which
    will be switched to 64-bit time_t in the future. All the internal
    time handling gets changed to timespec64, and the compat_sys_futex
    entry point is moved under the CONFIG_COMPAT_32BIT_TIME check
    to provide compatibility for existing 32-bit architectures.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • We are going to share the compat_sys_futex() handler between 64-bit
    architectures and 32-bit architectures that need to deal with both 32-bit
    and 64-bit time_t, and this is easier if both entry points are in the
    same file.

    In fact, most other system call handlers do the same thing these days, so
    let's follow the trend here and merge all of futex_compat.c into futex.c.

    In the process, a few minor changes have to be done to make sure everything
    still makes sense: handle_futex_death() and futex_cmpxchg_enabled() become
    local symbols, and the compat version of the fetch_robust_entry() function
    gets renamed to compat_fetch_robust_entry() to avoid a symbol clash.

    This is intended as a purely cosmetic patch, no behavior should
    change.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

07 Dec, 2018

5 commits

  • struct timespec is not y2038 safe.
    struct __kernel_timespec is the new y2038 safe structure for all
    syscalls that are using struct timespec.
    Update io_pgetevents interfaces to use struct __kernel_timespec.

    sigset_t also has different representations on 32 bit and 64 bit
    architectures. Hence, we need to support the following different
    syscalls:

    New y2038 safe syscalls:
    (Controlled by CONFIG_64BIT_TIME for 32 bit ABIs)

    Native 64 bit (unchanged) and native 32 bit : sys_io_pgetevents
    Compat : compat_sys_io_pgetevents_time64

    Older y2038 unsafe syscalls:
    (Controlled by CONFIG_COMPAT_32BIT_TIME for 32 bit ABIs)

    Native 32 bit : sys_io_pgetevents_time32
    Compat : compat_sys_io_pgetevents

    Note that io_getevents syscalls do not have a y2038 safe solution.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Arnd Bergmann

    Deepa Dinamani
     
  • struct timespec is not y2038 safe.
    struct __kernel_timespec is the new y2038 safe structure for all
    syscalls that are using struct timespec.
    Update pselect interfaces to use struct __kernel_timespec.

    sigset_t also has different representations on 32 bit and 64 bit
    architectures. Hence, we need to support the following different
    syscalls:

    New y2038 safe syscalls:
    (Controlled by CONFIG_64BIT_TIME for 32 bit ABIs)

    Native 64 bit (unchanged) and native 32 bit : sys_pselect6
    Compat : compat_sys_pselect6_time64

    Older y2038 unsafe syscalls:
    (Controlled by CONFIG_COMPAT_32BIT_TIME for 32 bit ABIs)

    Native 32 bit : pselect6_time32
    Compat : compat_sys_pselect6

    Note that all other versions of select syscalls will not have
    y2038 safe versions.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Arnd Bergmann

    Deepa Dinamani
     
  • struct timespec is not y2038 safe.
    struct __kernel_timespec is the new y2038 safe structure for all
    syscalls that are using struct timespec.
    Update ppoll interfaces to use struct __kernel_timespec.

    sigset_t also has different representations on 32 bit and 64 bit
    architectures. Hence, we need to support the following different
    syscalls:

    New y2038 safe syscalls:
    (Controlled by CONFIG_64BIT_TIME for 32 bit ABIs)

    Native 64 bit (unchanged) and native 32 bit : sys_ppoll
    Compat : compat_sys_ppoll_time64

    Older y2038 unsafe syscalls:
    (Controlled by CONFIG_COMPAT_32BIT_TIME for 32 bit ABIs)

    Native 32 bit : ppoll_time32
    Compat : compat_sys_ppoll

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Arnd Bergmann

    Deepa Dinamani
     
  • Refactor the logic to restore the sigmask before the syscall
    returns into an api.

    This is useful for versions of syscalls that pass in the
    sigmask and expect the current->sigmask to be changed during,
    and restored after, the execution of the syscall.

    With the advent of new y2038 syscalls in the subsequent patches,
    we add two more new versions of the syscalls (for pselect, ppoll
    and io_pgetevents) in addition to the existing native and compat
    versions. Adding such an api reduces the logic that would need to
    be replicated otherwise.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Arnd Bergmann

    Deepa Dinamani
     
  • Refactor reading sigset from userspace and updating sigmask
    into an api.

    This is useful for versions of syscalls that pass in the
    sigmask and expect the current->sigmask to be changed during,
    and restored after, the execution of the syscall.

    With the advent of new y2038 syscalls in the subsequent patches,
    we add two more new versions of the syscalls (for pselect, ppoll,
    and io_pgetevents) in addition to the existing native and compat
    versions. Adding such an api reduces the logic that would need to
    be replicated otherwise.

    Note that the calls to sigprocmask() ignored the return value
    from the api as the function only returns an error on an invalid
    first argument that is hardcoded at these call sites.
    The updated logic uses set_current_blocked() instead.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Arnd Bergmann

    Deepa Dinamani
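
The set-then-restore pattern that the two sigmask refactoring commits above factor out can be sketched in user space. This is a minimal analogy, not the kernel API: the helper names are hypothetical, and POSIX sigprocmask() stands in for the kernel-internal set_current_blocked().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Returns 1 if signo is blocked in the caller's current signal mask. */
int sig_is_blocked(int signo)
{
	sigset_t cur;

	/* With a NULL new set, sigprocmask() only queries the mask. */
	sigprocmask(SIG_SETMASK, NULL, &cur);
	return sigismember(&cur, signo);
}

/* Hypothetical helper: install a temporary mask and save the old one.
 * A NULL newmask means the caller passed no sigmask, so nothing changes. */
int set_temp_sigmask(const sigset_t *newmask, sigset_t *saved)
{
	assert(saved != NULL);
	if (newmask == NULL)
		return 0;
	return sigprocmask(SIG_SETMASK, newmask, saved);
}

/* Hypothetical helper: put the saved mask back before returning. */
void restore_temp_sigmask(const sigset_t *newmask, const sigset_t *saved)
{
	if (newmask != NULL)
		sigprocmask(SIG_SETMASK, saved, NULL);
}
```

Each syscall variant (native, compat, and the two time64/time32 additions) would then call the same pair of helpers around its wait, rather than open-coding the copy, the mask switch and the restore four times.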
     

05 Nov, 2018

5 commits

  • Linus Torvalds
     
  • Pull UBIFS updates from Richard Weinberger:

    - Full filesystem authentication feature, UBIFS is now able to have the
    whole filesystem structure authenticated plus user data encrypted and
    authenticated.

    - Minor cleanups

    * tag 'tags/upstream-4.20-rc1' of git://git.infradead.org/linux-ubifs: (26 commits)
    ubifs: Remove unneeded semicolon
    Documentation: ubifs: Add authentication whitepaper
    ubifs: Enable authentication support
    ubifs: Do not update inode size in-place in authenticated mode
    ubifs: Add hashes and HMACs to default filesystem
    ubifs: authentication: Authenticate super block node
    ubifs: Create hash for default LPT
    ubfis: authentication: Authenticate master node
    ubifs: authentication: Authenticate LPT
    ubifs: Authenticate replayed journal
    ubifs: Add auth nodes to garbage collector journal head
    ubifs: Add authentication nodes to journal
    ubifs: authentication: Add hashes to index nodes
    ubifs: Add hashes to the tree node cache
    ubifs: Create functions to embed a HMAC in a node
    ubifs: Add helper functions for authentication support
    ubifs: Add separate functions to init/crc a node
    ubifs: Format changes for authentication support
    ubifs: Store read superblock node
    ubifs: Drop write_node
    ...

    Linus Torvalds
     
  • Pull NFS client bugfixes from Trond Myklebust:
    "Highlights include:

    Bugfix:
    - Fix build issues on architectures that don't provide 64-bit cmpxchg

    Cleanups:
    - Fix a spelling mistake"

    * tag 'nfs-for-4.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFS: fix spelling mistake, EACCESS -> EACCES
    SUNRPC: Use atomic(64)_t for seq_send(64)

    Linus Torvalds
     
  • Pull more timer updates from Thomas Gleixner:
    "A set of commits for the new C-SKY architecture timers"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    dt-bindings: timer: gx6605s SOC timer
    clocksource/drivers/c-sky: Add gx6605s SOC system timer
    dt-bindings: timer: C-SKY Multi-processor timer
    clocksource/drivers/c-sky: Add C-SKY SMP timer

    Linus Torvalds
     
  • Pull NTB updates from Jon Mason:
    "Fairly minor changes and bug fixes:

    NTB IDT thermal changes and hook into hwmon, ntb_netdev clean-up of
    private struct, and a few bug fixes"

    * tag 'ntb-4.20' of git://github.com/jonmason/ntb:
    ntb: idt: Alter the driver info comments
    ntb: idt: Discard temperature sensor IRQ handler
    ntb: idt: Add basic hwmon sysfs interface
    ntb: idt: Alter temperature read method
    ntb_netdev: Simplify remove with client device drvdata
    NTB: transport: Try harder to alloc an aligned MW buffer
    ntb: ntb_transport: Mark expected switch fall-throughs
    ntb: idt: Set PCIe bus address to BARLIMITx
    NTB: ntb_hw_idt: replace IS_ERR_OR_NULL with regular NULL checks
    ntb: intel: fix return value for ndev_vec_mask()
    ntb_netdev: fix sleep time mismatch

    Linus Torvalds
     

04 Nov, 2018

27 commits

  • Pull scheduler fixes from Ingo Molnar:
    "A memory (under-)allocation fix and a comment fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/topology: Fix off by one bug
    sched/rt: Update comment in pick_next_task_rt()

    Linus Torvalds
     
  • Pull x86 fixes from Ingo Molnar:
    "A number of fixes and some late updates:

    - make in_compat_syscall() behavior on x86-32 similar to other
    platforms, this touches a number of generic files but is not
    intended to impact non-x86 platforms.

    - objtool fixes

    - PAT preemption fix

    - paravirt fixes/cleanups

    - cpufeatures updates for new instructions

    - earlyprintk quirk

    - make microcode version in sysfs world-readable (it is already
    world-readable in procfs)

    - minor cleanups and fixes"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    compat: Cleanup in_compat_syscall() callers
    x86/compat: Adjust in_compat_syscall() to generic code under !COMPAT
    objtool: Support GCC 9 cold subfunction naming scheme
    x86/numa_emulation: Fix uniform-split numa emulation
    x86/paravirt: Remove unused _paravirt_ident_32
    x86/mm/pat: Disable preemption around __flush_tlb_all()
    x86/paravirt: Remove GPL from pv_ops export
    x86/traps: Use format string with panic() call
    x86: Clean up 'sizeof x' => 'sizeof(x)'
    x86/cpufeatures: Enumerate MOVDIR64B instruction
    x86/cpufeatures: Enumerate MOVDIRI instruction
    x86/earlyprintk: Add a force option for pciserial device
    objtool: Support per-function rodata sections
    x86/microcode: Make revision and processor flags world-readable

    Linus Torvalds
     
  • Pull perf updates and fixes from Ingo Molnar:
    "These are almost all tooling updates: 'perf top', 'perf trace' and
    'perf script' fixes and updates, an UAPI header sync with the merge
    window versions, license marker updates, much improved Sparc support
    from David Miller, and a number of fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (66 commits)
    perf intel-pt/bts: Calculate cpumode for synthesized samples
    perf intel-pt: Insert callchain context into synthesized callchains
    perf tools: Don't clone maps from parent when synthesizing forks
    perf top: Start display thread earlier
    tools headers uapi: Update linux/if_link.h header copy
    tools headers uapi: Update linux/netlink.h header copy
    tools headers: Sync the various kvm.h header copies
    tools include uapi: Update linux/mmap.h copy
    perf trace beauty: Use the mmap flags table generated from headers
    perf beauty: Wire up the mmap flags table generator to the Makefile
    perf beauty: Add a generator for MAP_ mmap's flag constants
    tools include uapi: Update asound.h copy
    tools arch uapi: Update asm-generic/unistd.h and arm64 unistd.h copies
    tools include uapi: Update linux/fs.h copy
    perf callchain: Honour the ordering of PERF_CONTEXT_{USER,KERNEL,etc}
    perf cs-etm: Correct CPU mode for samples
    perf unwind: Take pgoff into account when reporting elf to libdwfl
    perf top: Do not use overwrite mode by default
    perf top: Allow disabling the overwrite mode
    perf trace: Beautify mount's first pathname arg
    ...

    Linus Torvalds
     
  • Pull irq fixes from Ingo Molnar:
    "An irqchip driver fix and a memory (over-)allocation fix"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip/irq-mvebu-sei: Fix a NULL vs IS_ERR() bug in probe function
    irq/matrix: Fix memory overallocation

    Linus Torvalds
     
  • With the addition of the NUMA identity level, we increased @level by
    one and will run off the end of the array in the distance sort loop.

    Fixes: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Pull ARM SoC fixes from Olof Johansson:
    "A few fixes that have come in near or during the merge window:

    - Removal of a VLA usage in Marvell mpp platform code

    - Enable some IPMI options for ARM64 servers by default, helps
    testing

    - Enable PREEMPT on 32-bit ARMv7 defconfig

    - Minor fix for stm32 DT (removal of an unused DMA property)

    - Bugfix for TI OMAP1-based ams-delta (-EINVAL -> IRQ_NOTCONNECTED)"

    * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
    ARM: dts: stm32: update HASH1 dmas property on stm32mp157c
    ARM: orion: avoid VLA in orion_mpp_conf
    ARM: defconfig: Update multi_v7 to use PREEMPT
    arm64: defconfig: Enable some IPMI configs
    soc: ti: QMSS: Fix usage of irq_set_affinity_hint
    ARM: OMAP1: ams-delta: Fix impossible .irq < 0

    Linus Torvalds
     
  • Pull more arm64 updates from Catalin Marinas:

    - fix W+X page (mark RO) allocated by the arm64 kprobes code

    - Makefile fix for .i files in out of tree modules

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
    arm64: kprobe: make page to RO mode when allocate it
    arm64: kdump: fix small typo
    arm64: makefile fix build of .i file in external module case

    Linus Torvalds
     
  • Pull dma-mapping fix from Christoph Hellwig:
    "Avoid compile warnings on non-default arm64 configs"

    * tag 'dma-mapping-4.20-2' of git://git.infradead.org/users/hch/dma-mapping:
    arm64: fix warnings without CONFIG_IOMMU_DMA

    Linus Torvalds
     
  • Pull Kbuild updates from Masahiro Yamada:

    - clean-up leftovers in Kconfig files

    - remove stale oldnoconfig and silentoldconfig targets

    - remove unneeded cc-fullversion and cc-name variables

    - improve merge_config script to allow overriding option prefix

    * tag 'kbuild-v4.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
    kbuild: remove cc-name variable
    kbuild: replace cc-name test with CONFIG_CC_IS_CLANG
    merge_config.sh: Allow to define config prefix
    kbuild: remove unused cc-fullversion variable
    kconfig: remove silentoldconfig target
    kconfig: remove oldnoconfig target
    powerpc: PCI_MSI needs PCI
    powerpc: remove CONFIG_MCA leftovers
    powerpc: remove CONFIG_PCI_QSPAN
    scsi: aha152x: rename the PCMCIA define

    Linus Torvalds
     
  • Pull cifs fixes and updates from Steve French:
    "Three small fixes (one Kerberos related, one for stable, and another
    fixes an oops in xfstest 377), two helpful debugging improvements,
    three patches for cifs directio and some minor cleanup"

    * tag '4.20-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
    cifs: fix signed/unsigned mismatch on aio_read patch
    cifs: don't dereference smb_file_target before null check
    CIFS: Add direct I/O functions to file_operations
    CIFS: Add support for direct I/O write
    CIFS: Add support for direct I/O read
    smb3: missing defines and structs for reparse point handling
    smb3: allow more detailed protocol info on open files for debugging
    smb3: on kerberos mount if server doesn't specify auth type use krb5
    smb3: add trace point for tree connection
    cifs: fix spelling mistake, EACCESS -> EACCES
    cifs: fix return value for cifs_listxattr

    Linus Torvalds
     
  • Pull 9p fix from Al Viro:
    "Regression fix for net/9p handling of iov_iter; broken by braino when
    switching to iov_iter_is_kvec() et al., spotted and fixed by Marc"

    * 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    iov_iter: Fix 9p virtio breakage

    Linus Torvalds
     
  • Pull more SCSI updates from James Bottomley:
    "This is a set of minor (and safe) changes that didn't make the
    initial pull request plus some bug fixes"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    scsi: mvsas: Remove set but not used variable 'id'
    scsi: qla2xxx: Remove two arguments from qlafx00_error_entry()
    scsi: qla2xxx: Make sure that qlafx00_ioctl_iosb_entry() initializes 'res'
    scsi: qla2xxx: Remove a set-but-not-used variable
    scsi: qla2xxx: Make qla2x00_sysfs_write_nvram() easier to analyze
    scsi: qla2xxx: Declare local functions 'static'
    scsi: qla2xxx: Improve several kernel-doc headers
    scsi: qla2xxx: Modify fall-through annotations
    scsi: 3w-sas: 3w-9xxx: Use unsigned char for cdb
    scsi: mvsas: Use dma_pool_zalloc
    scsi: target: Don't request modules that aren't even built
    scsi: target: Set response length for REPORT TARGET PORT GROUPS

    Linus Torvalds
     
  • Merge more updates from Andrew Morton:

    - more ocfs2 work

    - various leftovers

    * emailed patches from Andrew Morton:
    memory_hotplug: cond_resched in __remove_pages
    bfs: add sanity check at bfs_fill_super()
    kernel/sysctl.c: remove duplicated include
    kernel/kexec_file.c: remove some duplicated includes
    mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask
    ocfs2: fix clusters leak in ocfs2_defrag_extent()
    ocfs2: dlmglue: clean up timestamp handling
    ocfs2: don't put and assigning null to bh allocated outside
    ocfs2: fix a misuse a of brelse after failing ocfs2_check_dir_entry
    ocfs2: don't use iocb when EIOCBQUEUED returns
    ocfs2: without quota support, avoid calling quota recovery
    ocfs2: remove ocfs2_is_o2cb_active()
    mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
    include/linux/notifier.h: SRCU: fix ctags
    mm: handle no memcg case in memcg_kmem_charge() properly

    Linus Torvalds
     
  • We have received a bug report that unbinding a large pmem (>1TB) can
    result in a soft lockup:

    NMI watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [ndctl:4365]
    [...]
    Supported: Yes
    CPU: 9 PID: 4365 Comm: ndctl Not tainted 4.12.14-94.40-default #1 SLE12-SP4
    Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.01.00.0833.051120182255 05/11/2018
    task: ffff9cce7d4410c0 task.stack: ffffbe9eb1bc4000
    RIP: 0010:__put_page+0x62/0x80
    Call Trace:
    devm_memremap_pages_release+0x152/0x260
    release_nodes+0x18d/0x1d0
    device_release_driver_internal+0x160/0x210
    unbind_store+0xb3/0xe0
    kernfs_fop_write+0x102/0x180
    __vfs_write+0x26/0x150
    vfs_write+0xad/0x1a0
    SyS_write+0x42/0x90
    do_syscall_64+0x74/0x150
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    RIP: 0033:0x7fd13166b3d0

    It has been reported on an older (4.12) kernel but the current upstream
    code doesn't cond_resched in the hot remove code at all and the given
    range to remove might be really large. Fix the issue by calling
    cond_resched once per memory section.

    Link: http://lkml.kernel.org/r/20181031125840.23982-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Thumshirn
    Cc: Dan Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
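
The shape of the soft lockup fix above (yield once per memory section while walking a large range) can be sketched as follows. This is a user-space illustration only: the function name and the section size are assumptions, and sched_yield() stands in for the kernel's cond_resched().

```c
#include <assert.h>
#include <sched.h>

/* Assumed section size for this sketch: 128 MiB of 4 KiB pages. */
#define PAGES_PER_SECTION_SKETCH 32768UL

/* Walk [start_pfn, start_pfn + nr_pages) one section at a time and give
 * up the CPU once per section. Returns how many times it yielded. */
unsigned long remove_pages_sketch(unsigned long start_pfn,
				  unsigned long nr_pages)
{
	unsigned long pfn, yields = 0;

	/* The sketch only handles section-aligned ranges. */
	assert(start_pfn % PAGES_PER_SECTION_SKETCH == 0);
	assert(nr_pages % PAGES_PER_SECTION_SKETCH == 0);

	for (pfn = start_pfn; pfn < start_pfn + nr_pages;
	     pfn += PAGES_PER_SECTION_SKETCH) {
		sched_yield();	/* user-space stand-in for cond_resched() */
		yields++;
		/* ... tear down the section starting at pfn here ... */
	}
	return yields;
}
```

Yielding per section rather than per page keeps the overhead negligible while bounding how long the loop can run without a scheduling point.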
     
  • syzbot is reporting a too large memory allocation at bfs_fill_super() [1].
    Since the file system image is corrupted such that bfs_sb->s_start == 0,
    bfs_fill_super() is trying to allocate 8MB of contiguous memory. Fix
    this by adding a sanity check on bfs_sb->s_start, __GFP_NOWARN and
    printf().

    [1] https://syzkaller.appspot.com/bug?id=16a87c236b951351374a84c8a32f40edbc034e96

    Link: http://lkml.kernel.org/r/1525862104-3407-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Reported-by: syzbot
    Reviewed-by: Andrew Morton
    Cc: Tigran Aivazian
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Remove one include of .
    No functional changes.

    Link: http://lkml.kernel.org/r/20181004134223.17735-1-michael@schupikov.de
    Signed-off-by: Michael Schupikov
    Reviewed-by: Richard Weinberger
    Acked-by: Luis Chamberlain
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Schupikov
     
  • We include kexec.h and slab.h twice in kexec_file.c. It's unnecessary,
    hence just remove them.

    Link: http://lkml.kernel.org/r/1537498098-19171-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Reviewed-by: Bhupesh Sharma
    Reviewed-by: Andrew Morton
    Acked-by: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     
  • THP allocation mode is quite complex and it depends on the defrag mode.
    This complexity is currently hidden, for the most part, in
    alloc_hugepage_direct_gfpmask. The NUMA special casing (namely
    __GFP_THISNODE) is however independent and currently placed in
    alloc_pages_vma. This both
    adds an unnecessary branch to all vma based page allocation requests and
    it makes the code more complex unnecessarily as well. Not to mention
    that e.g. shmem THP used to do the node reclaiming unconditionally
    regardless of the defrag mode until recently. This was not only
    unexpected behavior but it was also hardly a good default behavior and I
    strongly suspect it was just a side effect of the code sharing more than
    a deliberate decision which suggests that such a layering is wrong.

    Get rid of the thp special casing from alloc_pages_vma and move the
    logic to alloc_hugepage_direct_gfpmask. __GFP_THISNODE is applied to the
    resulting gfp mask only when the direct reclaim is not requested and
    when there is no explicit numa binding to preserve the current logic.

    Please note that there's also a slight difference wrt MPOL_BIND now. The
    previous code would avoid using __GFP_THISNODE if the local node was
    outside of policy_nodemask(). After this patch __GFP_THISNODE is avoided
    for all MPOL_BIND policies. So there's a difference that if local node
    is actually allowed by the bind policy's nodemask, previously
    __GFP_THISNODE would be added, but now it won't be. From the behavior
    POV this is still correct because the policy nodemask is used.

    Link: http://lkml.kernel.org/r/20180925120326.24392-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Williamson
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Stefan Priebe - Profihost AG
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • ocfs2_defrag_extent() might leak allocated clusters. When the file
    system has insufficient space, the number of claimed clusters might be
    less than the caller wants. If that happens, the original code might
    directly commit the transaction without returning clusters.

    This patch is based on code in ocfs2_add_clusters_in_btree().

    [akpm@linux-foundation.org: include localalloc.h, reduce scope of data_ac]
    Link: http://lkml.kernel.org/r/20180904041621.16874-3-lchen@suse.com
    Signed-off-by: Larry Chen
    Reviewed-by: Andrew Morton
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc: Changwei Ge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Larry Chen
     
  • The handling of timestamps outside of the 1970..2038 range in the dlm
    glue is rather inconsistent: on 32-bit architectures, this has always
    wrapped around to negative timestamps in the 1902..1969 range, while on
    64-bit kernels all timestamps are interpreted as positive 34 bit numbers
    in the 1970..2514 year range.

    Now that the VFS code handles 64-bit timestamps on all architectures, we
    can make the behavior more consistent here, and return the same result
    that we had on 64-bit already, making the file system y2038 safe in the
    process. Outside of dlmglue, it already uses 64-bit on-disk timestamps
    anyway, so that part is fine.

    For consistency, I'm changing ocfs2_pack_timespec() to clamp anything
    outside of the supported range to the minimum and maximum values. This
    avoids a possible ambiguity of values before 1970 in particular, which
    used to be interpreted as times at the end of the 2514 range previously.

    Link: http://lkml.kernel.org/r/20180619155826.4106487-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Andrew Morton
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc: Changwei Ge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
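
The clamping described in the dlmglue timestamp commit above can be illustrated with a small packing sketch. The 34-bit seconds width comes from the commit text; everything else here is an assumption for illustration: the names are hypothetical, and the layout (seconds in the top 34 bits, nanoseconds in the remaining 30 bits of a 64-bit word) is not confirmed by the source.

```c
#include <assert.h>
#include <stdint.h>

#define SEC_BITS_SKETCH   34	/* seconds width named in the commit text */
#define SEC_SHIFT_SKETCH  (64 - SEC_BITS_SKETCH)	/* assumed layout */
#define NSEC_MASK_SKETCH  ((UINT64_C(1) << SEC_SHIFT_SKETCH) - 1)
#define SEC_MAX_SKETCH    ((INT64_C(1) << SEC_BITS_SKETCH) - 1)

/* 30 low bits are enough for nanoseconds (999999999 < 2^30). */
static_assert(SEC_SHIFT_SKETCH == 30, "30 bits left for nanoseconds");

/* Pack a timestamp, clamping seconds to the supported 0..2^34-1 range
 * instead of letting out-of-range values wrap around. */
uint64_t pack_timespec_sketch(int64_t sec, uint32_t nsec)
{
	if (sec < 0)
		sec = 0;		/* before 1970: clamp to the minimum */
	else if (sec > SEC_MAX_SKETCH)
		sec = SEC_MAX_SKETCH;	/* past the range: clamp to the maximum */

	return ((uint64_t)sec << SEC_SHIFT_SKETCH) | (nsec & NSEC_MASK_SKETCH);
}

/* Unpack; seconds are always read back as a non-negative 34-bit number. */
void unpack_timespec_sketch(uint64_t packed, int64_t *sec, uint32_t *nsec)
{
	*sec = (int64_t)(packed >> SEC_SHIFT_SKETCH);
	*nsec = (uint32_t)(packed & NSEC_MASK_SKETCH);
}
```

Without the clamp, a pre-1970 value such as -5 would be truncated to a bit pattern near 2^34 and read back as a time at the far end of the range, which is the ambiguity the commit message describes. 2^34 seconds is roughly 544 years, which matches the 1970..2514 range quoted above.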
     
  • ocfs2_read_blocks() and ocfs2_read_blocks_sync() are both used to read
    several blocks from disk. Currently, the input argument *bhs* can be
    NULL or non-NULL, depending on the caller. If the function fails to
    read blocks from disk, the corresponding bh is put and set to NULL.

    This is not appropriate for a non-NULL input bh, because the caller
    does not even know that its bhs have been put and reassigned.

    If the buffer head is managed by the caller, ocfs2_read_blocks() and
    ocfs2_read_blocks_sync() should not set it to NULL. Doing so makes the
    caller access illegal memory and crash.

    Link: http://lkml.kernel.org/r/HK2PR06MB045285E0F4FBB561F9F2F9B3D5680@HK2PR06MB0452.apcprd06.prod.outlook.com
    Signed-off-by: Changwei Ge
    Reviewed-by: Guozhonghua
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc: Changwei Ge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Changwei Ge
     
  • Somehow, file system metadata was corrupted, which causes
    ocfs2_check_dir_entry() to fail in function ocfs2_dir_foreach_blk_el().

    According to the original design intention, if the above happens we
    should skip the problematic block and continue to retrieve dir entries.
    But there is an obvious misuse of brelse around the related code.

    After ocfs2_check_dir_entry() fails, the current code just moves to the
    next position and uses the problematic buffer head again and again,
    during which the problematic buffer head is released multiple times.
    I suppose this is a serious, long-lived issue in ocfs2. It may drive
    other file systems used on the same host insane.

    So we should also consider backporting this patch to linux-stable.

    Link: http://lkml.kernel.org/r/HK2PR06MB045211675B43EED794E597B6D56E0@HK2PR06MB0452.apcprd06.prod.outlook.com
    Signed-off-by: Changwei Ge
    Suggested-by: Changkuo Shi
    Reviewed-by: Andrew Morton
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Changwei Ge
     
  • When -EIOCBQUEUED returns, it means that aio_complete() will be called
    from dio_complete(), which runs asynchronously with respect to
    write_iter. Generally, IO is much slower than executing instructions,
    but we still can't take the risk of accessing a freed iocb.

    And we did hit a BUG crash. Using the crash tool, iocb is
    obviously freed already.

    crash> struct -x kiocb ffff881a350f5900
    struct kiocb {
    ki_filp = 0xffff881a350f5a80,
    ki_pos = 0x0,
    ki_complete = 0x0,
    private = 0x0,
    ki_flags = 0x0
    }

    And the backtrace shows:
    ocfs2_file_write_iter+0xcaa/0xd00 [ocfs2]
    aio_run_iocb+0x229/0x2f0
    do_io_submit+0x291/0x540
    SyS_io_submit+0x10/0x20
    system_call_fastpath+0x16/0x75

    Link: http://lkml.kernel.org/r/1523361653-14439-1-git-send-email-ge.changwei@h3c.com
    Signed-off-by: Changwei Ge
    Reviewed-by: Andrew Morton
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Changwei Ge
     
  • During recovery of a dead node by another node, quota recovery work will
    be queued. We should avoid calling quota recovery when quota is not
    supported, so check the quota flags.

    Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA401071AC9FB@H3CMLB12-EX.srv.huawei-3com.com
    Signed-off-by: guozhonghua
    Reviewed-by: Jan Kara
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Cc: Changwei Ge
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Guozhonghua
     
  • Remove ocfs2_is_o2cb_active(). We have similar functions to identify
    which cluster stack is being used via osb->osb_cluster_stack.

    Secondly, the current implementation of ocfs2_is_o2cb_active() is not
    totally safe. Based on the design of stackglue, we need to get
    ocfs2_stack_lock before using ocfs2_stack related data structures, and
    that active_stack pointer can be NULL in the case of mount failure.

    Link: http://lkml.kernel.org/r/1495441079-11708-1-git-send-email-ghe@suse.com
    Signed-off-by: Gang He
    Reviewed-by: Joseph Qi
    Reviewed-by: Eric Ren
    Acked-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gang He
     
  • THP allocation might be really disruptive when allocated on a NUMA
    system with the local node full or hard to reclaim. Stefan has posted
    an allocation stall report on a 4.12-based SLES kernel which suggests
    the same issue:

    kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
    kvm cpuset=/ mems_allowed=0-1
    CPU: 10 PID: 84752 Comm: kvm Tainted: G W 4.12.0+98-ph 0000001 SLE15 (unreleased)
    Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
    Call Trace:
    dump_stack+0x5c/0x84
    warn_alloc+0xe0/0x180
    __alloc_pages_slowpath+0x820/0xc90
    __alloc_pages_nodemask+0x1cc/0x210
    alloc_pages_vma+0x1e5/0x280
    do_huge_pmd_wp_page+0x83f/0xf00
    __handle_mm_fault+0x93d/0x1060
    handle_mm_fault+0xc6/0x1b0
    __do_page_fault+0x230/0x430
    do_page_fault+0x2a/0x70
    page_fault+0x7b/0x80
    [...]
    Mem-Info:
    active_anon:126315487 inactive_anon:1612476 isolated_anon:5
    active_file:60183 inactive_file:245285 isolated_file:0
    unevictable:15657 dirty:286 writeback:1 unstable:0
    slab_reclaimable:75543 slab_unreclaimable:2509111
    mapped:81814 shmem:31764 pagetables:370616 bounce:0
    free:32294031 free_pcp:6233 free_cma:0
    Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

    The defrag mode is "madvise" and from the above report it is clear that
    the THP has been allocated for a MADV_HUGEPAGE vma.

    Andrea has identified that the main source of the problem is
    __GFP_THISNODE usage:

    : The problem is that direct compaction combined with the NUMA
    : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
    : hard the local node, instead of failing the allocation if there's no
    : THP available in the local node.
    :
    : Such logic was ok until __GFP_THISNODE was added to the THP allocation
    : path even with MPOL_DEFAULT.
    :
    : The idea behind the __GFP_THISNODE addition, is that it is better to
    : provide local memory in PAGE_SIZE units than to use remote NUMA THP
    : backed memory. That largely depends on the remote latency though, on
    : threadrippers for example the overhead is relatively low in my
    : experience.
    :
    : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
    : extremely slow qemu startup with vfio, if the VM is larger than the
    : size of one host NUMA node. This is because it will try very hard to
    : unsuccessfully swapout get_user_pages pinned pages as result of the
    : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
    : allocations and instead of trying to allocate THP on other nodes (it
    : would be even worse without vfio type1 GUP pins of course, except it'd
    : be swapping heavily instead).

    Fix this by removing __GFP_THISNODE for THP requests which are
    requesting direct reclaim. This effectively reverts 5265047ac301
    on the grounds that the zone/node reclaim was known to be disruptive due
    to premature reclaim when there was memory free. While it made sense at
    the time for HPC workloads without NUMA awareness on rare machines, it
    was ultimately harmful in the majority of cases. The existing behaviour
    is similar, if not as widespread as it applies to a corner case, but
    crucially, it cannot be tuned around like zone_reclaim_mode can. The
    default behaviour should always be to cause the least harm for the
    common case.
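
    The rule in the paragraph above can be sketched as a pure function on
    a flag mask. This is an illustration under stated assumptions, not the
    mempolicy.c change itself: the SKETCH_GFP_* bit values and the function
    name are hypothetical, and the real __GFP_* values differ.

```c
#include <assert.h>

/* Hypothetical bit values for illustration only. */
#define SKETCH_GFP_DIRECT_RECLAIM 0x1u
#define SKETCH_GFP_THISNODE       0x2u

/* Decide the mask for a THP request: stay on the local node only when
 * the request is not allowed to enter direct reclaim. */
unsigned int sketch_thp_gfp(unsigned int gfp)
{
    if (gfp & SKETCH_GFP_DIRECT_RECLAIM)
        gfp &= ~SKETCH_GFP_THISNODE; /* may fall back to other nodes */
    else
        gfp |= SKETCH_GFP_THISNODE;  /* cheap local attempt, fail fast */

    /* Postcondition: the two flags are never set together. */
    assert(!((gfp & SKETCH_GFP_DIRECT_RECLAIM) &&
             (gfp & SKETCH_GFP_THISNODE)));
    return gfp;
}
```

    A request that can reclaim is therefore free to take a THP from a
    remote node instead of swapping hard on the local one.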

    If there are specialised use cases out there that want zone_reclaim_mode
    in specific cases, then it can be built on top. Longterm we should
    consider a memory policy which allows for the node reclaim like behavior
    for the specific memory ranges which would allow a

    [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

    Mel said:

    : Both patches look correct to me but I'm responding to this one because
    : it's the fix. The change makes sense and moves further away from the
    : severe stalling behaviour we used to see with both THP and zone reclaim
    : mode.
    :
    : I put together a basic experiment with usemem configured to reference a
    : buffer multiple times that is 80% the size of main memory on a 2-socket
    : box with symmetric node sizes and defrag set to "always". The defrag
    : setting is not the default but it would be functionally similar to
    : accessing a buffer with madvise(MADV_HUGEPAGE). Usemem is configured to
    : reference the buffer multiple times and while it's not an interesting
    : workload, it would be expected to complete reasonably quickly as it fits
    : within memory. The results were;
    :
    : usemem
    :                          vanilla        noreclaim-v1
    : Amean   Elapsd-1    42.78 (  0.00%)   26.87 ( 37.18%)
    : Amean   Elapsd-3    27.55 (  0.00%)    7.44 ( 73.00%)
    : Amean   Elapsd-4     5.72 (  0.00%)    5.69 (  0.45%)
    :
    : This shows the elapsed time in seconds for 1 thread, 3 threads and 4
    : threads referencing buffers 80% the size of memory. With the patches
    : applied, it's 37.18% faster for the single thread and 73% faster with
    : three threads. Note that 4 threads showing little difference does not
    : indicate the problem is related to thread counts. It's simply the case
    : that 4 threads get spread so their workload mostly fits in one node.
    :
    : The overall view from /proc/vmstat is more startling
    :
    :                             4.19.0-rc1      4.19.0-rc1
    :                                vanilla  noreclaim-v1r1
    : Minor Faults                  35593425          708164
    : Major Faults                    484088              36
    : Swap Ins                       3772837               0
    : Swap Outs                      3932295               0
    :
    : Massive amounts of swap in/out without the patch
    :
    : Direct pages scanned           6013214               0
    : Kswapd pages scanned                 0               0
    : Kswapd pages reclaimed               0               0
    : Direct pages reclaimed         4033009               0
    :
    : Lots of reclaim activity without the patch
    :
    : Kswapd efficiency                 100%            100%
    : Kswapd velocity                  0.000           0.000
    : Direct efficiency                  67%            100%
    : Direct velocity              11191.956           0.000
    :
    : Mostly from direct reclaim context as you'd expect without the patch.
    :
    : Page writes by reclaim     3932314.000           0.000
    : Page writes file                    19               0
    : Page writes anon               3932295               0
    : Page reclaim immediate           42336               0
    :
    : Writes from reclaim context is never good but the patch eliminates it.
    :
    : We should never have default behaviour to thrash the system for such a
    : basic workload. If zone reclaim mode behaviour is ever desired but on a
    : single task instead of a global basis then the sensible option is to build
    : a mempolicy that enforces that behaviour.
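
    The "efficiency" rows quoted above are consistent with reading
    efficiency as pages reclaimed as a percentage of pages scanned, with
    an idle reclaimer shown as 100%. That definition is an assumption and
    is not stated in the quoted mail; the helper name below is
    hypothetical.

```c
#include <assert.h>

/* Reclaimed pages as a whole-number percentage of scanned pages.
 * Assumption: when nothing was scanned, efficiency is reported as 100. */
unsigned int sketch_reclaim_efficiency_pct(unsigned long long scanned,
                                           unsigned long long reclaimed)
{
    assert(reclaimed <= scanned);

    if (scanned == 0)
        return 100;
    return (unsigned int)(reclaimed * 100ULL / scanned);
}
```

    Under that assumption, the vanilla direct reclaim counters (6013214
    scanned, 4033009 reclaimed) give 67, and the all-zero kswapd counters
    give 100, matching the quoted rows.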

    This was a severe regression compared to previous kernels that made
    important workloads unusable, and it started when __GFP_THISNODE was
    added to THP allocations under MADV_HUGEPAGE. Going back to the
    behavior from before __GFP_THISNODE was added is not a significant
    risk, since it worked like that for years.

    This was simply an optimization to some lucky workloads that can fit in
    a single node, but it ended up breaking the VM for others that can't
    possibly fit in a single node, so going back is safe.

    [mhocko@suse.com: rewrote the changelog based on the one from Andrea]
    Link: http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org
    Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Michal Hocko
    Reported-by: Stefan Priebe
    Debugged-by: Andrea Arcangeli
    Reported-by: Alex Williamson
    Reviewed-by: Mel Gorman
    Tested-by: Mel Gorman
    Cc: Zi Yan
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: [4.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli