29 May, 2016

1 commit

  • Pull more rdma updates from Doug Ledford:
    "This is the second group of code for the 4.7 merge window. It looks
    large, but only in one sense. I'll get to that in a minute. The list
    of changes here breaks down as follows:

    - Dynamic counter infrastructure in the IB drivers

    This is sysfs-based code that allows free-form access to the
    hardware counters RDMA devices might support, so drivers don't need
    to code this up repeatedly themselves

    - SendOnlyFullMember multicast support

    - IB router support

    - A couple misc fixes

    - The big item on the list: hfi1 driver updates, plus moving the hfi1
    driver out of staging

    There was a group of 15 patches in the hfi1 list that I thought I had
    in the first pull request, but they weren't there. So that added to the
    length of the hfi1 section here.

    As far as these go, everything but the hfi1 is pretty
    straightforward.

    The hfi1 is, if you recall, the driver that Al had complaints about
    how it used the write/writev interfaces in an overloaded fashion. The
    write portion of their interface behaved like the write handler in the
    IB stack proper and did bi-directional communications. The writev
    interface, on the other hand, only accepts SDMA request structures.
    The completions for those structures are sent back via an entirely
    different event mechanism.

    With the security patch, we put security checks on the write
    interface, however, we also knew they would be going away soon. Now,
    we've converted the write handler in the hfi1 driver to use ioctls
    from the IB reserved magic area for its bidirectional communications.
    With that change, Intel has addressed all of the items originally on
    their TODO when they went into staging (as well as many items added to
    the list later).

    As such, I moved them out, and since they were the last item in the
    staging/rdma directory, and I don't have immediate plans to use the
    staging area again, I removed the staging/rdma area.

    Because of the move out of staging, as well as a series of 5 patches
    in the hfi1 driver that removed code people thought should be done in
    a different way and was optional to begin with (a snoop debug
    interface, an eeprom driver for an eeprom connected directly to their
    hfi1 chip and not via an i2c bus, and a few other things like that),
    the line count, especially the removal count, is high"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (56 commits)
    staging/rdma: Remove the entire rdma subdirectory of staging
    IB/core: Make device counter infrastructure dynamic
    IB/hfi1: Fix pio map initialization
    IB/hfi1: Correct 8051 link parameter settings
    IB/hfi1: Update pkey table properly after link down or FM start
    IB/rdamvt: Fix rdmavt s_ack_queue sizing
    IB/rdmavt: Max atomic value should be a u8
    IB/hfi1: Fix hard lockup due to not using save/restore spin lock
    IB/hfi1: Add tracing support for send with invalidate opcode
    IB/hfi1, qib: Add ieth to the packet header definitions
    IB/hfi1: Move driver out of staging
    IB/hfi1: Do not free hfi1 cdev parent structure early
    IB/hfi1: Add trace message in user IOCTL handling
    IB/hfi1: Remove write(), use ioctl() for user cmds
    IB/hfi1: Add ioctl() interface for user commands
    IB/hfi1: Remove unused user command
    IB/hfi1: Remove snoop/diag interface
    IB/hfi1: Remove EPROM functionality from data device
    IB/hfi1: Remove UI char device
    IB/hfi1: Remove multiple device cdev
    ...

    Linus Torvalds
     

28 May, 2016

39 commits

  • Pull more input subsystem updates from Dmitry Torokhov:
    "Just a few more driver fixes; new drivers will be coming in the next
    merge window"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
    Input: pwm-beeper - fix - scheduling while atomic
    Input: xpad - xbox one elite controller support
    Input: xpad - add more third-party controllers
    Input: xpad - prevent spurious input from wired Xbox 360 controllers
    Input: xpad - move pending clear to the correct location
    Input: uinput - handle compat ioctl for UI_SET_PHYS

    Linus Torvalds
     
  • Pull more i2c updates from Wolfram Sang:
    "Here is the second pull request from I2C for this merge window:

    - one new feature (which nearly fell through the cracks): i2c-dev
    does now use the cdev API so it can handle >256 minors. Seems
    people do need that.

    - two fixes for the just added DMA feature for i2c-rcar

    - some typo fixes"

    * 'i2c/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
    i2c: dev: don't start function name with 'return'
    i2c: dev: switch from register_chrdev to cdev API
    i2c: xlr: rename ARCH_TANGOX to ARCH_TANGO
    i2c: at91: change log when dma configuration fails
    misc: at24: Fix typo in at24 header file
    i2c: rcar: should depend on HAS_DMA
    i2c: rcar: use dma_request_chan()

    Linus Torvalds
     
  • Pull UML updates from Richard Weinberger:
    "This contains a nice FPU fixup from Eli Cooper for UML"

    * 'for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
    um: add extended processor state save/restore support
    um: extend fpstate to _xstate to support YMM registers
    um: fix FPU state preservation around signal handlers

    Linus Torvalds
     
  • Pull UBI/UBIFS updates from Richard Weinberger:
    "This contains mostly cleanups and minor improvements of UBI and UBIFS"

    * tag 'upstream-4.7-rc1' of git://git.infradead.org/linux-ubifs:
    ubifs: ubifs_dump_inode: Fix dumping field bulk_read
    UBI: Fix static volume checks when Fastmap is used
    UBI: Set free_count to zero before walking through erase list
    UBI: Silence an unintialized variable warning
    UBI: Clean up return in ubi_remove_volume()
    UBI: Modify wrong comment in ubi_leb_map function.
    UBI: Don't read back all data in ubi_eba_copy_leb()
    UBI: Add ro-mode sysfs attribute

    Linus Torvalds
     
  • Older versions of gcc don't understand named initializers inside an
    anonymous structure or union member. It can be worked around by adding
    the bracing in the initializer for the anonymous member.

    Without this, gcc 4.4.4 will fail the build with

    CC fs/nfs/nfs4state.o
    fs/nfs/nfs4state.c:69: error: unknown field ‘data’ specified in initializer
    fs/nfs/nfs4state.c:69: warning: missing braces around initializer
    fs/nfs/nfs4state.c:69: warning: (near initialization for ‘zero_stateid..data’)
    make[2]: *** [fs/nfs/nfs4state.o] Error 1

    introduced in commit 93b717fd81bf ("NFSv4: Label stateids with the type")
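
    The workaround can be sketched in a standalone form. This is a minimal
    illustration, not the nfs4 code: struct toy_stateid, its fields, and
    toy_stateid_is_zero() are hypothetical names, and the layout is only
    assumed to resemble a stateid with an anonymous union.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: an anonymous union followed by a type tag. */
struct toy_stateid {
    union {
        char data[16];
        struct {
            unsigned int seqid;
            char other[12];
        };
    };
    int type;
};

/*
 * The extra pair of braces wraps the initializer of the anonymous union,
 * so the designator .data is looked up inside the union instead of at the
 * top level of the structure.
 */
const struct toy_stateid toy_zero_stateid = {
    { .data = { 0 } },
    .type = 1,
};

/* Returns 1 if every byte of the union is zero, 0 otherwise. */
int toy_stateid_is_zero(const struct toy_stateid *s)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < sizeof(s->data); i++)
        if (s->data[i] != 0)
            return 0;
    return 1;
}
```

    Newer compilers accept the unbraced form as well; the braced form is
    the one that should also build with the older gcc mentioned above.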

    Reported-and-tested-by: Boris Ostrovsky
    Cc: Anna Schumaker
    Cc: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull vfs fixes from Al Viro:
    "Followups to the parallel lookup work:

    - update docs

    - restore killability of the places that used to take ->i_mutex
    killably now that we have down_write_killable() merged

    - Additionally, it turns out that I missed a prerequisite for
    security_d_instantiate() stuff - ->getxattr() wasn't the only thing
    that could be called before dentry is attached to inode; with smack
    we needed the same treatment applied to ->setxattr() as well"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    switch ->setxattr() to passing dentry and inode separately
    switch xattr_handler->set() to passing dentry and inode separately
    restore killability of old mutex_lock_killable(&inode->i_mutex) users
    add down_write_killable_nested()
    update D/f/directory-locking

    Linus Torvalds
     
  • smack ->d_instantiate() uses ->setxattr(), so to be able to call it before
    we'd hashed the new dentry and attached it to inode, we need ->setxattr()
    instances getting the inode as an explicit argument rather than obtaining
    it from dentry.

    Similar change for ->getxattr() had been done in commit ce23e64. Unlike
    ->getxattr() (which is used by both selinux and smack instances of
    ->d_instantiate()) ->setxattr() is used only by smack one and unfortunately
    it got missed back then.

    Reported-by: Seung-Woo Kim
    Tested-by: Casey Schaufler
    Signed-off-by: Al Viro

    Al Viro
     
  • Pull overlayfs update from Miklos Szeredi:
    "The meat of this is a change to use the mounter's credentials for
    operations that require elevated privileges (such as whiteout
    creation). This fixes behavior under user namespaces as well as being
    a nice cleanup"

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ovl: Do d_type check only if work dir creation was successful
    ovl: update documentation
    ovl: override creds with the ones from the superblock mounter

    Linus Torvalds
     
  • Pwm config may sleep so defer it using a worker.

    On a Freescale i.MX53 based board we ran into "BUG: scheduling while
    atomic" because input_inject_event locks interrupts, but
    imx_pwm_config_v2 sleeps.

    Tested on Freescale i.MX53 SoC with 4.6.0.

    Signed-off-by: Manfred Schlaegl
    Cc: stable@vger.kernel.org
    Signed-off-by: Dmitry Torokhov

    Manfred Schlaegl
     
  • Pull btrfs cleanups and fixes from Chris Mason:
    "We have another round of fixes and a few cleanups.

    I have a fix for short returns from btrfs_copy_from_user, which
    finally nails down a very hard to find regression we added in v4.6.

    Dave is pushing around gfp parameters, mostly to clean up internal APIs
    and make them a little more consistent.

    The rest are smaller fixes, and one speelling fixup patch"

    * 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (22 commits)
    Btrfs: fix handling of faults from btrfs_copy_from_user
    btrfs: fix string and comment grammatical issues and typos
    btrfs: scrub: Set bbio to NULL before calling btrfs_map_block
    Btrfs: fix unexpected return value of fiemap
    Btrfs: free sys_array eb as soon as possible
    btrfs: sink gfp parameter to convert_extent_bit
    btrfs: make state preallocation more speculative in __set_extent_bit
    btrfs: untangle gotos a bit in convert_extent_bit
    btrfs: untangle gotos a bit in __clear_extent_bit
    btrfs: untangle gotos a bit in __set_extent_bit
    btrfs: sink gfp parameter to set_record_extent_bits
    btrfs: sink gfp parameter to set_extent_new
    btrfs: sink gfp parameter to set_extent_defrag
    btrfs: sink gfp parameter to set_extent_delalloc
    btrfs: sink gfp parameter to clear_extent_dirty
    btrfs: sink gfp parameter to clear_record_extent_bits
    btrfs: sink gfp parameter to clear_extent_bits
    btrfs: sink gfp parameter to set_extent_bits
    btrfs: make find_workspace warn if there are no workspaces
    btrfs: make find_workspace always succeed
    ...

    Linus Torvalds
     
  • Added the corresponding ID and increased XPAD_PKT_LEN to 64, as the Elite
    controller sends at least 33-byte messages [1].
    Verified to be working by [2].

    [1]: https://franticrain.github.io/sniffs/XboxOneSniff.html
    [2]: https://github.com/paroj/xpad/issues/23

    Signed-off-by: Pierre-Loup A. Griffais
    Signed-off-by: Pavel Rojtberg
    Signed-off-by: Dmitry Torokhov

    Pavel Rojtberg
     
  • Signed-off-by: Pierre-Loup A. Griffais
    Signed-off-by: Thomas Debesse
    Signed-off-by: aronschatz
    Signed-off-by: Pavel Rojtberg
    Signed-off-by: Dmitry Torokhov

    Pavel Rojtberg
     
  • After initially connecting a wired Xbox 360 controller or sending it
    a command to change LEDs, a status/response packet is interpreted as
    controller input. This causes the state of buttons represented in
    byte 2 of the controller data packet to be incorrect until the next
    valid input packet. Wireless Xbox 360 controllers are not affected.

    Writing a new value to the LED device while holding the Start button
    and running jstest is sufficient to reproduce this bug. An event will
    come through with the Start button released.

    Xboxdrv also won't attempt to read controller input from a packet
    where byte 0 is non-zero. It also checks that byte 1 is 0x14, but
    that value differs between wired and wireless controllers and this
    code is shared by both. I think just checking byte 0 is enough to
    eliminate unwanted packets.

    The following are some examples of 3-byte status packets I saw:
    01 03 02
    02 03 00
    03 03 03
    08 03 00
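
    The byte 0 check described above can be sketched outside the driver.
    This is an illustration only: toy_is_input_packet() and
    toy_status_packets are hypothetical names, and the table simply holds
    the four status packets listed above.

```c
#include <assert.h>
#include <stddef.h>

/* The 3-byte status packets quoted in the commit message. */
const unsigned char toy_status_packets[4][3] = {
    { 0x01, 0x03, 0x02 },
    { 0x02, 0x03, 0x00 },
    { 0x03, 0x03, 0x03 },
    { 0x08, 0x03, 0x00 },
};

/*
 * Hypothetical filter: returns 1 if the packet should be parsed as
 * controller input, 0 if it looks like a status/response packet.
 * Only byte 0 is checked, because byte 1 differs between wired and
 * wireless controllers.
 */
int toy_is_input_packet(const unsigned char *data, size_t len)
{
    assert(data != NULL);
    if (len == 0)
        return 0;
    return data[0] == 0x00;
}
```

    With this filter, each of the listed status packets is rejected, while
    a packet whose first byte is zero is passed on to the input parser.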

    Signed-off-by: Cameron Gutman
    Signed-off-by: Pavel Rojtberg
    Cc: stable@vger.kernel.org
    Signed-off-by: Dmitry Torokhov

    Cameron Gutman
     
  • otherwise we lose ff commands: https://github.com/paroj/xpad/issues/27

    Signed-off-by: Pavel Rojtberg
    Cc: stable@vger.kernel.org
    Signed-off-by: Dmitry Torokhov

    Pavel Rojtberg
     
  • Now that the allmodconfig x86-64 build is clean wrt IS_ERR_VALUE() uses
    on integers, add a cast to a pointer and back to the argument, so that
    any new mis-uses of IS_ERR_VALUE() will cause warnings like

    warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]

    so that we don't re-introduce any bogus uses.
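
    A userspace sketch of the idea, under the assumption that the macro
    takes the shape described above (cast to a pointer, then back to an
    unsigned long). TOY_IS_ERR_VALUE, TOY_MAX_ERRNO and toy_map() are
    illustrative names, not the kernel definitions.

```c
#include <assert.h>

#define TOY_MAX_ERRNO 4095

/*
 * Casting the argument to a pointer first means that an integer whose
 * size differs from a pointer makes gcc warn (-Wint-to-pointer-cast).
 * An unsigned long passes silently on targets where unsigned long and
 * pointers have the same size.
 */
#define TOY_IS_ERR_VALUE(x) \
    ((unsigned long)(void *)(x) >= (unsigned long)-TOY_MAX_ERRNO)

/* Hypothetical mmap()-like helper: an address on success, -12 on failure. */
unsigned long toy_map(int fail)
{
    return fail ? (unsigned long)-12 : 0x10000UL;
}
```

    Calling TOY_IS_ERR_VALUE() on the return value of toy_map() compiles
    cleanly on such targets; calling it on a plain int is what should
    trigger the warning.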

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The do_brk() and vm_brk() return value was "unsigned long" and returned
    the starting address on success, and an error value on failure. The
    reasons are entirely historical, and go back to it basically behaving
    like the mmap() interface does.

    However, nobody actually wanted that interface, and it causes totally
    pointless IS_ERR_VALUE() confusion.

    What every single caller actually wants is just the simpler integer
    return of zero for success and negative error number on failure.

    So just convert to that much clearer and more common calling convention,
    and get rid of all the IS_ERR_VALUE() uses wrt vm_brk().
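
    The calling convention can be illustrated with a small userspace
    sketch. toy_brk() and toy_setup_bss() are hypothetical; they only
    validate a range and do not map any memory.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical brk-like helper using the "zero or negative errno"
 * convention instead of returning an address.
 */
int toy_brk(unsigned long addr, unsigned long len)
{
    if (len == 0)
        return 0;               /* nothing to do */
    if (addr + len < addr)
        return -ENOMEM;         /* range wraps around the address space */
    return 0;
}

/* Caller: a plain "if (err)" replaces any IS_ERR_VALUE() test. */
int toy_setup_bss(unsigned long start, unsigned long end)
{
    int err;

    if (end < start)
        return -EINVAL;
    err = toy_brk(start, end - start);
    if (err)
        return err;
    return 0;
}
```

    Because the caller already knows the address it asked for, returning
    it again adds nothing, and the integer return keeps error handling
    uniform.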

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Most users of IS_ERR_VALUE() in the kernel are wrong, as they
    pass an 'int' into a function that takes an 'unsigned long'
    argument. This happens to work because the type is sign-extended
    on 64-bit architectures before it gets converted into an
    unsigned type.

    However, anything that passes an 'unsigned short' or 'unsigned int'
    argument into IS_ERR_VALUE() is guaranteed to be broken, as are
    8-bit integers and types that are wider than 'unsigned long'.
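
    The difference between the two cases can be shown with a short
    userspace sketch. The toy_* names are hypothetical, and the check is
    written as a function taking 'unsigned long', as described above.

```c
#include <assert.h>

#define TOY_MAX_ERRNO 4095

/* Old-style check on an 'unsigned long' argument. */
int toy_is_err_value(unsigned long x)
{
    return x >= (unsigned long)-TOY_MAX_ERRNO;
}

/* An int is sign-extended on conversion, so -22 stays in the error range. */
int toy_check_int(int err)
{
    return toy_is_err_value((unsigned long)err);
}

/*
 * An unsigned int is zero-extended. Where unsigned long is 64 bits wide,
 * (unsigned int)-22 becomes 0x00000000ffffffea, which lies below the
 * error range, so the error goes undetected.
 */
int toy_check_uint(unsigned int err)
{
    return toy_is_err_value((unsigned long)err);
}
```

    On a target with a 64-bit unsigned long, toy_check_int(-22) reports an
    error while toy_check_uint((unsigned int)-22) does not, which is the
    breakage the paragraph above describes.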

    Andrzej Hajda has already fixed a lot of the worst abusers that
    were causing actual bugs, but it would be nice to prevent any
    users that are not passing 'unsigned long' arguments.

    This patch changes all users of IS_ERR_VALUE() that I could find
    on 32-bit ARM randconfig builds and x86 allmodconfig. For the
    moment, this doesn't change the definition of IS_ERR_VALUE()
    because there are probably still architecture specific users
    elsewhere.

    Almost all the warnings I got are for files that are better off
    using 'if (err)' or 'if (err < 0)'.
    The only legitimate user I could find that we get a warning for
    is the (32-bit only) freescale fman driver, so I did not remove
    the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
    For 9pfs, I just worked around one user whose calling conventions
    are so obscure that I did not dare change the behavior.

    I was using this definition for testing:

    #define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
    unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))

    which ends up making all 16-bit or wider types work correctly with
    the most plausible interpretation of what IS_ERR_VALUE() was supposed
    to return according to its users, but also causes a compile-time
    warning for any users that do not pass an 'unsigned long' argument.

    I suggested this approach earlier this year, but back then we ended
    up deciding to just fix the users that are obviously broken. After
    the initial warning that caused me to get involved in the discussion
    (fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
    asked me to send the whole thing again.

    [ Updated the 9p parts as per Al Viro - Linus ]

    Signed-off-by: Arnd Bergmann
    Cc: Andrzej Hajda
    Cc: Andrew Morton
    Link: https://lkml.org/lkml/2016/1/7/363
    Link: https://lkml.org/lkml/2016/5/27/486
    Acked-by: Srinivas Kandagatla # For nvmem part
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • The register_page_bootmem_info_node() function needs to be marked __init
    in order to avoid a new warning introduced by commit f65e91df25aa ("mm:
    use early_pfn_to_nid in register_page_bootmem_info_node").

    Otherwise you'll get a warning about how a non-init function calls
    early_pfn_to_nid (which is __meminit)

    Cc: Yang Shi
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Merge misc updates and fixes from Andrew Morton:

    - late-breaking ocfs2 updates

    - random bunch of fixes

    * emailed patches from Andrew Morton:
    mm: disable DEFERRED_STRUCT_PAGE_INIT on !NO_BOOTMEM
    mm/memcontrol.c: move comments for get_mctgt_type() to proper position
    mm/memcontrol.c: fix the margin computation in mem_cgroup_margin()
    mm/cma: silence warnings due to max() usage
    mm: thp: avoid false positive VM_BUG_ON_PAGE in page_move_anon_rmap()
    oom_reaper: close race with exiting task
    mm: use early_pfn_to_nid in register_page_bootmem_info_node
    mm: use early_pfn_to_nid in page_ext_init
    MAINTAINERS: Kdump maintainers update
    MAINTAINERS: add kexec_core.c and kexec_file.c
    mm: oom: do not reap task if there are live threads in threadgroup
    direct-io: fix direct write stale data exposure from concurrent buffered read
    ocfs2: bump up o2cb network protocol version
    ocfs2: o2hb: fix hb hung time
    ocfs2: o2hb: don't negotiate if last hb fail
    ocfs2: o2hb: add some user/debug log
    ocfs2: o2hb: add NEGOTIATE_APPROVE message
    ocfs2: o2hb: add NEGO_TIMEOUT message
    ocfs2: o2hb: add negotiate timer

    Linus Torvalds
     
  • When we have !NO_BOOTMEM, the deferred page struct initialization
    doesn't work well because the pages reserved in bootmem are released to
    the page allocator unconditionally. This eventually causes memory
    corruption and a system crash.

    As Mel suggested, bootmem is slowly being retired. We fix the issue by
    simply hiding DEFERRED_STRUCT_PAGE_INIT when bootmem is enabled.

    Link: http://lkml.kernel.org/r/1460602170-5821-1-git-send-email-gwshan@linux.vnet.ibm.com
    Signed-off-by: Gavin Shan
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • Move the comments for get_mctgt_type() to be before get_mctgt_type()
    implementation.

    Link: http://lkml.kernel.org/r/1463644638-7446-1-git-send-email-roy.qing.li@gmail.com
    Signed-off-by: Li RongQing
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing
     
  • mem_cgroup_margin() might return (memory.limit - memory_count) when the
    memsw.limit is in excess. This doesn't happen usually because we do not
    allow excess on hard limits and (memory.limit
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li RongQing
     
  • pageblock_order can be (at least) an unsigned int or an unsigned long
    depending on the kernel config and architecture, so use max_t(unsigned
    long, ...) when comparing it.

    fixes these warnings:

    In file included from include/asm-generic/bug.h:13:0,
    from arch/powerpc/include/asm/bug.h:127,
    from include/linux/bug.h:4,
    from include/linux/mmdebug.h:4,
    from include/linux/mm.h:8,
    from include/linux/memblock.h:18,
    from mm/cma.c:28:
    mm/cma.c: In function 'cma_init_reserved_mem':
    include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
    (void) (&_max1 == &_max2); ^
    mm/cma.c:186:27: note: in expansion of macro 'max'
    alignment = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
    ^
    mm/cma.c: In function 'cma_declare_contiguous':
    include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
    (void) (&_max1 == &_max2); ^
    include/linux/kernel.h:747:9: note: in definition of macro 'max'
    typeof(y) _max2 = (y); ^
    mm/cma.c:270:29: note: in expansion of macro 'max'
    (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order));
    ^
    include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
    (void) (&_max1 == &_max2); ^
    include/linux/kernel.h:747:21: note: in definition of macro 'max'
    typeof(y) _max2 = (y); ^
    mm/cma.c:270:29: note: in expansion of macro 'max'
    (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order));
    ^
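
    A simplified sketch of why max_t() avoids the warning: both operands
    are converted to one named type before they are compared, so no
    pointer-type comparison between differently typed temporaries is
    needed. toy_max_t() and toy_cma_alignment() are illustrative names,
    the macro relies on GNU statement expressions (gcc and clang), and
    max_order and page_size are plain parameters here rather than kernel
    constants.

```c
#include <assert.h>

/* Simplified max_t(): force both operands to 'type', then compare. */
#define toy_max_t(type, x, y) ({                    \
    type toy_max1 = (x);                            \
    type toy_max2 = (y);                            \
    toy_max1 > toy_max2 ? toy_max1 : toy_max2; })

/* Hypothetical version of the alignment computation quoted above. */
unsigned long toy_cma_alignment(int max_order, unsigned long pageblock_order,
                                unsigned long page_size)
{
    return page_size << toy_max_t(unsigned long, max_order - 1,
                                  pageblock_order);
}
```

    The result is the same whether pageblock_order is declared as an
    unsigned int or an unsigned long by the caller, since it is converted
    to unsigned long either way.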

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20160526150748.5be38a4f@canb.auug.org.au
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • If page_move_anon_rmap() is refiling a pmd-splitted THP mapped in a tail
    page from a pte, the "address" must be THP aligned in order for the
    page->index bugcheck to pass in the CONFIG_DEBUG_VM=y builds.

    Link: http://lkml.kernel.org/r/1464253620-106404-1-git-send-email-kirill.shutemov@linux.intel.com
    Fixes: 6d0a07edd17c ("mm: thp: calculate the mapcount correctly for THP pages during WP faults")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Mika Westerberg
    Tested-by: Mika Westerberg
    Reviewed-by: Andrea Arcangeli
    Cc: [4.5]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Tetsuo has reported:
    Out of memory: Kill process 443 (oleg's-test) score 855 or sacrifice child
    Killed process 443 (oleg's-test) total-vm:493248kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB
    sh invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
    sh cpuset=/ mems_allowed=0
    CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
    Call Trace:
    dump_stack+0x85/0xc8
    dump_header+0x5b/0x394
    oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    In other words:

    __oom_reap_task             exit_mm
      atomic_inc_not_zero
                                  tsk->mm = NULL
                                  mmput
                                    atomic_dec_and_test # > 0
                                  exit_oom_victim # New victim will be
                                                  # selected

                                # no TIF_MEMDIE task so we can select a new one
      unmap_page_range # to release the memory

    The race exists even without the oom_reaper because anybody who pins the
    address space and gets preempted might race with exit_mm but oom_reaper
    made this race more probable.

    We can address the oom_reaper part by using oom_lock for __oom_reap_task
    because this would guarantee that a new oom victim will not be selected
    if the oom reaper might race with the exit path. This doesn't solve the
    original issue, though, because somebody else still might be pinning
    mm_users and so __mmput won't be called to release the memory but that
    is not really reliably solvable because the task will get away from the
    oom sight as soon as it is unhashed from the task_list and so we cannot
    guarantee a new victim won't be selected.

    [akpm@linux-foundation.org: fix use of unused `mm', Per Stephen]
    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: aac453635549 ("mm, oom: introduce oom reaper")
    Link: http://lkml.kernel.org/r/1464271493-20008-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • register_page_bootmem_info_node() is invoked in mem_init(), so it will
    be called before page_alloc_init_late() if DEFERRED_STRUCT_PAGE_INIT is
    enabled. But, pfn_to_nid() depends on memmap which won't be fully setup
    until page_alloc_init_late() is done, so replace pfn_to_nid() by
    early_pfn_to_nid().

    Link: http://lkml.kernel.org/r/1464210007-30930-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • page_ext_init() checks suitable pages with pfn_to_nid(), but
    pfn_to_nid() depends on memmap which will not be fully set up until
    page_alloc_init_late() is done. Use early_pfn_to_nid() instead of
    pfn_to_nid() so that page extension can still be used early, even
    when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, and can catch early
    page allocation call sites.

    Suggested by Joonsoo Kim [1], this fix basically undoes the change
    introduced by commit b8f1a75d61d840 ("mm: call page_ext_init() after all
    struct pages are initialized") and fixes the same problem with a better
    approach.

    [1] http://lkml.kernel.org/r/CAAmzW4OUmyPwQjvd7QUfc6W1Aic__TyAuH80MLRZNMxKy0-wPQ@mail.gmail.com

    Link: http://lkml.kernel.org/r/1464198689-23458-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • I am proposing the following updates to kdump maintainership. I have
    become busy with other things and am not getting time to spend on kdump.

    Remove Haren Myneni as he has not participated in kdump development for
    a long time now.

    Add the names of Dave and Baoquan as kdump maintainers as they have been
    contributing to kdump for a long time now and they are in a much better
    position to spend time on this than me.

    Mark myself as a reviewer.

    Link: http://lkml.kernel.org/r/20160525131616.GB27291@redhat.com
    Signed-off-by: Vivek Goyal
    Acked-by: Simon Horman
    Cc: Haren Myneni
    Cc: Dave Young
    Cc: Baoquan He
    Cc: "Eric W. Biederman"
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • In the below commits kexec.c was split to kexec.c, kexec_file.c and
    kexec_core.c.

    commit a43cac0d9dc2 ("kexec: split kexec_file syscall code to kexec_file.c")
    commit 2965faa5e03d ("kexec: split kexec_load syscall from kexec core code")

    Both kexec_file.c and kexec_core.c still belong to the kexec component.
    In order to get correct mail lists by using the script get_maintainer.pl,
    add these files to MAINTAINERS.

    Link: http://lkml.kernel.org/r/1464189735-59113-1-git-send-email-mnghuan@gmail.com
    Signed-off-by: Minfei Huang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minfei Huang
     
  • If the current process is exiting, we don't invoke oom killer, instead
    we give it access to memory reserves and try to reap its mm in case
    nobody is going to use it. There's a mistake in the code performing
    this check - we just ignore any process of the same thread group no
    matter if it is exiting or not - see try_oom_reaper. Fix it.

    Link: http://lkml.kernel.org/r/1464087628-7318-1-git-send-email-vdavydov@virtuozzo.com
    Fixes: 3ef22dfff239 ("oom, oom_reaper: try to reap tasks which skip regular OOM killer path")
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently direct writes inside i_size on a DIO_SKIP_HOLES filesystem are
    not allowed to allocate blocks (get_more_blocks() sets 'create' to 0
    before calling the get_block() callback). If it's a sparse file, direct
    writes fall back to buffered writes to avoid stale data exposure from
    a concurrent buffered read. But there are two cases that can result in
    stale data exposure and are not correctly detected.

    1. The detection for "writing inside i_size" is not sufficient;
    writes can wrongly be treated as "extending writes". For example,
    a direct write of 1 FSB (file system block) to a 1 FSB sparse file on
    ext2/3/4, starting from offset 0, is writing inside i_size, but
    'create' is non-zero, because 'block_in_file' and
    '(i_size_read(inode) >> blkbits)' are both zero.

    2. Direct writes starting from or beyond i_size (not inside i_size)
    also could trigger block allocation and expose stale data. For
    example, consider a sparse file with i_size of 2k, and a write to
    offset 2k or 3k into the file, with a filesystem block size of 4k.
    (Thanks to Jeff Moyer for pointing this case out in his review.)

    The first problem can be demonstrated by running ltp-aiodio test ADSP045
    many times. When testing on extN filesystems, I see occasional test
    failures where a buffered read returns non-zero (stale) data.

    ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 1

    dio_sparse 0 TINFO : Dirtying free blocks
    dio_sparse 0 TINFO : Starting I/O tests
    non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa
    non-zero read at offset 0
    dio_sparse 0 TINFO : Killing childrens(s)
    dio_sparse 1 TFAIL : dio_sparse.c:191: 1 children(s) exited abnormally

    The second problem can also be reproduced easily by a hacked dio_sparse
    program, which accepts an option to specify the write offset.

    What we should really do is to disable block allocation for writes that
    could result in filling holes inside i_size.
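
    One way to express "could fill holes inside i_size" is to compare the
    starting block against the block that holds the last byte of the file.
    The sketch below contrasts that with the weaker comparison described
    in case 1. Both functions are hypothetical illustrations, and whether
    they match the actual patch is an assumption; the values in the usage
    note come from the ADSP045 parameters (2k file, 4k block size).

```c
#include <assert.h>

/*
 * Hypothetical old check: allocation is allowed unless the starting
 * block is strictly below i_size expressed in whole blocks.
 */
int toy_create_old(unsigned long long i_size, unsigned long long start_blk,
                   unsigned int blkbits)
{
    return !(start_blk < (i_size >> blkbits));
}

/*
 * Hypothetical stricter check: refuse allocation whenever the starting
 * block is at or before the block holding the last byte of the file.
 */
int toy_create_new(unsigned long long i_size, unsigned long long start_blk,
                   unsigned int blkbits)
{
    if (i_size && start_blk <= ((i_size - 1) >> blkbits))
        return 0;
    return 1;
}
```

    For a 2k sparse file with 4k blocks (blkbits of 12), a write starting
    in block 0 gets create set by the old check but not by the stricter
    one, while a write starting in block 1 is still allowed to allocate.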

    Link: http://lkml.kernel.org/r/1463156728-13357-1-git-send-email-guaneryu@gmail.com
    Reviewed-by: Jan Kara
    Signed-off-by: Eryu Guan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eryu Guan
     
  • Two new messages are added to support negotiating hb timeout. Stop
    nodes that talk an old version from mounting, as they will cause the
    negotiation to fail.

    Link: http://lkml.kernel.org/r/1464231615-27939-1-git-send-email-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • hr_last_timeout_start should be set to the last time at which hb was
    still OK. When the hb write times out, the hung time will be (jiffies -
    hr_last_timeout_start).

    Signed-off-by: Junxiao Bi
    Reviewed-by: Ryan Ding
    Reviewed-by: Mark Fasheh
    Cc: Gang He
    Cc: rwxybh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • Sometimes an I/O error is returned when storage is down for a while. For
    example, with an iSCSI device, storage is taken offline when the session
    times out, and this makes all I/O return -EIO. In this case, nodes
    shouldn't negotiate the timeout but should fence themselves. So let
    nodes fence themselves when o2hb_do_disk_heartbeat returns an error;
    this is the same behavior as o2hb without the negotiate timer.

    Signed-off-by: Junxiao Bi
    Reviewed-by: Ryan Ding
    Reviewed-by: Mark Fasheh
    Cc: Gang He
    Cc: rwxybh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • Signed-off-by: Junxiao Bi
    Reviewed-by: Ryan Ding
    Reviewed-by: Mark Fasheh
    Cc: Gang He
    Cc: rwxybh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • This message is used to re-queue the write timeout timer and the
    negotiate timer when all nodes suffer a write hang to storage. This
    keeps nodes from fencing themselves when storage is down.

    Signed-off-by: Junxiao Bi
    Reviewed-by: Ryan Ding
    Reviewed-by: Mark Fasheh
    Cc: Gang He
    Cc: rwxybh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • This message is sent to the master node when a non-master node's
    negotiate timer expires. The master node records these nodes in a
    bitmap, which is used to decide whether to re-queue the write timeout
    timer.

    Signed-off-by: Junxiao Bi
    Reviewed-by: Ryan Ding
    Reviewed-by: Mark Fasheh
    Cc: Gang He
    Cc: rwxybh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • This series of patches fixes the issue that when storage goes down, all
    nodes fence themselves due to write timeout.

    With this patch set, all nodes will keep going until storage is back
    online, unless one of the following happens, in which case all nodes
    fence themselves as before:

    1. an I/O error is returned
    2. the network between nodes goes down
    3. nodes panic

    This patch (of 6):

    When storage is down, all nodes fence themselves due to write timeout.
    The negotiate timer is designed to avoid this: with it, a node will wait
    until storage is up again.

    The negotiate timer works in the following way:

    1. The timer expires before the write timeout timer; its timeout is
    currently half of the write timeout. It is re-queued along with the
    write timeout timer. If it expires, it sends a NEGO_TIMEOUT message to
    the master node (the node with the lowest node number). This message
    does nothing but mark a bit in a bitmap on the master node that records
    which nodes are negotiating the timeout.

    2. If storage is down, nodes send this message to the master node.
    When the master node finds that its bitmap includes all online nodes, it
    sends a NEGO_APPROVL message to all nodes one by one; this message
    re-queues the write timeout timer and the negotiate timer. Any node that
    doesn't receive this message, or hits an issue when handling it, will
    be fenced. If storage comes up at any time, o2hb_thread will run and
    re-queue all the timers, and nothing will be affected by these two steps.
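
    The master-side bookkeeping in the two steps above can be sketched as
    a bitmap comparison. This is a userspace illustration under stated
    assumptions: struct toy_nego_master and the toy_nego_* functions are
    hypothetical, node numbers are limited to 0..31, and clearing the
    bitmap after approval is an assumption, not something the message
    states.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical master-side state; supports node numbers 0..31. */
struct toy_nego_master {
    uint32_t online;       /* bit n set: node n is online */
    uint32_t negotiating;  /* bit n set: node n reported a negotiate timeout */
};

/* Record a timeout message from one node; the message does nothing else. */
void toy_nego_timeout(struct toy_nego_master *m, unsigned int node)
{
    assert(node < 32);
    m->negotiating |= (uint32_t)1 << node;
}

/* The master approves only once every online node has reported. */
int toy_nego_all_reported(const struct toy_nego_master *m)
{
    return (m->negotiating & m->online) == m->online;
}

/* Assumed follow-up: clear the bitmap once approval has been sent. */
void toy_nego_reset(struct toy_nego_master *m)
{
    m->negotiating = 0;
}
```

    With three online nodes, the check stays false after two of them have
    reported and becomes true when the third one does, which is the point
    at which the master would send the approve message to every node.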

    Signed-off-by: Junxiao Bi
    Reviewed-by: Ryan Ding
    Reviewed-by: Mark Fasheh
    Cc: Gang He
    Cc: rwxybh
    Cc: Joel Becker
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • Pull block fixes from Jens Axboe:
    "A set of fixes that wasn't included in the first merge window pull
    request. This pull request contains:

    - A set of NVMe fixes from Keith, and one from Nic for the integrity
    side of it.

    - Fix from Ming, clearing ->mq_ops if we don't successfully setup a
    queue for multiqueue.

    - A set of stability fixes for bcache from Jiri, and also marking
    bcache as orphaned as it's no longer actively maintained (in
    mainline, at least)"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    blk-mq: clear q->mq_ops if init fail
    MAINTAINERS: mark bcache as orphan
    bcache: bch_gc_thread() is not freezable
    bcache: bch_allocator_thread() is not freezable
    bcache: bch_writeback_thread() is not freezable
    nvme/host: Add missing blk_integrity tag_size + flags assignments
    NVMe: Add device ID's with stripe quirk
    NVMe: Short-cut removal on surprise hot-unplug
    NVMe: Allow user initiated rescan
    NVMe: Reduce driver log spamming
    NVMe: Unbind driver on failure
    NVMe: Delete only created queues
    NVMe: Allocate queues only for online cpus

    Linus Torvalds