26 Jun, 2015

8 commits

  • printk log_buf keeps various metadata for each message including its
    sequence number and timestamp. The metadata is currently available only
    through /dev/kmsg and is stripped out before being passed on to console drivers. We
    want this metadata to be available to console drivers too so that console
    consumers can get full information including the metadata and dictionary,
    which among other things can be used to detect whether messages got lost
    in transit.

    This patch implements support for extended console drivers. Consoles can
    indicate that they want extended messages by setting the new CON_EXTENDED
    flag and they'll be fed messages formatted the same way as /dev/kmsg.

    "<level>,<sequnum>,<timestamp>,<contflag>;<message text>\n"

    If extended consoles exist, in-kernel fragment assembly is disabled. This
    ensures that all messages emitted to consoles have full metadata including
    sequence number. The contflag carries enough information to reassemble
    the fragments from the reader side trivially. Note that this only affects
    /dev/kmsg. Regular console and /proc/kmsg outputs are not affected by
    this change.

    * Extended message formatting for console drivers is enabled iff there
    are registered extended consoles.

    * Comment describing /dev/kmsg message format updated to add missing
    contflag field and help distinguish variable from verbatim terms.
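
    A minimal user-space sketch of how a receiver might consume such records,
    assuming the /dev/kmsg layout of comma-separated header fields (prefix
    number, sequence number, microsecond timestamp, continuation flag)
    terminated by ';'. The names ext_hdr, parse_ext_hdr() and lost_between()
    are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdio.h>

/* Header fields of one extended record: "prio,seq,ts_usec,contflag;text". */
struct ext_hdr {
	unsigned int prio;            /* syslog prefix number */
	unsigned long long seq;       /* message sequence number */
	unsigned long long ts_usec;   /* timestamp in microseconds */
	char contflag;                /* continuation flag character */
};

/* Parse the header of one record; returns 0 on success, -1 if malformed. */
int parse_ext_hdr(const char *rec, struct ext_hdr *h)
{
	assert(rec && h);
	if (sscanf(rec, "%u,%llu,%llu,%c",
		   &h->prio, &h->seq, &h->ts_usec, &h->contflag) != 4)
		return -1;
	return 0;
}

/* Number of records missing between two consecutively received records. */
unsigned long long lost_between(unsigned long long prev_seq,
				unsigned long long cur_seq)
{
	return cur_seq > prev_seq + 1 ? cur_seq - prev_seq - 1 : 0;
}
```

    A gap in the sequence numbers is what lets the reader notice loss in
    transit; the message text itself follows the ';' and is not parsed here.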

    Signed-off-by: Tejun Heo
    Cc: David Miller
    Cc: Kay Sievers
    Reviewed-by: Petr Mladek
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • This patchset updates netconsole so that it can emit messages with the
    same header as used in /dev/kmsg, which gives the netconsole receiver
    full log information and enables things like structured logging and
    detection of lost messages.

    This patch (of 7):

    devkmsg_read() uses 8k buffer and assumes that the formatted output
    message won't overrun which seems safe given LOG_LINE_MAX, the current use
    of dict and the escaping method being used; however, we're planning to use
    devkmsg formatting more widely and accounting for the buffer size properly
    isn't that complicated.

    This patch defines CONSOLE_EXT_LOG_MAX as 8192 and updates devkmsg_read()
    so that it limits output accordingly.
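
    The idea of limiting output to a fixed buffer can be sketched in user
    space as follows. This is not the patch's code: bounded_append() is a
    hypothetical helper built on vsnprintf() that behaves roughly like the
    kernel's scnprintf(), and only the CONSOLE_EXT_LOG_MAX value comes from
    the description above.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

#define CONSOLE_EXT_LOG_MAX 8192

/*
 * Append formatted text at buf + pos without writing past size bytes.
 * Returns the new position, which always stays below size.
 */
size_t bounded_append(char *buf, size_t size, size_t pos, const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(buf && size > 0 && pos < size);
	va_start(ap, fmt);
	n = vsnprintf(buf + pos, size - pos, fmt, ap);
	va_end(ap);

	if (n < 0)
		return pos;              /* formatting error: nothing appended */
	if ((size_t)n >= size - pos)
		return size - 1;         /* output was truncated to fit */
	return pos + (size_t)n;
}
```

    Because the returned position never reaches size, repeated calls can be
    chained without any further length checks at the call sites.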

    Signed-off-by: Tejun Heo
    Cc: David Miller
    Cc: Kay Sievers
    Reviewed-by: Petr Mladek
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • clone has some of the quirkiest syscall handling in the kernel, with a
    pile of special cases, historical curiosities, and architecture-specific
    calling conventions. In particular, clone with CLONE_SETTLS accepts a
    parameter "tls" that the C entry point completely ignores and some
    assembly entry points overwrite; instead, the low-level arch-specific
    code pulls the tls parameter out of the arch-specific register captured
    as part of pt_regs on entry to the kernel. That's a massive hack, and
    it makes the arch-specific code only work when called via the specific
    existing syscall entry points; because of this hack, any new clone-like
    system call would have to accept an identical tls argument in exactly
    the same arch-specific position, rather than providing a unified system
    call entry point across architectures.

    The first patch allows architectures to handle the tls argument via
    normal C parameter passing, if they opt in by selecting
    HAVE_COPY_THREAD_TLS. The second patch makes 32-bit and 64-bit x86 opt
    into this.

    These two patches came out of the clone4 series, which isn't ready for
    this merge window, but these first two cleanup patches were entirely
    uncontroversial and have acks. I'd like to go ahead and submit these
    two so that other architectures can begin building on top of this and
    opting into HAVE_COPY_THREAD_TLS. However, I'm also happy to wait and
    send these through the next merge window (along with v3 of clone4) if
    anyone would prefer that.

    This patch (of 2):

    clone with CLONE_SETTLS accepts an argument to set the thread-local
    storage area for the new thread. sys_clone declares an int argument
    tls_val in the appropriate point in the argument list (based on the
    various CLONE_BACKWARDS variants), but doesn't actually use or pass along
    that argument. Instead, sys_clone calls do_fork, which calls
    copy_process, which calls the arch-specific copy_thread, and copy_thread
    pulls the corresponding syscall argument out of the pt_regs captured at
    kernel entry (knowing what argument of clone that architecture passes tls
    in).

    Apart from being awful and inscrutable, that also only works because only
    one code path into copy_thread can pass the CLONE_SETTLS flag, and that
    code path comes from sys_clone with its architecture-specific
    argument-passing order. This prevents introducing a new version of the
    clone system call without propagating the same architecture-specific
    position of the tls argument.

    However, there's no reason to pull the argument out of pt_regs when
    sys_clone could just pass it down via C function call arguments.

    Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt into,
    and a new copy_thread_tls that accepts the tls parameter as an additional
    unsigned long (syscall-argument-sized) argument. Change sys_clone's tls
    argument to an unsigned long (which does not change the ABI), and pass
    that down to copy_thread_tls.

    Architectures that don't opt into copy_thread_tls will continue to ignore
    the C argument to sys_clone in favor of the pt_regs captured at kernel
    entry, and thus will be unable to introduce new versions of the clone
    syscall.
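
    The difference between the two conventions can be sketched with
    user-space stand-ins. Everything suffixed _sketch is hypothetical, and
    the saved-register array is a simplification of pt_regs; only the
    CLONE_SETTLS flag value is taken from the real headers.

```c
#include <assert.h>

#define CLONE_SETTLS 0x00080000UL

struct pt_regs_sketch { unsigned long args[6]; };  /* saved syscall arguments */
struct task_sketch { unsigned long tls; };

/*
 * Legacy convention: fish tls out of the registers saved at kernel entry.
 * The callee has to know which argument slot this architecture uses.
 */
void copy_thread_sketch(struct task_sketch *t, unsigned long flags,
			const struct pt_regs_sketch *regs, int tls_slot)
{
	assert(t && regs && tls_slot >= 0 && tls_slot < 6);
	if (flags & CLONE_SETTLS)
		t->tls = regs->args[tls_slot];
}

/* Opt-in convention: tls arrives as an ordinary C argument. */
void copy_thread_tls_sketch(struct task_sketch *t, unsigned long flags,
			    unsigned long tls)
{
	assert(t);
	if (flags & CLONE_SETTLS)
		t->tls = tls;
}
```

    With the second form, a new clone-like entry point can place tls
    wherever it likes in its own argument list and still reach the same
    callee.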

    Patch co-authored by Josh Triplett and Thiago Macieira.

    Signed-off-by: Josh Triplett
    Acked-by: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Thiago Macieira
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Triplett
     
  • Commit 3876488444e7 ("include/stddef.h: Move offsetofend() from vfio.h
    to a generic kernel header") added offsetofend outside the normal
    include #ifndef/#endif guard. Move it inside.
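
    For illustration, a guarded definition could look like the sketch below.
    The guard name SKETCH_STDDEF_H and struct sketch_info are made up for
    the example; the macro body is the usual offsetof-plus-sizeof form.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */

#ifndef SKETCH_STDDEF_H
#define SKETCH_STDDEF_H

/* Offset of the first byte past MEMBER within TYPE. */
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

#endif /* SKETCH_STDDEF_H */

struct sketch_info {
	unsigned int argsz;
	unsigned int flags;
};

/* The end of argsz is where flags begins. */
static_assert(offsetofend(struct sketch_info, argsz) == sizeof(unsigned int),
	      "argsz should end where flags begins");
```

    Keeping the macro inside the guard means a second inclusion of the
    header does not see the definition again.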

    Miscellanea:

    o remove unnecessary blank line
    o standardize offsetof macros whitespace style

    Signed-off-by: Joe Perches
    Cc: Denys Vlasenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Cleanup commit 73679e508201 ("compiler-intel.h: Remove duplicate
    definition") removed the double definition of __memory_barrier()
    intrinsics.

    However, in doing so, it also removed the preceding #undef barrier by
    accident, meaning, the actual barrier() macro from compiler-gcc.h with
    inline asm is still in place as __GNUC__ is provided.

    Subsequently, barrier() can never be defined as __memory_barrier() from
    compiler.h since it already has a definition in place and if we trust
    the comment in compiler-intel.h, ecc doesn't support gcc specific asm
    statements.

    I don't have an ecc at hand (unsure if that's still used in the field?)
    and only found this by accident during code review. A revert of that
    cleanup would be the simplest option.

    Fixes: 73679e508201 ("compiler-intel.h: Remove duplicate definition")
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Pranith Kumar
    Cc: Pranith Kumar
    Cc: H. Peter Anvin
    Cc: mancha security
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Borkmann
     
  • As gcc major version numbers are going to advance rather rapidly in the
    future, there's no real value in separate files for each compiler
    version.

    Deduplicate some of the macros #defined in each file too.

    Neaten comments using normal kernel commenting style.

    Signed-off-by: Joe Perches
    Cc: Andi Kleen
    Cc: Michal Marek
    Cc: Segher Boessenkool
    Cc: Sasha Levin
    Cc: Anton Blanchard
    Cc: Alan Modra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • - Move the inline and noinline blocks together

    - Comment neatening

    - Alignment of __attribute__ uses

    - Consistent naming of __must_be_array macro argument

    - Multiline macro neatening

    Signed-off-by: Joe Perches
    Cc: Andi Kleen
    Cc: Michal Marek
    Cc: Segher Boessenkool
    Cc: Sasha Levin
    Cc: Anton Blanchard
    Cc: Alan Modra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Remove zpool_evict() helper function. As zbud is currently the only
    zpool implementation that supports eviction, add zpool and zpool_ops
    references to struct zbud_pool and directly call zpool_ops->evict(zpool,
    handle) on eviction.

    Currently zpool provides the zpool_evict helper which locks the zpool
    list lock and searches through all pools to find the specific one
    matching the caller, and calls the corresponding zpool_ops->evict
    function. However, this is unnecessary, as the zbud pool can simply
    keep a reference to the zpool that created it, as well as the zpool_ops,
    and directly call the zpool_ops->evict function, when it needs to evict
    a page. This avoids a spinlock and list search in zpool for each
    eviction.
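
    A simplified user-space sketch of the direct call, assuming mock
    versions of the structures: zbud_pool_sketch, zbud_evict_sketch() and
    record_evict() are hypothetical names, and the -1 error return is an
    assumption of the example.

```c
#include <assert.h>
#include <stddef.h>

struct zpool { int id; };                  /* mock of the owning pool */

struct zpool_ops {
	int (*evict)(struct zpool *pool, unsigned long handle);
};

/* The backend pool keeps references to its creator and the creator's ops. */
struct zbud_pool_sketch {
	struct zpool *zpool;
	const struct zpool_ops *zpool_ops;
};

/* Evict one handle by calling the owner directly: no lock, no list search. */
int zbud_evict_sketch(struct zbud_pool_sketch *pool, unsigned long handle)
{
	assert(pool);
	if (pool->zpool && pool->zpool_ops && pool->zpool_ops->evict)
		return pool->zpool_ops->evict(pool->zpool, handle);
	return -1;                         /* no eviction handler available */
}

/* Example callback: remembers the last handle it was asked to evict. */
unsigned long last_evicted;

int record_evict(struct zpool *pool, unsigned long handle)
{
	(void)pool;
	last_evicted = handle;
	return 0;
}
```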

    Signed-off-by: Dan Streetman
    Cc: Seth Jennings
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     

25 Jun, 2015

32 commits

  • Merge first patchbomb from Andrew Morton:

    - a few misc things

    - ocfs2 updates

    - kernel/watchdog.c feature work (took ages to get right)

    - most of MM. A few tricky bits are held up and probably won't make 4.2.

    * emailed patches from Andrew Morton: (91 commits)
    mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()
    mm, thp: respect MPOL_PREFERRED policy with non-local node
    tmpfs: truncate prealloc blocks past i_size
    mm/memory hotplug: print the last vmemmap region at the end of hot add memory
    mm/mmap.c: optimization of do_mmap_pgoff function
    mm: kmemleak: optimise kmemleak_lock acquiring during kmemleak_scan
    mm: kmemleak: avoid deadlock on the kmemleak object insertion error path
    mm: kmemleak: do not acquire scan_mutex in kmemleak_do_cleanup()
    mm: kmemleak: fix delete_object_*() race when called on the same memory block
    mm: kmemleak: allow safe memory scanning during kmemleak disabling
    memcg: convert mem_cgroup->under_oom from atomic_t to int
    memcg: remove unused mem_cgroup->oom_wakeups
    frontswap: allow multiple backends
    x86, mirror: x86 enabling - find mirrored memory ranges
    mm/memblock: allocate boot time data structures from mirrored memory
    mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute
    mm: do not ignore mapping_gfp_mask in page cache allocation paths
    mm/cma.c: fix typos in comments
    mm/oom_kill.c: print points as unsigned int
    mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
    ...

    Linus Torvalds
     
  • Pull f2fs updates from Jaegeuk Kim:
    "New features:
    - per-file encryption (e.g., ext4)
    - FALLOC_FL_ZERO_RANGE
    - FALLOC_FL_COLLAPSE_RANGE
    - RENAME_WHITEOUT

    Major enhancement/fixes:
    - recover broken superblocks
    - enhance f2fs_trim_fs with a discard_map
    - fix a race condition on dentry block allocation
    - fix a deadlock during summary operation
    - fix a missing fiemap result

    .. and many minor bug fixes and clean-ups were done"

    * tag 'for-f2fs-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (83 commits)
    f2fs: do not trim preallocated blocks when truncating after i_size
    f2fs crypto: add alloc_bounce_page
    f2fs crypto: fix to handle errors likewise ext4
    f2fs: drop the volatile_write flag only
    f2fs: skip committing valid superblock
    f2fs: setting discard option in parse_options()
    f2fs: fix to return exact trimmed size
    f2fs: support FALLOC_FL_INSERT_RANGE
    f2fs: hide common code in f2fs_replace_block
    f2fs: disable the discard option when device doesn't support
    f2fs crypto: remove alloc_page for bounce_page
    f2fs: fix a deadlock for summary page lock vs. sentry_lock
    f2fs crypto: clean up error handling in f2fs_fname_setup_filename
    f2fs crypto: avoid f2fs_inherit_context for symlink
    f2fs crypto: do not set encryption policy for non-directory by ioctl
    f2fs crypto: allow setting encryption policy once
    f2fs crypto: check context consistent for rename2
    f2fs: avoid duplicated code by reusing f2fs_read_end_io
    f2fs crypto: use per-inode tfm structure
    f2fs: recovering broken superblock during mount
    ...

    Linus Torvalds
     
  • Pull input subsystem updates from Dmitry Torokhov:
    "Thanks to Samuel Thibault input device (keyboard) LEDs are no longer
    hardwired within the input core but use LED subsystem and so allow use
    of different triggers; Hans de Goede did a large update for the ALPS
    touchpad driver; we have new TI drv2665 haptics driver and DA9063
    OnKey driver, and host of other drivers got various fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (55 commits)
    Input: pixcir_i2c_ts - fix receive error
    MAINTAINERS: remove non existent input mt git tree
    Input: improve usage of gpiod API
    tty/vt/keyboard: define LED triggers for VT keyboard lock states
    tty/vt/keyboard: define LED triggers for VT LED states
    Input: export LEDs as class devices in sysfs
    Input: cyttsp4 - use swap() in cyttsp4_get_touch()
    Input: goodix - do not explicitly set evbits in input device
    Input: goodix - export id and version read from device
    Input: goodix - fix variable length array warning
    Input: goodix - fix alignment issues
    Input: add OnKey driver for DA9063 MFD part
    Input: elan_i2c - add product IDs FW names
    Input: elan_i2c - add support for multi IC type and iap format
    Input: focaltech - report finger width to userspace
    tty: remove platform_sysrq_reset_seq
    Input: synaptics_i2c - use proper boolean values
    Input: psmouse - use true instead of 1 for boolean values
    Input: cyapa - fix a few typos in comments
    Input: stmpe-ts - enforce device tree only mode
    ...

    Linus Torvalds
     
  • Pull pin control updates from Linus Walleij:
    "Here is the bulk of pin control changes for the v4.2 series: Quite a
    lot of new SoC subdrivers and two new main drivers this time, apart
    from that business as usual.

    Details:

    Core functionality:
    - Enable exclusive pin ownership: it is possible to flag a pin
    controller so that GPIO and other functions cannot use a single pin
    simultaneously.

    New drivers:
    - NXP LPC18xx System Control Unit pin controller
    - Imagination Pistachio SoC pin controller

    New subdrivers:
    - Freescale i.MX7d SoC
    - Intel Sunrisepoint-H PCH
    - Renesas PFC R8A7793
    - Renesas PFC R8A7794
    - Mediatek MT6397, MT8127
    - SiRF Atlas 7
    - Allwinner A33
    - Qualcomm MSM8660
    - Marvell Armada 395
    - Rockchip RK3368

    Cleanups:
    - A big cleanup of the Marvell MVEBU driver rectifying it to
    correspond to reality
    - Drop platform device probing from the SH PFC driver, we are now a
    DT only shop for SuperH
    - Drop obsolete multi-platform check for SH PFC
    - Various janitorial: constification, grammar etc

    Improvements:
    - The AT91 GPIO portions now support the set_multiple() feature
    - Split out SPI pins on the Xilinx Zynq
    - Support DTs without specific function nodes in the i.MX driver"

    * tag 'pinctrl-v4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (99 commits)
    pinctrl: rockchip: add support for the rk3368
    pinctrl: rockchip: generalize perpin driver-strength setting
    pinctrl: sh-pfc: r8a7794: add SDHI pin groups
    pinctrl: sh-pfc: r8a7794: add MMCIF pin groups
    pinctrl: sh-pfc: add R8A7794 PFC support
    pinctrl: make pinctrl_register() return proper error code
    pinctrl: mvebu: armada-39x: add support for Armada 395 variant
    pinctrl: mvebu: armada-39x: add missing SATA functions
    pinctrl: mvebu: armada-39x: add missing PCIe functions
    pinctrl: mvebu: armada-38x: add ptp functions
    pinctrl: mvebu: armada-38x: add ua1 functions
    pinctrl: mvebu: armada-38x: add nand functions
    pinctrl: mvebu: armada-38x: add sata functions
    pinctrl: mvebu: armada-xp: add dram functions
    pinctrl: mvebu: armada-xp: add nand rb function
    pinctrl: mvebu: armada-xp: add spi1 function
    pinctrl: mvebu: armada-39x: normalize ref clock naming
    pinctrl: mvebu: armada-xp: rename spi to spi0
    pinctrl: mvebu: armada-370: align spi1 clock pin naming
    pinctrl: mvebu: armada-370: align VDD cpu-pd pin naming with datasheet
    ...

    Linus Torvalds
     
  • Pull backlight updates from Lee Jones:
    "Changes to existing drivers:

    - supply MODULE_DEVICE_TABLE() to ensure probing
    - constify struct; da9052_bl
    - enable compile test; lcd_l4f00242t03, lcd_lms283fg05, backlight_gpio
    - suspend/resume bugfix; lp855x_bl
    - devm_gpiod_get_optional() API fixup; pwm_bl
    - error handling fixup; backlight"

    * tag 'backlight-for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
    backlight: Change the return type of backlight_update_status() to int
    backlight: pwm_bl: Simplify usage of devm_gpiod_get_optional
    backlight: lp855x: Don't clear level on suspend/blank
    backlight: Allow compile test of GPIO consumers if !GPIOLIB
    video: backlight: da9052: Constify platform_device_id
    gpio-backlight: Discover driver during boot time

    Linus Torvalds
     
  • Beginning at commit d52d3997f843 ("ipv6: Create percpu rt6_info"), the
    following INFO splat is logged:

    ===============================
    [ INFO: suspicious RCU usage. ]
    4.1.0-rc7-next-20150612 #1 Not tainted
    -------------------------------
    kernel/sched/core.c:7318 Illegal context switch in RCU-bh read-side critical section!
    other info that might help us debug this:
    rcu_scheduler_active = 1, debug_locks = 0
    3 locks held by systemd/1:
    #0: (rtnl_mutex){+.+.+.}, at: [] rtnetlink_rcv+0x1f/0x40
    #1: (rcu_read_lock_bh){......}, at: [] ipv6_add_addr+0x62/0x540
    #2: (addrconf_hash_lock){+...+.}, at: [] ipv6_add_addr+0x184/0x540
    stack backtrace:
    CPU: 0 PID: 1 Comm: systemd Not tainted 4.1.0-rc7-next-20150612 #1
    Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20 04/17/2014
    Call Trace:
    dump_stack+0x4c/0x6e
    lockdep_rcu_suspicious+0xe7/0x120
    ___might_sleep+0x1d5/0x1f0
    __might_sleep+0x4d/0x90
    kmem_cache_alloc+0x47/0x250
    create_object+0x39/0x2e0
    kmemleak_alloc_percpu+0x61/0xe0
    pcpu_alloc+0x370/0x630

    Additional backtrace lines are truncated. In addition, the above splat
    is followed by several "BUG: sleeping function called from invalid
    context at mm/slub.c:1268" outputs. As suggested by Martin KaFai Lau,
    these are the clue to the fix. Routine kmemleak_alloc_percpu() always
    uses GFP_KERNEL for its allocations, whereas it should follow the gfp
    from its callers.

    Reviewed-by: Catalin Marinas
    Reviewed-by: Kamalesh Babulal
    Acked-by: Martin KaFai Lau
    Signed-off-by: Larry Finger
    Cc: Martin KaFai Lau
    Cc: Catalin Marinas
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: [3.18+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Larry Finger
     
  • Change frontswap single pointer to a singly linked list of frontswap
    implementations. Update Xen tmem implementation as register no longer
    returns anything.

    Frontswap only keeps track of a single implementation; any
    implementation that registers second (or later) will replace the
    previously registered implementation, and gets a pointer to the previous
    implementation that the new implementation is expected to pass all
    frontswap functions to if it can't handle the function itself. However
    that method doesn't really make much sense, as passing that work on to
    every implementation adds unnecessary work to implementations; instead,
    frontswap should simply keep a list of all registered implementations
    and try each implementation for any function. Most importantly, neither
    of the two currently existing frontswap implementations in the kernel
    actually do anything with any previous frontswap implementation that
    they replace when registering.

    This allows frontswap to successfully manage multiple implementations by
    keeping a list of them all.
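
    The list-based dispatch can be sketched in user space like this. The
    names are hypothetical, only a single store hook is modelled, and the
    sketch ignores the concurrency that real registration has to handle.

```c
#include <assert.h>
#include <stddef.h>

/* One registered backend; store() returns 0 when it accepted the page. */
struct fs_ops_sketch {
	int (*store)(unsigned int type, unsigned long offset);
	struct fs_ops_sketch *next;
};

static struct fs_ops_sketch *fs_ops_list;  /* head of the singly linked list */

/* Registration pushes onto the list and returns nothing. */
void fs_register_sketch(struct fs_ops_sketch *ops)
{
	assert(ops && ops->store);
	ops->next = fs_ops_list;
	fs_ops_list = ops;
}

/* Try each backend in turn until one accepts the page. */
int fs_store_sketch(unsigned int type, unsigned long offset)
{
	struct fs_ops_sketch *ops;

	for (ops = fs_ops_list; ops; ops = ops->next)
		if (ops->store(type, offset) == 0)
			return 0;
	return -1;                         /* nobody took it */
}

/* Two toy backends for demonstration. */
int reject_all(unsigned int type, unsigned long offset)
{
	(void)type; (void)offset;
	return -1;
}

int accept_even(unsigned int type, unsigned long offset)
{
	(void)type;
	return (offset % 2 == 0) ? 0 : -1;
}
```

    No backend needs to know about any other one; the core walks the list
    on its behalf.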

    Signed-off-by: Dan Streetman
    Cc: Konrad Rzeszutek Wilk
    Cc: Boris Ostrovsky
    Cc: David Vrabel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • UEFI GetMemoryMap() uses a new attribute bit to mark mirrored memory
    address ranges. See UEFI 2.5 spec pages 157-158:

    http://www.uefi.org/sites/default/files/resources/UEFI%202_5.pdf

    On EFI enabled systems scan the memory map and tell memblock about any
    mirrored ranges.

    Signed-off-by: Tony Luck
    Cc: Xishi Qiu
    Cc: Hanjun Guo
    Cc: Xiexiuqi
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Yinghai Lu
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Luck
     
  • Try to allocate all boot time kernel data structures from mirrored
    memory.

    If we run out of mirrored memory print warnings, but fall back to using
    non-mirrored memory to make sure that we still boot.

    By number of bytes, most of what we allocate at boot time is the page
    structures. 64 bytes per 4K page on x86_64 ... or about 1.5% of total
    system memory. For workloads where the bulk of memory is allocated to
    applications this may represent a useful improvement to system
    availability since 1.5% of total memory might be a third of the memory
    allocated to the kernel.
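
    The 1.5% figure follows from 64 / 4096 = 1.5625%. As a small worked
    example (page_struct_bytes() is a hypothetical helper, not kernel code):

```c
#include <assert.h>

/* Bytes of per-page metadata needed to describe total_bytes of memory. */
unsigned long long page_struct_bytes(unsigned long long total_bytes,
				     unsigned long long page_size,
				     unsigned long long struct_size)
{
	assert(page_size > 0);
	return (total_bytes / page_size) * struct_size;
}
```

    For 1 TiB of memory with 4K pages and 64-byte page structures this
    gives 16 GiB, i.e. one sixty-fourth of the total.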

    Signed-off-by: Tony Luck
    Cc: Xishi Qiu
    Cc: Hanjun Guo
    Cc: Xiexiuqi
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Yinghai Lu
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Luck
     
  • Some high end Intel Xeon systems report uncorrectable memory errors as a
    recoverable machine check. Linux has included code for some time to
    process these and just signal the affected processes (or even recover
    completely if the error was in a read only page that can be replaced by
    reading from disk).

    But we have no recovery path for errors encountered during kernel code
    execution. Except for some very specific cases, we are unlikely to ever
    be able to recover.

    Enter memory mirroring. Actually, the 3rd generation of memory mirroring.

    Gen1: All memory is mirrored
    Pro: No s/w enabling - h/w just gets good data from other side of the
    mirror
    Con: Halves effective memory capacity available to OS/applications

    Gen2: Partial memory mirror - just mirror memory behind some memory controllers
    Pro: Keep more of the capacity
    Con: Nightmare to enable. Have to choose between allocating from
    mirrored memory for safety vs. NUMA local memory for performance

    Gen3: Address range partial memory mirror - some mirror on each memory
    controller
    Pro: Can tune the amount of mirror and keep NUMA performance
    Con: I have to write memory management code to implement

    The current plan is just to use mirrored memory for kernel allocations.
    This has been broken into two phases:

    1) This patch series - find the mirrored memory, use it for boot time
    allocations

    2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the
    unused mirrored memory from mm/memblock.c and only give it out to
    select kernel allocations (this is still being scoped because
    page_alloc.c is scary).

    This patch (of 3):

    Add extra "flags" to memblock to allow selection of memory based on
    attribute. No functional changes.
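
    Selecting memory by attribute can be sketched with a mock region table.
    The structure, the flag values and both function names are assumptions
    made for the example; the prefer-mirror-then-fall-back policy mirrors
    what the series describes for boot time allocations.

```c
#include <assert.h>
#include <stddef.h>

enum { REGION_NONE = 0x0, REGION_MIRROR = 0x2 };   /* hypothetical flag values */

struct region_sketch {
	unsigned long long base;
	unsigned long long size;
	unsigned long flags;
};

/* First region of at least `size` bytes whose flags include `want`, or NULL. */
const struct region_sketch *find_region(const struct region_sketch *r, size_t n,
					unsigned long long size,
					unsigned long want)
{
	size_t i;

	assert(r || n == 0);
	for (i = 0; i < n; i++)
		if ((r[i].flags & want) == want && r[i].size >= size)
			return &r[i];
	return NULL;
}

/* Prefer mirrored memory, but fall back to any region so allocation succeeds. */
const struct region_sketch *find_prefer_mirror(const struct region_sketch *r,
					       size_t n, unsigned long long size)
{
	const struct region_sketch *hit = find_region(r, n, size, REGION_MIRROR);

	return hit ? hit : find_region(r, n, size, REGION_NONE);
}
```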

    Signed-off-by: Tony Luck
    Cc: Xishi Qiu
    Cc: Hanjun Guo
    Cc: Xiexiuqi
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Yinghai Lu
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Luck
     
  • We have confusing functions to clear pmd, pmd_clear_* and pmd_clear. Add
    _huge_ to pmdp_clear functions so that we are clear that they operate on
    hugepage pte.

    We don't bother about other functions like pmdp_set_wrprotect,
    pmdp_clear_flush_young, because they operate on PTE bits and hence
    indicate they are operating on hugepage ptes.

    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Andrea Arcangeli
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Also move the pmd_trans_huge check to generic code.

    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Andrea Arcangeli
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Architectures like ppc64 [1] need to do special things while clearing pmd
    before a collapse. For them this operation is largely different from a
    normal hugepage pte clear. Hence add a separate function to clear pmd
    before collapse. After this patch pmdp_* functions operate only on
    hugepage pte, and not on regular pmd_t values pointing to page table.

    [1] ppc64 needs to invalidate all the normal page pte mappings we already
    have inserted in the hardware hash page table. But before doing that we
    need to make sure there are no parallel hash page table insert going on.
    So we need to do a kick_all_cpus_sync() before flushing the older hash
    table entries. By moving this to a separate function we capture these
    details and mention how it is different from a hugepage pte clear.

    This patch is a cleanup and only does code movement for clarity. There
    should not be any change in functionality.

    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Andrea Arcangeli
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • RAS user space tools like rasdaemon, which are based on trace events, can
    receive the mce error event but no memory recovery result event. So I want
    to add this event to make this scenario complete.

    This patch adds an event to the ras group for memory-failure.

    The output looks like this:
    # tracer: nop
    #
    # entries-in-buffer/entries-written: 2/2 #P:24
    #
    # _-----=> irqs-off
    # / _----=> need-resched
    # | / _---=> hardirq/softirq
    # || / _--=> preempt-depth
    # ||| / delay
    # TASK-PID CPU# |||| TIMESTAMP FUNCTION
    # | | | |||| | |
    mce-inject-13150 [001] .... 277.019359: memory_failure_event: pfn 0x19869: recovery action for free buddy page: Delayed

    [xiexiuqi@huawei.com: fix build error]
    Signed-off-by: Xie XiuQi
    Reviewed-by: Naoya Horiguchi
    Acked-by: Steven Rostedt
    Cc: Tony Luck
    Cc: Chen Gong
    Cc: Jim Davis
    Signed-off-by: Xie XiuQi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     
  • Change the type of action_result's param 3 to enum for type consistency,
    and rename mf_outcome to mf_result for clarity.

    Signed-off-by: Xie XiuQi
    Acked-by: Naoya Horiguchi
    Cc: Chen Gong
    Cc: Jim Davis
    Cc: Steven Rostedt
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     
  • Export 'outcome' and 'action_page_type' to mm.h, so we can use
    these enums outside.

    This patch is preparation for adding trace events for memory-failure
    recovery action.

    Signed-off-by: Xie XiuQi
    Acked-by: Naoya Horiguchi
    Cc: Chen Gong
    Cc: Jim Davis
    Cc: Steven Rostedt
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     
  • The zonelist locking and the oom_sem are two overlapping locks that are
    used to serialize global OOM killing against different things.

    The historical zonelist locking serializes OOM kills from allocations with
    overlapping zonelists against each other to prevent killing more tasks
    than necessary in the same memory domain. Only when neither tasklists nor
    zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
    bound to separate nodes) are OOM kills allowed to execute in parallel.

    The younger oom_sem is a read-write lock to serialize OOM killing against
    the PM code trying to disable the OOM killer altogether.

    However, the OOM killer is a fairly cold error path, there is really no
    reason to optimize for highly performant and concurrent OOM kills. And
    the oom_sem is just flat-out redundant.

    Replace both locking schemes with a single global mutex serializing OOM
    kills regardless of context.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking
    are related in functionality, but the interface is not symmetrical at
    all: one is an internal OOM killer function used during the killing, the
    other is for an OOM victim to signal its own death on exit later on.
    This has locking implications, see follow-up changes.

    While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
    is easier on the eye.

    Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • memory_failure() can run in 2 different modes (specified by
    MF_COUNT_INCREASED) from the page refcount perspective. When
    MF_COUNT_INCREASED is set, memory_failure() assumes that the caller
    takes a refcount of the target page. If it is cleared, memory_failure()
    takes the refcount on its own.

    In current code, however, refcounting is done differently in each caller.
    For example, madvise_hwpoison() uses get_user_pages_fast() and
    hwpoison_inject() uses get_page_unless_zero(). This inconsistent
    refcounting causes refcount failures, especially for thp tail pages.
    Typical user-visible effects are memory leaks or
    VM_BUG_ON_PAGE(!page_count(page)) in isolate_lru_page().

    To fix this refcounting issue, this patch introduces get_hwpoison_page()
    to handle thp tail pages in the same manner for each caller of hwpoison
    code.

    memory_failure() might fail to split thp, in which case it returns
    without completing page isolation. This is not good because PageHWPoison
    on the thp is still set and there's no easy way to unpoison such thps. So
    this patch tries to roll back any action on the thp in the "non anonymous
    thp" case and the "thp split failed" case, expecting that an MCE(SRAR)
    generated by a later access will properly free such thps.

    [akpm@linux-foundation.org: fix CONFIG_HWPOISON_INJECT=m]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Reintroduce 8d63d99a5dfb ("mm: avoid tail page refcounting on non-THP
    compound pages") after removing bogus VM_BUG_ON_PAGE() in
    put_unrefcounted_compound_page().

    THP uses tail page refcounting to be able to split huge pages at any time.
    Tail page refcounting is not needed for other users of compound pages and
    it's harmful because of overhead.

    We try to exclude non-THP pages from tail page refcounting using
    __compound_tail_refcounted() check. It excludes most common non-THP
    compound pages: SL*B and hugetlb, but it doesn't catch rest of __GFP_COMP
    users -- drivers.

    And it's not only about overhead.

    Drivers might want to use compound pages to get refcounting semantics
    suitable for mapping high-order pages to userspace. But tail page
    refcounting breaks it.

    Tail page refcounting uses ->_mapcount in tail pages to store GUP pins on
    them. It means GUP pins would affect page_mapcount() for tail pages.
    It's not a problem for THP, because it never maps tail pages. But unlike
    THP, drivers map parts of compound pages with PTEs and it makes
    page_mapcount() be called for tail pages.

    In particular, GUP pins would shift PSS up and affect /proc/kpagecount for
    such pages. But I'm not aware of anything which can lead to a crash or
    other serious misbehaviour.

    Since currently all THP pages are anonymous and all driver pages are not,
    we can fix the __compound_tail_refcounted() check by requiring PageAnon()
    to enable tail page refcounting.
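
    The effect of the tightened check can be modelled in userspace. This is
    only a sketch: struct page_model and both helper names are hypothetical
    stand-ins, not the kernel's struct page or the real
    __compound_tail_refcounted() body.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few page properties the check looks at. */
struct page_model {
    bool slab;     /* page belongs to SL*B */
    bool hugetlb;  /* page belongs to hugetlb */
    bool anon;     /* page is anonymous (all THP pages are, per the text) */
};

/* Model of the check before the change: only SL*B and hugetlb are excluded. */
static bool tail_refcounted_before(const struct page_model *head)
{
    assert(head);
    return !head->slab && !head->hugetlb;
}

/* Model of the check after the change: anonymous pages only. */
static bool tail_refcounted_after(const struct page_model *head)
{
    assert(head);
    return !head->slab && !head->hugetlb && head->anon;
}
```

    In this model a driver's __GFP_COMP page (not slab, not hugetlb, not
    anonymous) passes the old check but fails the new one, while an
    anonymous THP page passes both.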

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Hugh Dickins
    Reviewed-by: Andrea Arcangeli
    Reported-by: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • For !CONFIG_NUMA, hashdist will always be 0, since its setter is
    otherwise compiled out. So we can save 4 bytes of data and some .text
    (although mostly in __init functions) by only defining it for
    CONFIG_NUMA.
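
    A rough sketch of the idea, assuming the non-NUMA case is handled by
    turning hashdist into a constant macro. The enum and
    choose_table_alloc() are hypothetical names used only for illustration.

```c
#include <assert.h>

#ifdef CONFIG_NUMA
/* With NUMA the variable is real and could be changed at boot. */
static int hashdist = 0;
#else
/* Without NUMA nothing can set it, so make it a constant with no storage. */
#define hashdist (0)
#endif

/* Hypothetical allocation strategies for a large hash table. */
enum table_alloc_model { TABLE_ALLOC_CONTIGUOUS, TABLE_ALLOC_DISTRIBUTED };

/* With the macro form, the compiler can drop the "distributed" branch. */
static enum table_alloc_model choose_table_alloc(unsigned long nr_entries)
{
    assert(nr_entries > 0);
    return hashdist ? TABLE_ALLOC_DISTRIBUTED : TABLE_ALLOC_CONTIGUOUS;
}
```

    Callers keep testing hashdist as before; only the definition changes,
    which is where the data and .text savings would come from.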

    Signed-off-by: Rasmus Villemoes
    Acked-by: David Rientjes
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Some architectures would like to be triggered when a memory area is moved
    through the mremap system call.

    This patch introduces a new arch_remap() mm hook which is placed in the
    path of mremap, and is called before the old area is unmapped (and the
    arch_unmap() hook is called).

    Signed-off-by: Laurent Dufour
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • CRIU recreates the process memory layout by remapping the checkpointee
    memory area on top of the current process (criu). This includes remapping
    the vDSO to the place it had at checkpoint time.

    However, some architectures like powerpc keep a reference to the vDSO
    base address to build the signal return stack frame by calling the vDSO
    sigreturn service. So once the vDSO has been moved, this reference is no
    longer valid and signal frames built later are not usable.

    This patch series introduces a new mm hook framework, and a new
    arch_remap hook which is called when mremap is done and the mm lock is
    still held. The next patch adds the vDSO remap and unmap tracking to the
    powerpc architecture.

    This patch (of 3):

    This patch introduces a new set of header files to manage mm hooks:
    - a per-architecture empty header file (arch/x/include/asm/mm-arch-hooks.h)
    - a generic header (include/linux/mm-arch-hooks.h)

    An architecture which needs to override a hook has to redefine it in its
    header file, while an architecture which doesn't has nothing to do.

    The default hooks are defined in the generic header and are used when
    the architecture does not define them.

    In a next step, mm hooks defined in include/asm-generic/mm_hooks.h should
    be moved here.
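
    The override mechanism described above can be sketched in a single file.
    This is a userspace model under stated assumptions: struct mm_model and
    its vdso_base field are hypothetical, and the "#define arch_remap
    arch_remap" guard is one common way to let a generic header detect an
    override, not necessarily the exact form used by this series.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-process mm state an arch might track. */
struct mm_model {
    unsigned long vdso_base;
};

/* --- what an architecture's mm-arch-hooks.h might contain --- */
static inline void arch_remap(struct mm_model *mm,
                              unsigned long old_start, unsigned long old_end,
                              unsigned long new_start, unsigned long new_end)
{
    assert(mm);
    (void)old_end;
    (void)new_end;
    /* If the moved area is the vDSO, follow it to its new address. */
    if (mm->vdso_base == old_start)
        mm->vdso_base = new_start;
}
#define arch_remap arch_remap

/* --- what the generic header might contain --- */
#ifndef arch_remap
/* Default hook: used only when the architecture did not define its own. */
static inline void arch_remap(struct mm_model *mm,
                              unsigned long old_start, unsigned long old_end,
                              unsigned long new_start, unsigned long new_end)
{
}
#define arch_remap arch_remap
#endif
```

    An architecture that leaves its header empty simply gets the no-op
    default, so callers in the mremap path can invoke the hook
    unconditionally.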

    Signed-off-by: Laurent Dufour
    Suggested-by: Andrew Morton
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     
  • The first is a keyboard-off-by-one, the other two the ordinary mathy kind.

    Signed-off-by: Rasmus Villemoes
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • The slub_debug=PU,kmalloc-xx option cannot work because in
    create_kmalloc_caches() the s->name is created after
    create_kmalloc_cache() is called. The name is NULL in
    create_kmalloc_cache(), so kmem_cache_flags() does not set the
    slub_debug flags in s->flags. The fix here sets up a kmalloc_names
    string array for initialization purposes and deletes the dynamic name
    creation of kmalloc_caches.
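
    Why the ordering matters can be shown with a small model. Everything
    here is illustrative: kmalloc_names_model and debug_flags_apply() are
    hypothetical names, and the prefix comparison is an assumption about how
    a cache name is matched against the slub_debug filter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Names known at build time, so they exist before any cache is created. */
static const char *const kmalloc_names_model[] = {
    "kmalloc-8", "kmalloc-16", "kmalloc-32", "kmalloc-64",
};

/* Returns true when debug flags requested for `filter` apply to `name`. */
static bool debug_flags_apply(const char *name, const char *filter)
{
    assert(filter != NULL);
    if (name == NULL)
        return false; /* pre-fix situation: the cache has no name yet */
    return strncmp(name, filter, strlen(filter)) == 0;
}
```

    With a NULL name the filter can never match, which mirrors the failure
    described above; taking the name from a static table at creation time
    lets the match succeed.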

    [akpm@linux-foundation.org: s/kmalloc_names/kmalloc_info/, tweak comment text]
    Signed-off-by: Gavin Guo
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Guo
     
  • Change the default behavior of watchdog so it only runs on the
    housekeeping cores when nohz_full is enabled at build and boot time.
    Allow modifying the set of cores the watchdog is currently running on
    with a new kernel.watchdog_cpumask sysctl.

    In the current system, the watchdog subsystem runs a periodic timer that
    schedules the watchdog kthread to run. However, nohz_full cores are
    designed to allow userspace application code running on those cores to
    have 100% access to the CPU. So the watchdog system prevents the
    nohz_full application code from being able to run the way it wants to,
    thus the motivation to suppress the watchdog on nohz_full cores, which
    this patchset provides by default.

    However, if we disable the watchdog globally, then the housekeeping
    cores can't benefit from the watchdog functionality. So we allow
    disabling it only on some cores. See Documentation/lockup-watchdogs.txt
    for more information.

    [jhubbard@nvidia.com: fix a watchdog crash in some configurations]
    Signed-off-by: Chris Metcalf
    Acked-by: Don Zickus
    Cc: Ingo Molnar
    Cc: Ulrich Obergfell
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Frederic Weisbecker
    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     
  • This patch series allows the watchdog to run by default only on the
    housekeeping cores when nohz_full is in effect; this seems to be a good
    compromise short of turning it off completely (since the nohz_full cores
    can't tolerate a watchdog).

    To provide customizability, we add /proc/sys/kernel/watchdog_cpumask so
    that the set of cores running the watchdog can be tuned to different
    values after bootup.

    To implement this customizability, we add a new
    smpboot_update_cpumask_percpu_thread() API to the smpboot_thread
    subsystem that lets us park or unpark "unwanted" threads.

    And now that threads can be parked for long periods of time, we tweak the
    /proc/<pid>/stat and /proc/<pid>/status code so parked threads aren't
    reported as running, which is otherwise confusing.

    This patch (of 3):

    This change allows some cores to be excluded from running the
    smp_hotplug_thread tasks. The following commit to update
    kernel/watchdog.c to use this functionality is the motivating example, and
    more information on the motivation is provided there.

    A new smp_hotplug_thread field is introduced, "cpumask", which is a
    cpumask field managed by the smpboot subsystem that indicates whether or
    not the given smp_hotplug_thread should run on a given core; the cpumask
    is checked when deciding whether to unpark the thread.

    To limit the cpumask to less than cpu_possible, you must call
    smpboot_update_cpumask_percpu_thread() after registering.
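
    The register-then-restrict flow can be modelled as follows. This is a
    userspace sketch only: the struct, MODEL_NR_CPUS and both functions are
    hypothetical, a plain bitmask stands in for a kernel cpumask, and no
    real threads are parked.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 8 /* hypothetical number of possible CPUs */

/* Hypothetical model of a per-CPU hotplug thread set. */
struct hotplug_thread_model {
    unsigned int cpumask;        /* bit n set: thread should run on CPU n */
    bool parked[MODEL_NR_CPUS];  /* true: the thread on that CPU is parked */
};

/* Registration: the mask starts out covering every possible CPU. */
static void model_register(struct hotplug_thread_model *t)
{
    assert(t);
    t->cpumask = (1u << MODEL_NR_CPUS) - 1;
    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        t->parked[cpu] = false;
}

/* Update: park threads on CPUs leaving the mask, unpark those entering. */
static void model_update_cpumask(struct hotplug_thread_model *t,
                                 unsigned int new_mask)
{
    assert(t);
    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        t->parked[cpu] = !(new_mask & (1u << cpu));
    t->cpumask = new_mask;
}
```

    For example, registering and then updating the mask to 0x3 leaves the
    threads on CPUs 0 and 1 running and parks the rest, which is the shape
    of what the watchdog needs for housekeeping cores.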

    Signed-off-by: Chris Metcalf
    Cc: Don Zickus
    Cc: Ingo Molnar
    Cc: Ulrich Obergfell
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     
  • config_item_init() is only used in item.c

    Signed-off-by: Fabian Frederick
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • should_send_event is no longer part of struct fsnotify_ops, so remove it.

    Signed-off-by: Nikolay Borisov
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikolay Borisov
     
  • Pull networking updates from David Miller:

    1) Add TX fast path in mac80211, from Johannes Berg.

    2) Add TSO/GRO support to ibmveth, from Thomas Falcon.

    3) Move away from cached routes in ipv6, just like ipv4, from Martin
    KaFai Lau.

    4) Lots of new rhashtable tests, from Thomas Graf.

    5) Run ingress qdisc lockless, from Alexei Starovoitov.

    6) Allow servers to fetch TCP packet headers for SYN packets of new
    connections, for fingerprinting. From Eric Dumazet.

    7) Add mode parameter to pktgen, for testing receive. From Alexei
    Starovoitov.

    8) Cache access optimizations via simplifications of build_skb(), from
    Alexander Duyck.

    9) Move page frag allocator under mm/, also from Alexander.

    10) Add xmit_more support to hv_netvsc, from KY Srinivasan.

    11) Add a counter guard in case we try to perform endless reclassify
    loops in the packet scheduler.

    12) Extend flow dissector to be programmable and use it in new "Flower"
    classifier. From Jiri Pirko.

    13) AF_PACKET fanout rollover fixes, performance improvements, and new
    statistics. From Willem de Bruijn.

    14) Add netdev driver for GENEVE tunnels, from John W Linville.

    15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso.

    16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet.

    17) Add an ECN retry fallback for the initial TCP handshake, from Daniel
    Borkmann.

    18) Add tail call support to BPF, from Alexei Starovoitov.

    19) Add several pktgen helper scripts, from Jesper Dangaard Brouer.

    20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa.

    21) Favor even port numbers for allocation to connect() requests, and
    odd port numbers for bind(0), in an effort to help avoid
    ip_local_port_range exhaustion. From Eric Dumazet.

    22) Add Cavium ThunderX driver, from Sunil Goutham.

    23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata,
    from Alexei Starovoitov.

    24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai.

    25) Double TCP Small Queues default to 256K to accommodate situations
    like the XEN driver and wireless aggregation. From Wei Liu.

    26) Add more entropy inputs to flow dissector, from Tom Herbert.

    27) Add CDG congestion control algorithm to TCP, from Kenneth Klette
    Jonassen.

    28) Convert ipset over to RCU locking, from Jozsef Kadlecsik.

    29) Track and act upon link status of ipv4 route nexthops, from Andy
    Gospodarek.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits)
    bridge: vlan: flush the dynamically learned entries on port vlan delete
    bridge: multicast: add a comment to br_port_state_selection about blocking state
    net: inet_diag: export IPV6_V6ONLY sockopt
    stmmac: troubleshoot unexpected bits in des0 & des1
    net: ipv4 sysctl option to ignore routes when nexthop link is down
    net: track link-status of ipv4 nexthops
    net: switchdev: ignore unsupported bridge flags
    net: Cavium: Fix MAC address setting in shutdown state
    drivers: net: xgene: fix for ACPI support without ACPI
    ip: report the original address of ICMP messages
    net/mlx5e: Prefetch skb data on RX
    net/mlx5e: Pop cq outside mlx5e_get_cqe
    net/mlx5e: Remove mlx5e_cq.sqrq back-pointer
    net/mlx5e: Remove extra spaces
    net/mlx5e: Avoid TX CQE generation if more xmit packets expected
    net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion
    net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq()
    net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them
    net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues
    net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Thomas Gleixner:
    "This series of scheduler updates depends on sched/core and timers/core
    branches, which are already in your tree:

    - Scheduler balancing overhaul to plug a hard to trigger race which
    causes an oops in the balancer (Peter Zijlstra)

    - Lockdep updates which are related to the balancing updates (Peter
    Zijlstra)"

    * 'sched-hrtimers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched,lockdep: Employ lock pinning
    lockdep: Implement lock pinning
    lockdep: Simplify lock_release()
    sched: Streamline the task migration locking a little
    sched: Move code around
    sched,dl: Fix sched class hopping CBS hole
    sched, dl: Convert switched_{from, to}_dl() / prio_changed_dl() to balance callbacks
    sched,dl: Remove return value from pull_dl_task()
    sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks
    sched,rt: Remove return value from pull_rt_task()
    sched: Allow balance callbacks for check_class_changed()
    sched: Use replace normalize_task() with __sched_setscheduler()
    sched: Replace post_schedule with a balance callback list

    Linus Torvalds
     
  • Pull arm64 updates from Catalin Marinas:
    "Mostly refactoring/clean-up:

    - CPU ops and PSCI (Power State Coordination Interface) refactoring
    following the merging of the arm64 ACPI support, together with
    handling of Trusted (secure) OS instances

    - Using fixmap for permanent FDT mapping, removing the initial dtb
    placement requirements (within 512MB from the start of the kernel
    image). This required moving the FDT self reservation out of the
    memreserve processing

    - Idmap (1:1 mapping used for MMU on/off) handling clean-up

    - Removing flush_cache_all() - not safe on ARM unless the MMU is off.
    Last stages of CPU power down/up are handled by firmware already

    - "Alternatives" (run-time code patching) refactoring and support for
    immediate branch patching, GICv3 CPU interface access

    - User faults handling clean-up

    And some fixes:

    - Fix for VDSO building with broken ELF toolchains

    - Fix another case of init_mm.pgd usage for user mappings (during
    ASID roll-over broadcasting)

    - Fix for FPSIMD reloading after CPU hotplug

    - Fix for missing syscall trace exit

    - Workaround for .inst asm bug

    - Compat fix for switching the user tls tpidr_el0 register"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (42 commits)
    arm64: use private ratelimit state along with show_unhandled_signals
    arm64: show unhandled SP/PC alignment faults
    arm64: vdso: work-around broken ELF toolchains in Makefile
    arm64: kernel: rename __cpu_suspend to keep it aligned with arm
    arm64: compat: print compat_sp instead of sp
    arm64: mm: Fix freeing of the wrong memmap entries with !SPARSEMEM_VMEMMAP
    arm64: entry: fix context tracking for el0_sp_pc
    arm64: defconfig: enable memtest
    arm64: mm: remove reference to tlb.S from comment block
    arm64: Do not attempt to use init_mm in reset_context()
    arm64: KVM: Switch vgic save/restore to alternative_insn
    arm64: alternative: Introduce feature for GICv3 CPU interface
    arm64: psci: fix !CONFIG_HOTPLUG_CPU build warning
    arm64: fix bug for reloading FPSIMD state after CPU hotplug.
    arm64: kernel thread don't need to save fpsimd context.
    arm64: fix missing syscall trace exit
    arm64: alternative: Work around .inst assembler bugs
    arm64: alternative: Merge alternative-asm.h into alternative.h
    arm64: alternative: Allow immediate branch as alternative instruction
    arm64: Rework alternate sequence for ARM erratum 845719
    ...

    Linus Torvalds