28 Oct, 2016

1 commit

  • The per-zone waitqueues exist because of a scalability issue with the
    page waitqueues on some NUMA machines, but it turns out that they hurt
    normal loads, and now with the vmalloced stacks they also end up
    breaking gfs2 that uses a bit_wait on a stack object:

    wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

    where 'gh' can be a reference to the local variable 'mount_gh' on the
    stack of fill_super().

    The reason the per-zone hash table breaks for this case is that there is
    no "zone" for virtual allocations, and trying to look up the physical
    page to get at it will fail (with a BUG_ON()).

    It turns out that I actually complained to the mm people about the
    per-zone hash table for another reason just a month ago: the zone lookup
    also hurts the regular use of "unlock_page()" a lot, because the zone
    lookup ends up forcing several unnecessary cache misses and generates
    horrible code.

    As part of that earlier discussion, we had a much better solution for
    the NUMA scalability issue - by just making the page lock have a
    separate contention bit, the waitqueue doesn't even have to be looked at
    for the normal case.

    Peter Zijlstra already has a patch for that, but let's see if anybody
    even notices. In the meantime, let's fix the actual gfs2 breakage by
    simplifying the bitlock waitqueues and removing the per-zone issue.
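
    As a rough user-space sketch of the simplified scheme (not the kernel's
    actual code), a single global table can be indexed by hashing the
    address of the word together with the bit number. The table size, the
    hash constant and the struct name below are assumptions chosen for
    illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative table size; the real kernel constant may differ. */
#define WAIT_TABLE_BITS 8
#define WAIT_TABLE_SIZE (1u << WAIT_TABLE_BITS)

/* Stand-in for a kernel wait queue head. */
struct wait_queue_head {
    int nr_waiters;
};

static struct wait_queue_head bit_wait_table[WAIT_TABLE_SIZE];

/* Multiplicative hash of a 64-bit key down to WAIT_TABLE_BITS bits. */
static unsigned int hash_key(uint64_t val)
{
    return (unsigned int)((val * 0x9E3779B97F4A7C15ull) >>
                          (64 - WAIT_TABLE_BITS));
}

/*
 * Pick a waitqueue from the address of the word and the bit number alone.
 * No page or zone lookup is involved, so the word may live anywhere,
 * including on a vmalloc'ed stack.
 */
static struct wait_queue_head *bit_waitqueue(const void *word, unsigned int bit)
{
    assert(bit < 64);
    uint64_t key = ((uint64_t)(uintptr_t)word << 6) | bit;
    return &bit_wait_table[hash_key(key)];
}
```

    Because the lookup depends only on the pointer value, a stack object
    such as 'mount_gh' hashes to a queue just like any other address.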

    Reported-by: Andreas Gruenbacher
    Tested-by: Bob Peterson
    Acked-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Steven Whitehouse
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

25 Oct, 2016

2 commits

  • Pull clk fixes from Stephen Boyd:
    "This is the first batch of clk driver fixes for this release.

    We have a handful of fixes for the uniphier clk driver that was
    introduced recently, as well as Kconfig option hiding, module
    autoloading markings, and a few fixes for clk_hw based registration
    patches that went in this merge window"

    * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
    clk: at91: Fix a return value in case of error
    clk: uniphier: rename MIO clock to SD clock for Pro5, PXs2, LD20 SoCs
    clk: uniphier: fix memory overrun bug
    clk: hi6220: use CLK_OF_DECLARE_DRIVER for sysctrl and mediactrl clock init
    clk: mvebu: armada-37xx-periph: Fix the clock gate flag
    clk: bcm2835: Clamp the PLL's requested rate to the hardware limits.
    clk: max77686: fix number of clocks setup for clk_hw based registration
    clk: mvebu: armada-37xx-periph: Fix the clock provider registration
    clk: core: add __init decoration for CLK_OF_DECLARE_DRIVER function
    clk: mediatek: Add hardware dependency
    clk: samsung: clk-exynos-audss: Fix module autoload
    clk: uniphier: fix type of variable passed to regmap_read()
    clk: uniphier: add system clock support for sLD3 SoC

    Linus Torvalds
     
  • This patch unexports the low-level __get_user_pages() function.

    Recent refactoring of the get_user_pages* functions allows flags to be
    passed through get_user_pages() which eliminates the need for access to
    this function from its one user, kvm.

    We can see that the two calls to get_user_pages() which replace
    __get_user_pages() in kvm_main.c are equivalent by examining their call
    stacks:

    get_user_page_nowait():
    get_user_pages(start, 1, flags, page, NULL)
    __get_user_pages_locked(current, current->mm, start, 1, page, NULL, NULL,
    false, flags | FOLL_TOUCH)
    __get_user_pages(current, current->mm, start, 1,
    flags | FOLL_TOUCH | FOLL_GET, page, NULL, NULL)

    check_user_page_hwpoison():
    get_user_pages(addr, 1, flags, NULL, NULL)
    __get_user_pages_locked(current, current->mm, addr, 1, NULL, NULL, NULL,
    false, flags | FOLL_TOUCH)
    __get_user_pages(current, current->mm, addr, 1, flags | FOLL_TOUCH, NULL,
    NULL, NULL)
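
    The two call stacks above differ only in FOLL_GET, which shows up when
    a page array is supplied. A minimal model of that flag propagation,
    assuming the locked helper always adds FOLL_TOUCH and adds FOLL_GET
    only for a non-NULL 'pages' argument, might look like this. The
    function name and the flag values are illustrative, not taken from
    the kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag values; check include/linux/mm.h for the real ones. */
#define FOLL_WRITE 0x01u
#define FOLL_TOUCH 0x02u
#define FOLL_GET   0x04u

struct page;  /* opaque stand-in */

/*
 * Model of the flags that reach __get_user_pages() when a caller goes
 * through get_user_pages(): FOLL_TOUCH is always added, and FOLL_GET is
 * added only when the caller wants page references back.
 */
static unsigned int effective_gup_flags(unsigned int flags, struct page **pages)
{
    flags |= FOLL_TOUCH;
    if (pages)
        flags |= FOLL_GET;
    return flags;
}

/* The two KVM call sites from the commit message, expressed as checks. */
static void check_call_sites(void)
{
    struct page *page = NULL;

    /* get_user_page_nowait(): passes a page pointer */
    assert(effective_gup_flags(FOLL_WRITE, &page) ==
           (FOLL_WRITE | FOLL_TOUCH | FOLL_GET));
    /* check_user_page_hwpoison(): passes NULL */
    assert(effective_gup_flags(FOLL_WRITE, NULL) ==
           (FOLL_WRITE | FOLL_TOUCH));
}
```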

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Paolo Bonzini
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

24 Oct, 2016

2 commits

  • Pull SCSI target fixes from Nicholas Bellinger:
    "Here are the outstanding target-pending fixes for v4.9-rc2.

    This includes:

    - Fix v4.1.y+ reference leak regression with concurrent TMR
    ABORT_TASK + session shutdown. (Vaibhav Tandon)

    - Enable tcm_fc w/ SCF_USE_CPUID to avoid host exchange timeouts
    (Hannes)

    - target/user error sense handling fixes. (Andy + MNC + HCH)

    - Fix iscsi-target NOP_OUT error path iscsi_cmd descriptor leak
    (Varun)

    - Two EXTENDED_COPY SCSI status fixes for ESX VAAI (Dinesh Israni +
    Nixon Vincent)

    - Revert a v4.8 residual overflow change, that breaks sg_inq with
    small allocation lengths.

    There are a number of folks stress testing the v4.1.y regression fix
    in their environments, and more folks doing iser-target I/O stress
    testing atop recent v4.x.y code.

    There is also one v4.2.y+ RCU conversion regression related to
    explicit NodeACL configfs changes, that is still being tracked down"

    * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
    target/tcm_fc: use CPU affinity for responses
    target/tcm_fc: Update debugging statements to match libfc usage
    target/tcm_fc: return detailed error in ft_sess_create()
    target/tcm_fc: print command pointer in debug message
    target: fix potential race window in target_sess_cmd_list_waiting()
    Revert "target: Fix residual overflow handling in target_complete_cmd_with_length"
    target: Don't override EXTENDED_COPY xcopy_pt_cmd SCSI status code
    target: Make EXTENDED_COPY 0xe4 failure return COPY TARGET DEVICE NOT REACHABLE
    target: Re-add missing SCF_ACK_KREF assignment in v4.1.y
    iscsi-target: fix iscsi cmd leak
    iscsi-target: fix spelling mistake "Unsolicitied" -> "Unsolicited"
    target/user: Fix comments to not refer to data ring
    target/user: Return an error if cmd data size is too large
    target/user: Use sense_reason_t in tcmu_queue_cmd_ring

    Linus Torvalds
     
  • Pull IPMI updates from Corey Minyard:
    "A small bug fix and a new driver for acting as an IPMI device.

    I was on vacation during the merge window (a long vacation) but this
    is a bug fix that should go in and a new driver that shouldn't hurt
    anything.

    This has been in linux-next for a month or so"

    * tag 'for-linus-4.9-2' of git://git.code.sf.net/p/openipmi/linux-ipmi:
    ipmi: fix crash on reading version from proc after unregisted bmc
    ipmi/bt-bmc: remove redundant return value check of platform_get_resource()
    ipmi/bt-bmc: add a dependency on ARCH_ASPEED
    ipmi: Fix ioremap error handling in bt-bmc
    ipmi: add an Aspeed BT IPMI BMC driver

    Linus Torvalds
     

23 Oct, 2016

4 commits

  • Pull timer updates from Thomas Gleixner:
    "This update contains:

    - A revert which addresses a boot failure on ARM Sun5i platforms

    - A new clocksource driver, which has been delayed beyond rc1 due to
    an interrupt driver issue which was unearthed by this driver. The
    debugging of that issue and the discussion about the proper
    solution made this driver miss the merge window. There is no point
    in delaying it for a full cycle as it completes the basic mainline
    support for the new JCore platform and does not create any risk
    outside of that platform"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    Revert "clocksource/drivers/timer_sun5i: Replace code by clocksource_mmio_init"
    clocksource: Add J-Core timer/clocksource driver
    of: Add J-Core timer bindings

    Linus Torvalds
     
  • Pull x86 fixes from Ingo Molnar:
    "Three fixes, a hw-enablement and a cross-arch fix/enablement change:

    - SGI/UV fix for older platforms

    - x32 signal handling fix

    - older x86 platform bootup APIC fix

    - AVX512-4VNNIW (Neural Network Instructions) and AVX512-4FMAPS
    (Multiply Accumulation Single precision instructions) enablement.

    - move thread_info back into x86 specific code, to make life easier
    for other architectures trying to make use of
    CONFIG_THREAD_INFO_IN_TASK_STRUCT=y"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/boot/smp: Don't try to poke disabled/non-existent APIC
    sched/core, x86: Make struct thread_info arch specific again
    x86/signal: Remove bogus user_64bit_mode() check from sigaction_compat_abi()
    x86/platform/UV: Fix support for EFI_OLD_MEMMAP after BIOS callback updates
    x86/cpufeature: Add AVX512_4VNNIW and AVX512_4FMAPS features
    x86/vmware: Skip timer_irq_works() check on VMware

    Linus Torvalds
     
  • Pull vmap stack fixes from Ingo Molnar:
    "This is fallout from CONFIG_HAVE_ARCH_VMAP_STACK=y on x86: stack
    accesses that used to be just somewhat questionable are now totally
    buggy.

    These changes try to do it without breaking the ABI: the fields are
    left there, they are just reporting zero, or reporting narrower
    information (the maps file change)"

    * 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
    fs/proc: Stop trying to report thread stacks
    fs/proc: Stop reporting eip and esp in /proc/PID/stat
    mm/numa: Remove duplicated include from mprotect.c

    Linus Torvalds
     
  • Pull irq fixes from Ingo Molnar:
    "Mostly irqchip driver fixes, plus a symbol export"

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    kernel/irq: Export irq_set_parent()
    irqchip/gic: Add missing \n to CPU IF adjustment message
    irqchip/jcore: Don't show Kconfig menu item for driver
    irqchip/eznps: Drop pointless static qualifier in nps400_of_init()
    irqchip/gic-v3-its: Fix entry size mask for GITS_BASER
    irqchip/gic-v3-its: Fix 64bit GIC{R,ITS}_TYPER accesses

    Linus Torvalds
     

22 Oct, 2016

4 commits

  • Pull ACPI fixes from Rafael Wysocki:
    "These fix an issue related to system resume in the new WDAT-based
    watchdog driver and a return value of a stub function in the ACPI CPPC
    framework.

    Specifics:

    - Update the ACPI WDAT-based watchdog driver to ping the hardware
    during system resume to prevent a reset from occurring after the
    resume is complete (Mika Westerberg).

    - Fix the return value of the pcc_mbox_request_channel() stub for
    CONFIG_PCC unset (Hoan Tran)"

    * tag 'acpi-4.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    watchdog: wdat_wdt: Ping the watchdog on resume
    mailbox: PCC: Fix return value of pcc_mbox_request_channel()

    Linus Torvalds
     
  • * acpi-wdat:
    watchdog: wdat_wdt: Ping the watchdog on resume

    * acpi-cppc:
    mailbox: PCC: Fix return value of pcc_mbox_request_channel()

    Rafael J. Wysocki
     
  • …it/maz/arm-platforms into irq/urgent

    Pull GIC updates from Marc Zyngier:

    - Fix for 32bit accesses that should be 64bit on 64bit machines
    - Fix for a field decoding macro
    - Beautify a warning message

    Thomas Gleixner
     
  • Pull block fixes from Jens Axboe:
    "A set of fixes that missed the merge window, mostly due to me being
    away around that time.

    Nothing major here, a mix of nvme cleanups and fixes, and one fix for
    the badblocks handling"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    nvmet: use symbolic constants for CNS values
    nvme: use symbolic constants for CNS values
    nvme.h: add an enum for cns values
    nvme.h: don't use uuid_be
    nvme.h: resync with nvme-cli
    nvme: Add tertiary number to NVME_VS
    nvme : Add sysfs entry for NVMe CMBs when appropriate
    nvme: don't schedule multiple resets
    nvme: Delete created IO queues on reset
    nvme: Stop probing a removed device
    badblocks: fix overlapping check for clearing

    Linus Torvalds
     

21 Oct, 2016

3 commits

  • Pull power management fix from Rafael Wysocki:
    "This fixes the pointer arithmetics mess-up in the cpufreq core
    introduced by one of recent commits and leading to all kinds of
    breakage from kernel crashes to incorrect governor decisions (Sergey
    Senozhatsky)"

    * tag 'pm-4.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    cpufreq: fix overflow in cpufreq_table_find_index_dl()

    Linus Torvalds
     
  • * pm-cpufreq:
    cpufreq: fix overflow in cpufreq_table_find_index_dl()

    Rafael J. Wysocki
     
  • At the hardware level, the J-Core PIT is integrated with the interrupt
    controller, but it is represented as its own device and has an
    independent programming interface. It provides a 12-bit countdown
    timer, which is not presently used, and a periodic timer. The interval
    length for the latter is programmable via a 32-bit throttle register
    whose units are determined by a bus-period register. The periodic
    timer is used to implement both periodic and oneshot clock event
    modes; in oneshot mode the interrupt handler simply disables the timer
    as soon as it fires.

    Despite its device tree node representing an interrupt for the PIT,
    the actual irq generated is programmable, not hard-wired. The driver
    is responsible for programming the PIT to generate the hardware irq
    number that the DT assigns to it.

    On SMP configurations, J-Core provides cpu-local instances of the PIT;
    no broadcast timer is needed. This driver supports the creation of the
    necessary per-cpu clock_event_device instances.

    A nanosecond-resolution clocksource is provided using the J-Core "RTC"
    registers, which give a 64-bit seconds count and 32-bit nanoseconds
    that wrap every second. The driver converts these to a full-range
    32-bit nanoseconds count.
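
    The conversion described above can be sketched in plain C as follows.
    This is an illustration of the arithmetic only, assuming the seconds
    and nanoseconds values have already been read consistently from the
    hardware; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000u

/*
 * Fold a seconds count and a 0..999999999 nanoseconds value into a
 * free-running 32-bit nanoseconds counter. Only the low 32 bits of the
 * seconds count matter, because the result is taken modulo 2^32.
 */
static uint32_t rtc_to_ns32(uint32_t sec_lo, uint32_t nsec)
{
    assert(nsec < NSEC_PER_SEC);
    return sec_lo * NSEC_PER_SEC + nsec;  /* unsigned wraparound is intended */
}
```

    Differences between two such readings remain correct across a second
    boundary as long as they are computed in 32-bit unsigned arithmetic.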

    Signed-off-by: Rich Felker
    Cc: Mark Rutland
    Cc: devicetree@vger.kernel.org
    Cc: linux-sh@vger.kernel.org
    Cc: Daniel Lezcano
    Cc: Rob Herring
    Link: http://lkml.kernel.org/r/b591ff12cc5ebf63d1edc98da26046f95a233814.1476393790.git.dalias@libc.org
    Signed-off-by: Thomas Gleixner

    Rich Felker
     

20 Oct, 2016

8 commits

  • 'best' is always less than or equal to 'pos', so `best - pos' returns
    a negative value which is then cast to `unsigned int'
    and passed to __cpufreq_driver_target()->acpi_cpufreq_target()
    for policy->freq_table selection. This results in

    BUG: unable to handle kernel paging request at ffff881019b469f8
    IP: [] acpi_cpufreq_target+0x4f/0x190 [acpi_cpufreq]
    PGD 267f067
    PUD 0

    Oops: 0000 [#1] PREEMPT SMP
    CPU: 6 PID: 70 Comm: kworker/6:1 Not tainted 4.9.0-rc1-next-20161017-dbg-dirty
    Workqueue: events dbs_work_handler
    task: ffff88041b808000 task.stack: ffff88041b810000
    RIP: 0010:[] [] acpi_cpufreq_target+0x4f/0x190 [acpi_cpufreq]
    RSP: 0018:ffff88041b813c60 EFLAGS: 00010282
    RAX: ffff880419b46a00 RBX: ffff88041b848400 RCX: ffff880419b20f80
    RDX: 00000000001dff38 RSI: 00000000ffffffff RDI: ffff88041b848400
    RBP: ffff88041b813cb0 R08: 0000000000000006 R09: 0000000000000040
    R10: ffffffff8207f9e0 R11: ffffffff8173595b R12: 0000000000000000
    R13: ffff88041f1dff38 R14: 0000000000262900 R15: 0000000bfffffff4
    FS: 0000000000000000(0000) GS:ffff88041f000000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff881019b469f8 CR3: 000000041a2d3000 CR4: 00000000001406e0
    Stack:
    ffff88041b813cb0 ffffffff813347f9 ffff88041b813ca0 ffffffff81334663
    ffff88041f1d4bc0 ffff88041b848400 0000000000000000 0000000000000000
    0000000000262900 0000000000000000 ffff88041b813d00 ffffffff813355dc
    Call Trace:
    [] ? cpufreq_freq_transition_begin+0xf1/0xfc
    [] ? get_cpu_idle_time+0x97/0xa6
    [] __cpufreq_driver_target+0x3b6/0x44e
    [] cs_dbs_timer+0x11a/0x135
    [] dbs_work_handler+0x39/0x62
    [] process_one_work+0x280/0x4a5
    [] worker_thread+0x24f/0x397
    [] ? rescuer_thread+0x30b/0x30b
    [] ? nl80211_get_key+0x29/0x36a
    [] kthread+0xfc/0x104
    [] ? put_lock_stats.isra.9+0xe/0x20
    [] ? kthread_create_on_node+0x3f/0x3f
    [] ret_from_fork+0x22/0x30
    Code: 56 4d 6b ff 0c 41 55 41 54 53 48 83 ec 28 48 8b 15 ad 1e 00 00 44 8b 41
    08 48 8b 87 c8 00 00 00 49 89 d5 4e 03 2c c5 80 b2 78 81 8b 74 38 04 45
    3b 75 00 75 11 31 c0 83 39 00 0f 84 1c 01 00
    RIP [] acpi_cpufreq_target+0x4f/0x190 [acpi_cpufreq]
    RSP
    CR2: ffff881019b469f8
    ---[ end trace 16d9fc7a17897d37 ]---

    [ rjw: In some cases this bug may also cause incorrect frequencies to
    be selected by cpufreq governors. ]
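
    The failure mode can be reproduced in a few lines of user-space C. The
    struct and function names here are hypothetical, and the "fixed" form
    (measuring the index from the table base) is an assumption about the
    shape of the correction rather than a quote of the patch.

```c
#include <assert.h>
#include <stddef.h>

struct freq_entry {
    unsigned int frequency;
};

/* Buggy form: subtracts the cursor instead of the table base. */
static unsigned int index_buggy(const struct freq_entry *best,
                                const struct freq_entry *pos)
{
    assert(best <= pos);
    /* Zero or negative before the cast; a negative value wraps around. */
    return (unsigned int)(best - pos);
}

/* Intended form: the index is the distance from the start of the table. */
static unsigned int index_fixed(const struct freq_entry *table,
                                const struct freq_entry *best)
{
    assert(best >= table);
    return (unsigned int)(best - table);
}
```

    With 'best' one entry behind 'pos', index_buggy() yields the largest
    unsigned int value, which then indexes far outside the frequency
    table.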

    Fixes: 899bb6642f2a (cpufreq: skip invalid entries when searching the frequency)
    Link: http://marc.info/?l=linux-kernel&m=147672030714331&w=2
    Reported-and-tested-by: Sedat Dilek
    Reported-and-tested-by: Jörg Otte
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Viresh Kumar
    Cc: 4.8+ # 4.8+
    Signed-off-by: Rafael J. Wysocki

    Sergey Senozhatsky
     
  • The following commit:

    c65eacbe290b ("sched/core: Allow putting thread_info into task_struct")

    ... made 'struct thread_info' a generic struct with only a
    single ::flags member, if CONFIG_THREAD_INFO_IN_TASK_STRUCT=y is
    selected.

    This change however seems to be quite x86 centric, since at least the
    generic preemption code (asm-generic/preempt.h) assumes that struct
    thread_info also has a preempt_count member, which apparently was not
    true for x86.

    We could add a bit more #ifdefs to solve this problem too, but it seems
    to be much simpler to make struct thread_info arch specific
    again. This also makes the conversion to THREAD_INFO_IN_TASK_STRUCT a
    bit easier for architectures that have some arch specific members
    in their thread_info definition.

    The arch specific stuff _could_ be moved to thread_struct. However
    keeping them in thread_info makes it easier: accessing thread_info
    members is simple, since it is at the beginning of the task_struct,
    while the thread_struct is at the end. At least on s390 the offsets
    needed to access members of the thread_struct (with task_struct as
    base) are too large for various asm instructions. This is not a
    problem when keeping these members within thread_info.
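
    The offset argument can be illustrated with a miniature layout. All
    struct names and sizes below are invented for the example, and the
    12-bit displacement limit is just one example of an instruction
    encoding constraint.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature layout; sizes are invented for illustration. */
struct thread_info_demo {
    unsigned long flags;
    int preempt_count;
};

struct thread_struct_demo {
    unsigned long ksp;
};

struct task_struct_demo {
    struct thread_info_demo thread_info;  /* first member: offset 0 */
    char everything_else[8192];           /* scheduler, mm, fs state ... */
    struct thread_struct_demo thread;     /* last member: large offset */
};

/* thread_info sits at the very start of the task structure. */
static_assert(offsetof(struct task_struct_demo, thread_info) == 0,
              "thread_info must be the first member");

/* Example limit: an instruction with a 12-bit unsigned displacement. */
#define MAX_DISP12 4095u

/* Can a member at this offset be addressed relative to the task pointer? */
static int reachable_with_disp12(size_t offset)
{
    return offset <= MAX_DISP12;
}
```

    In this layout every thread_info member is reachable with a short
    displacement, while members of the trailing thread struct are not.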

    Signed-off-by: Heiko Carstens
    Signed-off-by: Mark Rutland
    Acked-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: keescook@chromium.org
    Cc: linux-arch@vger.kernel.org
    Link: http://lkml.kernel.org/r/1476901693-8492-2-git-send-email-mark.rutland@arm.com
    Signed-off-by: Ingo Molnar

    Heiko Carstens
     
  • Asking for a non-current task's stack can't be done without races
    unless the task is frozen in kernel mode. As far as I know,
    vm_is_stack_for_task() never had a safe non-current use case.

    The __unused annotation is because some KSTK_ESP implementations
    ignore their parameter, which IMO is further justification for this
    patch.

    Signed-off-by: Andy Lutomirski
    Acked-by: Thomas Gleixner
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Jann Horn
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Linux API
    Cc: Peter Zijlstra
    Cc: Tycho Andersen
    Link: http://lkml.kernel.org/r/4c3f68f426e6c061ca98b4fc7ef85ffbb0a25b0c.1475257877.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     
  • This patch addresses a bug where EXTENDED_COPY across multiple LUNs
    results in a CHECK_CONDITION when the source + destination are not
    located on the same physical node.

    ESX Host environments expect sense COPY_ABORTED w/ COPY TARGET DEVICE
    NOT REACHABLE to be returned when this occurs, in order to signal
    fallback to local copy method.

    As described in section 6.3.3 of spc4r22:

    "If it is not possible to complete processing of a segment because the
    copy manager is unable to establish communications with a copy target
    device, because the copy target device does not respond to INQUIRY,
    or because the data returned in response to INQUIRY indicates
    an unsupported logical unit, then the EXTENDED COPY command shall be
    terminated with CHECK CONDITION status, with the sense key set to
    COPY ABORTED, and the additional sense code set to COPY TARGET DEVICE
    NOT REACHABLE."

    Tested on v4.1.y with ESX v5.5u2+ with BlockCopy across multiple nodes.

    Reported-by: Nixon Vincent
    Tested-by: Nixon Vincent
    Cc: Nixon Vincent
    Tested-by: Dinesh Israni
    Signed-off-by: Dinesh Israni
    Cc: Dinesh Israni
    Cc: stable@vger.kernel.org # 3.14+
    Signed-off-by: Nicholas Bellinger

    Nicholas Bellinger
     
  • Ported over from nvme-cli.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Gabriel Krisman Bertazi
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This makes life easier for nvme-cli and we don't really need the uuid
    type anyway to start with.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Gabriel Krisman Bertazi
    Reviewed-by: Jay Freyensee
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Import a few updates to nvme.h from nvme-cli. This mostly includes a few
    new fields and error codes, but also a few renames that so far are only
    used in user space. Also one field is moved from an array of two le64
    values to one of 16 u8 values so that we can more easily access it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Reviewed-by: Gabriel Krisman Bertazi
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The NVMe 1.2.1 specification adds a tertiary element to the version number.
    This updates the macro and its callers to include the final number and
    fixes up a single place in nvmet where the version was generated manually.
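
    A three-argument version macro along these lines packs the major,
    minor and tertiary numbers into one 32-bit value. This is a sketch
    assuming the usual layout of the NVMe version register (major in bits
    31:16, minor in bits 15:8, tertiary in bits 7:0); the exact macro body
    in the kernel headers may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Major in bits 31:16, minor in bits 15:8, tertiary in bits 7:0. */
#define NVME_VS(major, minor, tertiary) \
    (((uint32_t)(major) << 16) | ((uint32_t)(minor) << 8) | \
     (uint32_t)(tertiary))

/* Version 1.2.1 encodes as 0x00010201. */
static_assert(NVME_VS(1, 2, 1) == 0x00010201u, "unexpected encoding");
```

    Packed this way, versions compare correctly as plain integers, so a
    check such as "at least 1.2.1" stays a single comparison.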

    Signed-off-by: Gabriel Krisman Bertazi
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Gabriel Krisman Bertazi
     

19 Oct, 2016

11 commits

  • Merge the gup_flags cleanups from Lorenzo Stoakes:
    "This patch series adjusts functions in the get_user_pages* family such
    that desired FOLL_* flags are passed as an argument rather than
    implied by flags.

    The purpose of this change is to make the use of FOLL_FORCE explicit
    so it is easier to grep for and clearer to callers that this flag is
    being used. The use of FOLL_FORCE is an issue as it overrides missing
    VM_READ/VM_WRITE flags for the VMA whose pages we are reading
    from/writing to, which can result in surprising behaviour.

    The patch series came out of the discussion around commit 38e088546522
    ("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing"),
    which addressed a BUG_ON() being triggered when a page was faulted in
    with PROT_NONE set but having been overridden by FOLL_FORCE.
    do_numa_page() was run on the assumption the page _must_ be one marked
    for NUMA node migration as an actual PROT_NONE page would have been
    dealt with prior to this code path, however FOLL_FORCE introduced a
    situation where this assumption did not hold.

    See

    https://marc.info/?l=linux-mm&m=147585445805166

    for the patch proposal"

    Additionally, there's a fix for an ancient bug related to FOLL_FORCE and
    FOLL_WRITE by me.

    [ This branch was rebased recently to add a few more acked-by's and
    reviewed-by's ]

    * gup_flag-cleanups:
    mm: replace access_process_vm() write parameter with gup_flags
    mm: replace access_remote_vm() write parameter with gup_flags
    mm: replace __access_remote_vm() write parameter with gup_flags
    mm: replace get_user_pages_remote() write/force parameters with gup_flags
    mm: replace get_user_pages() write/force parameters with gup_flags
    mm: replace get_vaddr_frames() write/force parameters with gup_flags
    mm: replace get_user_pages_locked() write/force parameters with gup_flags
    mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
    mm: remove write/force parameters from __get_user_pages_unlocked()
    mm: remove write/force parameters from __get_user_pages_locked()
    mm: remove gup_flags FOLL_WRITE games from __get_user_pages()

    Linus Torvalds
     
  • This removes the 'write' argument from access_process_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.
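
    The shape of the interface change can be sketched as below. Both
    helper names are hypothetical and the flag values are illustrative;
    the point is only that the old style hid FOLL_FORCE behind a boolean,
    while the new style makes callers spell it out.

```c
#include <assert.h>

/* Illustrative flag values; check include/linux/mm.h for the real ones. */
#define FOLL_WRITE 0x01u
#define FOLL_FORCE 0x10u

/* Before: a bare 'write' boolean, with FOLL_FORCE silently implied. */
static unsigned int old_style_flags(int write)
{
    unsigned int flags = FOLL_FORCE;

    if (write)
        flags |= FOLL_WRITE;
    return flags;
}

/* After: the caller passes exactly the flags it wants, nothing is added. */
static unsigned int new_style_flags(unsigned int gup_flags)
{
    /* This sketch models only the two flags defined above. */
    assert((gup_flags & ~(FOLL_WRITE | FOLL_FORCE)) == 0);
    return gup_flags;
}
```

    A converted caller that wants the old behaviour passes
    FOLL_FORCE | FOLL_WRITE (or just FOLL_FORCE for reads), which is easy
    to find with grep.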

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Acked-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' argument from access_remote_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' from get_user_pages_remote() and
    replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' from get_user_pages() and replaces
    them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers
    as use of this flag can result in surprising behaviour (and hence bugs)
    within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Christian König
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' from get_vaddr_frames() and
    replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' use from get_user_pages_locked()
    and replaces them with 'gup_flags' to make the use of FOLL_FORCE
    explicit in callers as use of this flag can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' use from get_user_pages_unlocked()
    and replaces them with 'gup_flags' to make the use of FOLL_FORCE
    explicit in callers as use of this flag can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the redundant 'write' and 'force' parameters from
    __get_user_pages_unlocked() to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Paolo Bonzini
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This is an ancient bug that was actually attempted to be fixed once
    (badly) by me eleven years ago in commit 4ceb5db9757a ("Fix
    get_user_pages() race for write access") but that was then undone due to
    problems on s390 by commit f33ea7f404e5 ("fix get_user_pages bug").

    In the meantime, the s390 situation has long been fixed, and we can now
    fix it by checking the pte_dirty() bit properly (and do it better). The
    s390 dirty bit was implemented in abf09bed3cce ("s390/mm: implement
    software dirty bits") which made it into v3.9. Earlier kernels will
    have to look at the page state itself.

    Also, the VM has become more scalable, and what used to be a purely
    theoretical race back then has become easier to trigger.

    To fix it, we introduce a new internal FOLL_COW flag to mark the "yes,
    we already did a COW" rather than play racy games with FOLL_WRITE that
    is very fundamental, and then use the pte dirty flag to validate that
    the FOLL_COW flag is still valid.
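
    A simplified model of the resulting check might look like this. The
    struct, the function name and the flag values are assumptions made
    for the sketch, as is the requirement that FOLL_FORCE be set on the
    forced path; it is not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; check include/linux/mm.h for the real ones. */
#define FOLL_WRITE 0x01u
#define FOLL_FORCE 0x10u
#define FOLL_COW   0x4000u

/* Stand-in for the two pte bits that matter here. */
struct pte_demo {
    bool writable;
    bool dirty;
};

/*
 * A write lookup may use the pte if it is writable, or if this is a
 * forced write that already went through COW and the pte is still dirty.
 * FOLL_WRITE itself is never cleared to signal "COW done".
 */
static bool can_follow_write(struct pte_demo pte, unsigned int flags)
{
    assert(flags & FOLL_WRITE);
    return pte.writable ||
           ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte.dirty);
}
```

    If the COW'ed page is dropped in the meantime, the new pte is not
    dirty, the check fails, and the fault path runs again instead of
    writing to the wrong page.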

    Reported-and-tested-by: Phil "not Paul" Oester
    Acked-by: Hugh Dickins
    Reviewed-by: Michal Hocko
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Oleg Nesterov
    Cc: Willy Tarreau
    Cc: Nick Piggin
    Cc: Greg Thelen
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Four tooling fixes, two kprobes KASAN related fixes and an x86 PMU
    driver fix/cleanup"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf jit: Fix build issue on Ubuntu
    perf jevents: Handle events including .c and .o
    perf/x86/intel: Remove an inconsistent NULL check
    kprobes: Unpoison stack in jprobe_return() for KASAN
    kprobes: Avoid false KASAN reports during stack copy
    perf header: Set nr_numa_nodes only when we parsed all the data
    perf top: Fix refreshing hierarchy entries on TUI

    Linus Torvalds
     

18 Oct, 2016

2 commits

  • The newly introduced macro CLK_OF_DECLARE_DRIVER is usually used to
    declare clock driver init functions, which are mostly decorated with
    __init. Add __init decoration for CLK_OF_DECLARE_DRIVER function to
    avoid causing section mismatch warnings on client clock drivers.

    Signed-off-by: Shawn Guo
    Fixes: c7296c51ce5d ("clk: core: New macro CLK_OF_DECLARE_DRIVER")
    Signed-off-by: Stephen Boyd

    Shawn Guo
     
  • pkey_set() and pkey_get() were syscalls present in older versions
    of the protection keys patches. They were fully excised from the
    x86 code, but some cruft was left in the generic syscall code. The
    C++ comments were intended to help to make it more glaring to me to
    fix them before actually submitting them. That technique worked,
    but later than I would have liked.

    I test-compiled this for arm64.

    Fixes: a60f7b69d92c0 ("generic syscalls: Wire up memory protection keys syscalls")
    Signed-off-by: Dave Hansen
    Acked-by: Arnd Bergmann
    Cc: Thomas Gleixner
    Cc: x86@kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: mgorman@techsingularity.net
    Cc: linux-api@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: luto@kernel.org
    Cc: akpm@linux-foundation.org
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

17 Oct, 2016

2 commits


16 Oct, 2016

1 commit

  • I observed false KASAN positives in the sctp code, when
    sctp uses jprobe_return() in jsctp_sf_eat_sack().

    The stray 0xf4 in shadow memory are stack redzones:

    [ ] ==================================================================
    [ ] BUG: KASAN: stack-out-of-bounds in memcmp+0xe9/0x150 at addr ffff88005e48f480
    [ ] Read of size 1 by task syz-executor/18535
    [ ] page:ffffea00017923c0 count:0 mapcount:0 mapping: (null) index:0x0
    [ ] flags: 0x1fffc0000000000()
    [ ] page dumped because: kasan: bad access detected
    [ ] CPU: 1 PID: 18535 Comm: syz-executor Not tainted 4.8.0+ #28
    [ ] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    [ ] ffff88005e48f2d0 ffffffff82d2b849 ffffffff0bc91e90 fffffbfff10971e8
    [ ] ffffed000bc91e90 ffffed000bc91e90 0000000000000001 0000000000000000
    [ ] ffff88005e48f480 ffff88005e48f350 ffffffff817d3169 ffff88005e48f370
    [ ] Call Trace:
    [ ] [] dump_stack+0x12e/0x185
    [ ] [] kasan_report+0x489/0x4b0
    [ ] [] __asan_report_load1_noabort+0x19/0x20
    [ ] [] memcmp+0xe9/0x150
    [ ] [] depot_save_stack+0x176/0x5c0
    [ ] [] save_stack+0xb1/0xd0
    [ ] [] kasan_slab_free+0x72/0xc0
    [ ] [] kfree+0xc8/0x2a0
    [ ] [] skb_free_head+0x79/0xb0
    [ ] [] skb_release_data+0x37a/0x420
    [ ] [] skb_release_all+0x4f/0x60
    [ ] [] consume_skb+0x138/0x370
    [ ] [] sctp_chunk_put+0xcb/0x180
    [ ] [] sctp_chunk_free+0x58/0x70
    [ ] [] sctp_inq_pop+0x68f/0xef0
    [ ] [] sctp_assoc_bh_rcv+0xd6/0x4b0
    [ ] [] sctp_inq_push+0x131/0x190
    [ ] [] sctp_backlog_rcv+0xe9/0xa20
    [ ... ]
    [ ] Memory state around the buggy address:
    [ ] ffff88005e48f380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ ] ffff88005e48f400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ ] >ffff88005e48f480: f4 f4 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ ] ^
    [ ] ffff88005e48f500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ ] ffff88005e48f580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ ] ==================================================================

    KASAN stack instrumentation poisons stack redzones on function entry
    and unpoisons them on function exit. If a function exits abnormally
    (e.g. with a longjmp like jprobe_return()), stack redzones are left
    poisoned. Later this leads to random KASAN false reports.

    Unpoison stack redzones in the frames we are going to jump over
    before doing actual longjmp in jprobe_return().
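
    The effect can be modelled in user space with a toy shadow array and
    setjmp()/longjmp(). Everything here (slot numbers, helper names, the
    one-byte-per-frame shadow) is a simplification invented for the
    sketch; it only mirrors the idea of clearing redzones for the frames
    that a non-local jump skips.

```c
#include <assert.h>
#include <setjmp.h>
#include <string.h>

#define STACK_SLOTS 16
#define REDZONE 0xf4

static unsigned char shadow[STACK_SLOTS];  /* one shadow byte per frame */
static jmp_buf probe_return;

/* Instrumented function entry and exit. */
static void frame_enter(int slot)
{
    assert(slot >= 0 && slot < STACK_SLOTS);
    shadow[slot] = REDZONE;
}

static void frame_exit(int slot)
{
    shadow[slot] = 0;
}

/* Clear the shadow of every frame in [from, to) that the jump skips. */
static void unpoison_frames(int from, int to)
{
    assert(0 <= from && from <= to && to <= STACK_SLOTS);
    memset(shadow + from, 0, (size_t)(to - from));
}

static void inner(int unpoison_first)
{
    frame_enter(2);
    if (unpoison_first)
        unpoison_frames(1, 3);
    longjmp(probe_return, 1);  /* frame_exit() never runs for slots 1, 2 */
}

static void outer(int unpoison_first)
{
    frame_enter(1);
    inner(unpoison_first);
    frame_exit(1);             /* not reached */
}

/* Number of redzones still poisoned after the non-local jump. */
static int stale_redzones(int unpoison_first)
{
    int n = 0;

    memset(shadow, 0, sizeof(shadow));
    if (setjmp(probe_return) == 0)
        outer(unpoison_first);
    for (int i = 0; i < STACK_SLOTS; i++)
        n += shadow[i] == REDZONE;
    return n;
}
```

    Without the unpoisoning step two stale redzones remain, and a later
    access to that stack memory would be reported as a false positive.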

    Signed-off-by: Dmitry Vyukov
    Acked-by: Masami Hiramatsu
    Reviewed-by: Mark Rutland
    Cc: Mark Rutland
    Cc: Catalin Marinas
    Cc: Andrey Ryabinin
    Cc: Lorenzo Pieralisi
    Cc: Alexander Potapenko
    Cc: Will Deacon
    Cc: Andrew Morton
    Cc: Ananth N Mavinakayanahalli
    Cc: Anil S Keshavamurthy
    Cc: "David S. Miller"
    Cc: Masami Hiramatsu
    Cc: kasan-dev@googlegroups.com
    Cc: surovegin@google.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/1476454043-101898-1-git-send-email-dvyukov@google.com
    Signed-off-by: Ingo Molnar

    Dmitry Vyukov