01 Nov, 2014

8 commits

  • Pull x86 fixes from Ingo Molnar:
    "Fixes from all around the place:

    - hyper-V 32-bit PAE guest kernel fix
    - two IRQ allocation fixes on certain x86 boards
    - intel-mid boot crash fix
    - intel-quark quirk
    - /proc/interrupts duplicate irq chip name fix
    - cma boot crash fix
    - syscall audit fix
    - boot crash fix with certain TSC configurations (seen on Qemu)
    - smpboot.c build warning fix"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, pageattr: Prevent overflow in slow_virt_to_phys() for X86_PAE
    ACPI, irq, x86: Return IRQ instead of GSI in mp_register_gsi()
    x86, intel-mid: Create IRQs for APB timers and RTC timers
    x86: Don't enable F00F workaround on Intel Quark processors
    x86/irq: Fix XT-PIC-XT-PIC in /proc/interrupts
    x86, cma: Reserve DMA contiguous area after initmem_init()
    i386/audit: stop scribbling on the stack frame
    x86, apic: Handle a bad TSC more gracefully
    x86: ACPI: Do not translate GSI number if IOAPIC is disabled
    x86/smpboot: Move data structure to its primary usage scope

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar:
    "Various scheduler fixes all over the place: three SCHED_DL fixes,
    three sched/numa fixes, two generic race fixes and a comment fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/dl: Fix preemption checks
    sched: Update comments for CLONE_NEWNS
    sched: stop the unbound recursion in preempt_schedule_context()
    sched/fair: Fix division by zero sysctl_numa_balancing_scan_size
    sched/fair: Care divide error in update_task_scan_period()
    sched/numa: Fix unsafe get_task_struct() in task_numa_assign()
    sched/deadline: Fix races between rt_mutex_setprio() and dl_task_timer()
    sched/deadline: Don't replenish from a !SCHED_DEADLINE entity
    sched: Fix race between task_group and sched_task_group

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Mostly tooling fixes, plus on the kernel side:

    - a revert for a newly introduced PMU driver which isn't complete yet
    and where we ran out of time with fixes (to be tried again in
    v3.19) - this makes up for a large chunk of the diffstat.

    - compilation warning fixes

    - a printk message fix

    - event_idx usage fixes/cleanups"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf probe: Trivial typo fix for --demangle
    perf tools: Fix report -F dso_from for data without branch info
    perf tools: Fix report -F dso_to for data without branch info
    perf tools: Fix report -F symbol_from for data without branch info
    perf tools: Fix report -F symbol_to for data without branch info
    perf tools: Fix report -F mispredict for data without branch info
    perf tools: Fix report -F in_tx for data without branch info
    perf tools: Fix report -F abort for data without branch info
    perf tools: Make CPUINFO_PROC an array to support different kernel versions
    perf callchain: Use global caching provided by libunwind
    perf/x86/intel: Revert incomplete and undocumented Broadwell client support
    perf/x86: Fix compile warnings for intel_uncore
    perf: Fix typos in sample code in the perf_event.h header
    perf: Fix and clean up initialization of pmu::event_idx
    perf: Fix bogus kernel printk
    perf diff: Add missing hists__init() call at tool start

    Linus Torvalds
     
  • Pull futex fixes from Ingo Molnar:
    "This contains two futex fixes: one fixes a race condition, the other
    clarifies shared/private futex comments"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    futex: Fix a race condition between REQUEUE_PI and task death
    futex: Mention key referencing differences between shared and private futexes

    Linus Torvalds
     
  • Pull core fixes from Ingo Molnar:
    "The tree contains two RCU fixes and a compiler quirk comment fix"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rcu: Make rcu_barrier() understand about missing rcuo kthreads
    compiler/gcc4+: Remove inaccurate comment about 'asm goto' miscompiles
    rcu: More on deadlock between CPU hotplug and expedited grace periods

    Linus Torvalds
     
  • Pull timer fixes from Thomas Gleixner:
    "As you requested in the rc2 release mail the timer department serves
    you a few real bug fixes:

    - Fix the probe logic of the architected arm/arm64 timer
    - Plug a stack info leak in posix-timers
    - Prevent a shift out of bounds issue in the clockevents core"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    ARM/ARM64: arch-timer: fix arch_timer_probed logic
    clockevents: Prevent shift out of bounds
    posix-timers: Fix stack info leak in timer_create()

    Linus Torvalds
     
  • …/git/rostedt/linux-trace

    Pull tracing fix from Steven Rostedt:
    "ARM has system calls outside the NR_syscalls range, and the generic
    tracing system does not support that and without checks, it can cause
    an oops to be reported.

    Rabin Vincent added checks in the return code on syscall events to
    make sure that the system call number is within the range that tracing
    knows about, and if not, simply ignores the system call.

    The system call tracing infrastructure needs to be rewritten to handle
    these cases better, but for now, to keep from oopsing, this patch will
    do"

    * tag 'trace-fixes-v3.18-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing/syscalls: Ignore numbers outside NR_syscalls' range

    Linus Torvalds
     
  • Pull documentation fixes from Jonathan Corbet:
    "So this is my first pull request since I rashly agreed to look after
    the documentation subtree. It contains some typo fixes, a few minor
    documentation improvements, and, most importantly, fixes for a couple
    of build problems in various bits of sample code.

    I fully intend to start sending pull requests with signed tags.
    However, due to poor planning on my part and the general obnoxiousness
    of life, I'm 2000 miles away from my private key which is sitting on a
    powered-down machine. This should be fixed before my next request.

    Meanwhile git.lwn.net is a machine under my control, the patches are
    all trivial, and all have done time in linux-next"

    * tag 'docs-for-linus' of git://git.lwn.net/linux-2.6:
    Documentation/SubmittingPatches: Reported-by tags and permission
    Documentation: remove outdated references to the linux-next wiki
    Documentation: Restrict TSC test code to x86
    doc: kernel-parameters.txt: Add ide-generic.probe-mask
    vdso: don't require 64-bit math in standalone test
    Documentation: Add CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF case
    Documentation: Add default kmemleak off case in kernel-parameters.txt
    Docs: Document that the sticky bit is understood by hugetlbfs
    DocBook: Reduce noise from make cleandocs
    Documentation: fix vdso_standalone_test_x86 on 32-bit
    Documentation: dt-bindings: Explain order in patch series
    Documentation/ABI/testing/sysfs-ibft: fix a typo

    Linus Torvalds
     

31 Oct, 2014

4 commits

  • ARM has some private syscalls (for example, set_tls(2)) which lie
    outside the range of NR_syscalls. If any of these are called while
    syscall tracing is being performed, out-of-bounds array access will
    occur in the ftrace and perf sys_{enter,exit} handlers.

    # trace-cmd record -e raw_syscalls:* true && trace-cmd report
    ...
    true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0)
    true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264
    true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1)
    true-653 [000] 384.675988: sys_exit: NR 983045 = 0
    ...

    # trace-cmd record -e syscalls:* true
    [ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace
    [ 17.289590] pgd = 9e71c000
    [ 17.289696] [aaaaaace] *pgd=00000000
    [ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
    [ 17.290169] Modules linked in:
    [ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21
    [ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000
    [ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8
    [ 17.290866] LR is at syscall_trace_enter+0x124/0x184

    Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers.

    Commit cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls"
    added the check for less than zero, but it should have also checked
    for greater than NR_syscalls.
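
    As a rough illustration of the range check described above, here is a
    minimal user-space sketch. DEMO_NR_SYSCALLS and
    demo_syscall_nr_traceable() are hypothetical names invented for this
    example; they are not the kernel's identifiers, and the table size is
    an arbitrary assumption.

```c
#include <stdbool.h>

/* Hypothetical table size; the real NR_syscalls depends on the architecture. */
#define DEMO_NR_SYSCALLS 400

/*
 * Returns true only when the syscall number can safely index a
 * per-syscall table: it must be non-negative AND below the table size.
 * Checking only for negative values lets large private numbers through.
 */
bool demo_syscall_nr_traceable(int syscall_nr)
{
    return syscall_nr >= 0 && syscall_nr < DEMO_NR_SYSCALLS;
}
```

    With this assumed table size, a number like 192 from the trace above
    is accepted, while the ARM private number 983045 is ignored instead of
    being used as an array index.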

    Link: http://lkml.kernel.org/p/1414620418-29472-1-git-send-email-rabin@rab.in

    Fixes: cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls"
    Cc: stable@vger.kernel.org # 2.6.33+
    Signed-off-by: Rabin Vincent
    Signed-off-by: Steven Rostedt

    Rabin Vincent
     
  • The man page for open(2) indicates that when O_CREAT is specified, the
    'mode' argument applies only to future accesses to the file:

    Note that this mode applies only to future accesses of the newly
    created file; the open() call that creates a read-only file
    may well return a read/write file descriptor.

    The man page for open(2) implies that 'mode' is treated identically by
    O_CREAT and O_TMPFILE.

    O_TMPFILE, however, behaves differently:

    int fd = open("/tmp", O_TMPFILE | O_RDWR, 0);
    assert(fd == -1);
    assert(errno == EACCES);

    int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);
    assert(fd > 0);

    For O_CREAT, do_last() sets acc_mode to MAY_OPEN only:

    if (*opened & FILE_CREATED) {
        /* Don't check for write permission, don't truncate */
        open_flag &= ~O_TRUNC;
        will_truncate = false;
        acc_mode = MAY_OPEN;
        path_to_nameidata(path, nd);
        goto finish_open_created;
    }

    But for O_TMPFILE, do_tmpfile() passes the full op->acc_mode to
    may_open().

    This patch lines up the behavior of O_TMPFILE with O_CREAT. After the
    inode is created, may_open() is called with acc_mode = MAY_OPEN, in
    do_tmpfile().

    A different, but related glibc bug revealed the discrepancy:
    https://sourceware.org/bugzilla/show_bug.cgi?id=17523

    The glibc lazily loads the 'mode' argument of open() and openat() using
    va_arg() only if O_CREAT is present in 'flags' (to support both the 2
    argument and the 3 argument forms of open; same idea for openat()).
    However, the glibc ignores the 'mode' argument if O_TMPFILE is in
    'flags'.

    On x86_64, for open(), it magically works anyway, as 'mode' is in
    RDX when entering open(), and is still in RDX on SYSCALL, which is where
    the kernel looks for the 3rd argument of a syscall.

    But openat() is not quite so lucky: 'mode' is in RCX when entering the
    glibc wrapper for openat(), while the kernel looks for the 4th argument
    of a syscall in R10. Indeed, the syscall calling convention differs from
    the regular calling convention in this respect on x86_64. So the kernel
    sees mode = 0 when trying to use glibc openat() with O_TMPFILE, and
    fails with EACCES.
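
    The lazy loading of 'mode' described above can be sketched with a
    pair of variadic helpers. This is only an illustration under stated
    assumptions: the DEMO_* flag values are stand-ins so the sketch builds
    anywhere, both function names are hypothetical, and the second helper
    is one possible way to treat O_TMPFILE like O_CREAT, not necessarily
    what glibc does.

```c
#include <stdarg.h>

/* Illustrative stand-ins for the real flag bits (assumed values). */
#define DEMO_O_CREAT     0x0040
#define DEMO_O_DIRECTORY 0x10000
#define DEMO_O_TMPFILE   (0x400000 | DEMO_O_DIRECTORY)

/* Reads 'mode' only when O_CREAT is set, as the text describes. */
unsigned int demo_mode_creat_only(int flags, ...)
{
    unsigned int mode = 0;

    if (flags & DEMO_O_CREAT) {
        va_list ap;

        va_start(ap, flags);
        mode = va_arg(ap, unsigned int);
        va_end(ap);
    }
    return mode;
}

/* Hypothetical variant that also reads 'mode' when O_TMPFILE is set. */
unsigned int demo_mode_creat_or_tmpfile(int flags, ...)
{
    unsigned int mode = 0;

    if ((flags & DEMO_O_CREAT) ||
        (flags & DEMO_O_TMPFILE) == DEMO_O_TMPFILE) {
        va_list ap;

        va_start(ap, flags);
        mode = va_arg(ap, unsigned int);
        va_end(ap);
    }
    return mode;
}
```

    Calling demo_mode_creat_only(DEMO_O_TMPFILE, 0600u) yields 0, which
    mirrors how a wrapper can end up handing mode = 0 to the kernel even
    though the caller supplied a mode.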

    Signed-off-by: Eric Rannaud
    Acked-by: Andy Lutomirski
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Eric Rannaud
     
  • Pull fbdev fixes from Tomi Valkeinen:

    - fix fb console option parsing

    - fixes for OMAPDSS/OMAPFB crashes related to module unloading and
    device/driver binding & unbinding.

    - fix for OMAP HDMI PLL locking failing in certain cases

    - misc minor fixes for atmel lcdfb and OMAP

    * tag 'fbdev-fixes-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux:
    omap: dss: connector-analog-tv: Add missing module device table
    OMAPDSS: DSI: Fix PLL_SELFEQDCO field width
    OMAPDSS: fix dispc register dump for preload & mflag
    OMAPDSS: DISPC: fix mflag offset
    OMAPDSS: HDMI: fix regsd write
    OMAPDSS: HDMI: fix PLL GO bit handling
    OMAPFB: fix releasing overlays
    OMAPFB: fix overlay disable when freeing resources.
    OMAPDSS: apply: wait pending updates on manager disable
    OMAPFB: remove __exit annotation
    OMAPDSS: set suppress_bind_attrs
    OMAPFB: add missing MODULE_ALIAS()
    drivers: video: fbdev: atmel_lcdfb.c: remove unnecessary header
    video/console: Resolve several shadow warnings
    fbcon: Fix option parsing control flow in fb_console_setup

    Linus Torvalds
     
  • Pull sound fixes from Takashi Iwai:
    "Although the diffstat looks scary, it's just because of the removal of
    the dead code (s6000), thus it must not affect anything serious.

    Other than that, all small fixes. The only core fix is zero-clear for
    a PCM compat ioctl. The rest are driver-specific, bebob, sgtl500,
    adau1761, intel-sst, ad1889 and a few HD-audio quirks as usual"

    * tag 'sound-3.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
    ALSA: hda - Add workaround for CMI8888 snoop behavior
    ALSA: pcm: Zero-clear reserved fields of PCM status ioctl in compat mode
    ALSA: bebob: Uninitialized id returned by saffirepro_both_clk_src_get
    ALSA: hda/realtek - New SSID for Headset quirk
    ALSA: ad1889: Fix probable mask then right shift defects
    ALSA: bebob: fix wrong decoding of clock information for Terratec PHASE 88 Rack FW
    ALSA: hda/realtek - Update restore default value for ALC283
    ALSA: hda/realtek - Update restore default value for ALC282
    ASoC: fsl: use strncpy() to prevent copying of over-long names
    ASoC: adau1761: Fix input PGA volume
    ASoC: s6000: remove driver
    ASoC: Intel: HSW/BDW only support S16 and S24 formats.
    ASoC: sgtl500: Document the required supplies

    Linus Torvalds
     

30 Oct, 2014

28 commits

  • Tomi Valkeinen
     
    Without this fix, the connector-analog-tv driver isn't probed when
    compiled as a module.

    Signed-off-by: H. Nikolaus Schaller
    Signed-off-by: Tomi Valkeinen

    Marek Belisko
     
  • …/paulmck/linux-rcu into core/urgent

    Pull two RCU fixes from Paul E. McKenney:

    " - Complete the work of commit dd56af42bd82 (rcu: Eliminate deadlock
    between CPU hotplug and expedited grace periods), which was
    intended to allow synchronize_sched_expedited() to be safely
    used when holding locks acquired by CPU-hotplug notifiers.
    This commit makes the put_online_cpus() avoid the deadlock
    instead of just handling the get_online_cpus().

    - Complete the work of commit 35ce7f29a44a (rcu: Create rcuo
    kthreads only for onlined CPUs), which was intended to allow
    RCU to avoid allocating unneeded kthreads on systems where the
    firmware says that there are more CPUs than are really present.
    This commit makes rcu_barrier() aware of the mismatch, so that
    it doesn't hang waiting for non-existent CPUs. "

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • …it/acme/linux into perf/urgent

    Pull perf/urgent fixes from Arnaldo Carvalho de Melo:

    - Fix report -F (abort, in_tx, mispredict, etc) segfaults for sample.data files
    without branch info (Jiri Olsa)

    - Add patch that should have gone in a previous patchkit to use global cache
    provided by libunwind (Namhyung Kim)

    - Make CPUINFO_PROC an array to support different kernels, problem
    detected when the information reported via /proc/cpuinfo changed on ARM (Wang Nan)

    - 'perf probe' --demangle typo fix and a new --quiet option (Masami Hiramatsu)

    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • Merge misc fixes from Andrew Morton:
    "21 fixes"

    * emailed patches from Andrew Morton: (21 commits)
    mm/balloon_compaction: fix deflation when compaction is disabled
    sh: fix sh770x SCIF memory regions
    zram: avoid NULL pointer access in concurrent situation
    mm/slab_common: don't check for duplicate cache names
    ocfs2: fix d_splice_alias() return code checking
    mm: rmap: split out page_remove_file_rmap()
    mm: memcontrol: fix missed end-writeback page accounting
    mm: page-writeback: inline account_page_dirtied() into single caller
    lib/bitmap.c: fix undefined shift in __bitmap_shift_{left|right}()
    drivers/rtc/rtc-bq32k.c: fix register value
    memory-hotplug: clear pgdat which is allocated by bootmem in try_offline_node()
    drivers/rtc/rtc-s3c.c: fix initialization failure without rtc source clock
    kernel/kmod: fix use-after-free of the sub_info structure
    drivers/rtc/rtc-pm8xxx.c: rework to support pm8941 rtc
    mm, thp: fix collapsing of hugepages on madvise
    drivers: of: add return value to of_reserved_mem_device_init()
    mm: free compound page with correct order
    gcov: add ARM64 to GCOV_PROFILE_ALL
    fsnotify: next_i is freed during fsnotify_unmount_inodes.
    mm/compaction.c: avoid premature range skip in isolate_migratepages_range
    ...

    Linus Torvalds
     
    If CONFIG_BALLOON_COMPACTION=n, balloon_page_insert() does not link pages
    with the balloon and doesn't set the PagePrivate flag. As a result,
    balloon_page_dequeue() cannot get any pages because it thinks that all
    of them are isolated. Without balloon compaction nobody can isolate
    ballooned pages, so it's safe to remove this check.

    Fixes: d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management").
    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Matt Mullins
    Cc: [3.17]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Resources scif1_resources & scif2_resources overlap. Actual SCIF region
    size is 0x10.

    This is a regression from commit d850acf975be ("sh: Declare SCIF register
    base and IRQ as resources")

    Signed-off-by: Andriy Skulysh
    Acked-by: Laurent Pinchart
    Cc: Geert Uytterhoeven
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andriy Skulysh
     
    There is a rare NULL pointer bug in mem_used_total_show() and
    mem_used_max_store() in a concurrent situation, like this:

    zram is not initialized, process A is a mem_used_total reader which runs
    periodically, while process B tries to init zram.

    process A                                  process B
    access meta, get a NULL value
                                               init zram, done
    init_done() is true
    access meta->mem_pool, get a NULL pointer BUG

    This patch fixes this issue.

    Signed-off-by: Weijie Yang
    Acked-by: Minchan Kim
    Acked-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
    The SLUB cache merges caches with the same size and alignment and there
    was a long-standing bug with this behavior:

    - create the cache named "foo"
    - create the cache named "bar" (which is merged with "foo")
    - delete the cache named "foo" (but it stays allocated because "bar"
    uses it)
    - create the cache named "foo" again - it fails because the name "foo"
    is already used

    That bug was fixed in commit 694617474e33 ("slab_common: fix the check
    for duplicate slab names") by not warning on duplicate cache names when
    the SLUB subsystem is used.

    Recently, cache merging was implemented with the SLAB subsystem too, in
    12220dea07f1 ("mm/slab: support slab merge"). Therefore we need to stop
    checking for duplicate names even for the SLAB subsystem.

    This patch fixes the bug by removing the check.
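
    The four-step scenario above can be replayed with a toy model. This is
    a simplified sketch, not the slab allocator: demo_cache,
    demo_cache_create() and demo_cache_destroy() are hypothetical names,
    merging is assumed to key on size only, and the check_names parameter
    exists purely to switch the duplicate-name check on and off.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_CACHES 8

struct demo_cache {
    const char *name;   /* name given by the first creator */
    size_t size;
    int refcount;       /* 0 means the slot is unused */
};

static struct demo_cache demo_caches[DEMO_MAX_CACHES];

/* Creates or merges a cache; returns NULL on failure. */
struct demo_cache *demo_cache_create(const char *name, size_t size,
                                     bool check_names)
{
    struct demo_cache *free_slot = NULL;
    size_t i;

    /* Optional duplicate-name check against every live cache. */
    if (check_names) {
        for (i = 0; i < DEMO_MAX_CACHES; i++) {
            if (demo_caches[i].refcount > 0 &&
                strcmp(demo_caches[i].name, name) == 0)
                return NULL;
        }
    }

    for (i = 0; i < DEMO_MAX_CACHES; i++) {
        struct demo_cache *c = &demo_caches[i];

        if (c->refcount > 0 && c->size == size) {
            c->refcount++;          /* merge into the existing cache */
            return c;
        }
        if (c->refcount == 0 && free_slot == NULL)
            free_slot = c;
    }

    if (free_slot == NULL)
        return NULL;
    free_slot->name = name;
    free_slot->size = size;
    free_slot->refcount = 1;
    return free_slot;
}

/* Drops one reference; the cache stays alive while others use it. */
void demo_cache_destroy(struct demo_cache *c)
{
    if (c != NULL && c->refcount > 0)
        c->refcount--;
}
```

    In this model, creating "foo", creating "bar" with the same size,
    destroying "foo", and creating "foo" again fails while the name check
    is enabled, because the merged cache still carries the name "foo".
    With the check disabled, the last step succeeds.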

    Signed-off-by: Mikulas Patocka
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
    d_splice_alias() can return a valid dentry, NULL or an ERR_PTR.
    Currently the code does not check for ERR_PTR and will cause an oops in
    ocfs2_dentry_attach_lock(). Fix this by using IS_ERR_OR_NULL().
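
    For readers unfamiliar with the error-pointer convention, the
    following user-space sketch shows why a plain NULL test is not enough.
    The demo_* helpers are hypothetical re-implementations written for
    this example; they assume, as the kernel convention does, that error
    codes are small negative values stored in the top of the address
    range.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed upper bound on errno values encoded into a pointer. */
#define DEMO_MAX_ERRNO 4095

/* Encodes a negative errno value as a pointer. */
void *demo_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* True when the pointer value falls in the reserved error range. */
bool demo_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* True for NULL as well as for encoded errors. */
bool demo_is_err_or_null(const void *ptr)
{
    return ptr == NULL || demo_is_err(ptr);
}
```

    An encoded error is non-NULL, so a caller that only tests for NULL
    would go on to dereference it; a combined check covers all three
    return cases listed above.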

    Signed-off-by: Richard Weinberger
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • page_remove_rmap() has too many branches on PageAnon() and is hard to
    follow. Move the file part into a separate function.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") changed
    page migration to uncharge the old page right away. The page is locked,
    unmapped, truncated, and off the LRU, but it could race with writeback
    ending, which then doesn't unaccount the page properly:

    test_clear_page_writeback()              migration
                                               wait_on_page_writeback()
      TestClearPageWriteback()
                                               mem_cgroup_migrate()
                                                 clear PCG_USED
      mem_cgroup_update_page_stat()
        if (PageCgroupUsed(pc))
          decrease memcg pages under writeback

      release pc->mem_cgroup->move_lock

    The per-page statistics interface is heavily optimized to avoid a
    function call and a lookup_page_cgroup() in the file unmap fast path,
    which means it doesn't verify whether a page is still charged before
    clearing PageWriteback() and it has to do it in the stat update later.

    Rework it so that it looks up the page's memcg once at the beginning of
    the transaction and then uses it throughout. The charge will be
    verified before clearing PageWriteback() and migration can't uncharge
    the page as long as that is still set. The RCU lock will protect the
    memcg past uncharge.

    As far as losing the optimization goes, the following test results are
    from a microbenchmark that maps, faults, and unmaps a 4GB sparse file
    three times in a nested fashion, so that there are two negative passes
    that don't account but still go through the new transaction overhead.
    There is no actual difference:

    old: 33.195102545 seconds time elapsed ( +- 0.01% )
    new: 33.199231369 seconds time elapsed ( +- 0.03% )

    The time spent in page_remove_rmap()'s callees still adds up to the
    same, but the time spent in the function itself seems reduced:

    # Children Self Command Shared Object Symbol
    old: 0.12% 0.11% filemapstress [kernel.kallsyms] [k] page_remove_rmap
    new: 0.12% 0.08% filemapstress [kernel.kallsyms] [k] page_remove_rmap

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [3.17.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • A follow-up patch would have changed the call signature. To save the
    trouble, just fold it instead.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [3.17.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • If __bitmap_shift_left() or __bitmap_shift_right() are asked to shift by
    a multiple of BITS_PER_LONG, they will try to shift a long value by
    BITS_PER_LONG bits which is undefined. Change the functions to avoid
    the undefined shift.
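
    One way to avoid the undefined shift is to skip the cross-word term
    whenever the bit remainder is zero. The sketch below is an
    illustration only: demo_bitmap_shift_right() is a hypothetical
    user-space function, not the kernel's implementation, and it assumes
    dst and src are distinct arrays of nwords words.

```c
#include <limits.h>
#include <stddef.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Shifts a multi-word bitmap right by 'shift' bits (word 0 holds the
 * least significant bits). When shift is a multiple of the word size,
 * rem is 0 and the "upper" contribution is skipped, so no value is ever
 * shifted by the full word width.
 */
void demo_bitmap_shift_right(unsigned long *dst, const unsigned long *src,
                             size_t shift, size_t nwords)
{
    size_t off = shift / DEMO_BITS_PER_LONG;
    size_t rem = shift % DEMO_BITS_PER_LONG;
    size_t k;

    for (k = 0; k < nwords; k++) {
        unsigned long lower = 0;
        unsigned long upper = 0;

        if (k + off < nwords)
            lower = src[k + off] >> rem;
        if (rem != 0 && k + off + 1 < nwords)
            upper = src[k + off + 1] << (DEMO_BITS_PER_LONG - rem);
        dst[k] = lower | upper;
    }
}
```

    Shifting by exactly DEMO_BITS_PER_LONG then reduces to moving whole
    words, which is the case that previously triggered the undefined
    behavior.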

    Coverity id: 1192175
    Coverity id: 1192174
    Signed-off-by: Jan Kara
    Cc: Rasmus Villemoes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Fix register value in bq32000 trickle charging.

    Mike reported that I'm using wrong value in one trickle-charging case,
    and after checking docs, I must admit he's right.

    Signed-off-by: Pavel Machek
    Reported-by: Mike Bremford
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Machek
     
  • When hot adding the same memory after hot removal, the following
    messages are shown:

    WARNING: CPU: 20 PID: 6 at mm/page_alloc.c:4968 free_area_init_node+0x3fe/0x426()
    ...
    Call Trace:
    dump_stack+0x46/0x58
    warn_slowpath_common+0x81/0xa0
    warn_slowpath_null+0x1a/0x20
    free_area_init_node+0x3fe/0x426
    hotadd_new_pgdat+0x90/0x110
    add_memory+0xd4/0x200
    acpi_memory_device_add+0x1aa/0x289
    acpi_bus_attach+0xfd/0x204
    acpi_bus_attach+0x178/0x204
    acpi_bus_scan+0x6a/0x90
    acpi_device_hotplug+0xe8/0x418
    acpi_hotplug_work_fn+0x1f/0x2b
    process_one_work+0x14e/0x3f0
    worker_thread+0x11b/0x510
    kthread+0xe1/0x100
    ret_from_fork+0x7c/0xb0

    The detailed explanation is as follows:

    When hot removing memory, pgdat is set to 0 in try_offline_node(). But
    if the pgdat is allocated by bootmem allocator, the clearing step is
    skipped.

    And when hot adding the same memory, the uninitialized pgdat is reused.
    But free_area_init_node() checks whether pgdat is set to zero. As a
    result, free_area_init_node() hits WARN_ON().

    This patch clears pgdat which is allocated by bootmem allocator in
    try_offline_node().

    Signed-off-by: Yasuaki Ishimatsu
    Cc: Zhang Zhen
    Cc: Wang Nan
    Cc: Tang Chen
    Reviewed-by: Toshi Kani
    Cc: Dave Hansen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yasuaki Ishimatsu
     
  • Fix unconditional initialization failure on non-exynos3250 SoCs.

    Commit df9e26d093d3 ("rtc: s3c: add support for RTC of Exynos3250 SoC")
    introduced rtc source clock support, but also added an initialization
    failure on SoCs that don't need such a clock.

    Signed-off-by: Marek Szyprowski
    Reviewed-by: Chanwoo Choi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marek Szyprowski
     
  • Found this in the message log on a s390 system:

    BUG kmalloc-192 (Not tainted): Poison overwritten
    Disabling lock debugging due to kernel taint
    INFO: 0x00000000684761f4-0x00000000684761f7. First byte 0xff instead of 0x6b
    INFO: Allocated in call_usermodehelper_setup+0x70/0x128 age=71 cpu=2 pid=648
    __slab_alloc.isra.47.constprop.56+0x5f6/0x658
    kmem_cache_alloc_trace+0x106/0x408
    call_usermodehelper_setup+0x70/0x128
    call_usermodehelper+0x62/0x90
    cgroup_release_agent+0x178/0x1c0
    process_one_work+0x36e/0x680
    worker_thread+0x2f0/0x4f8
    kthread+0x10a/0x120
    kernel_thread_starter+0x6/0xc
    kernel_thread_starter+0x0/0xc
    INFO: Freed in call_usermodehelper_exec+0x110/0x1b8 age=71 cpu=2 pid=648
    __slab_free+0x94/0x560
    kfree+0x364/0x3e0
    call_usermodehelper_exec+0x110/0x1b8
    cgroup_release_agent+0x178/0x1c0
    process_one_work+0x36e/0x680
    worker_thread+0x2f0/0x4f8
    kthread+0x10a/0x120
    kernel_thread_starter+0x6/0xc
    kernel_thread_starter+0x0/0xc

    There is a use-after-free bug on the subprocess_info structure allocated
    by the user mode helper. If do_execve() returns with an error,
    ____call_usermodehelper() stores the error code to sub_info->retval, but
    sub_info may already have been freed.

    Regarding UMH_NO_WAIT, the sub_info structure can be freed by
    __call_usermodehelper() before the worker thread returns from
    do_execve(), allowing memory corruption when do_execve() failed after
    exec_mmap() is called.

    Regarding UMH_WAIT_EXEC, the call to umh_complete() allows
    call_usermodehelper_exec() to continue which then frees sub_info.

    To fix this race the code needs to make sure that the call to
    call_usermodehelper_freeinfo() is always done after the last store to
    sub_info->retval.

    Signed-off-by: Martin Schwidefsky
    Reviewed-by: Oleg Nesterov
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     
    Add support for the RTC device inside the PM8941 PMIC. The RTC in this
    PMIC has two register spaces, so rtc-pm8xxx is slightly reworked to
    reflect these differences.

    The register set for each PMIC chip is selected based on the DT
    compatible string.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: simplify and fix locking in pm8xxx_rtc_set_time()]
    Signed-off-by: Stanimir Varbanov
    Cc: Alessandro Zummo
    Cc: Stephen Boyd
    Cc: Josh Cartwright
    Cc: Stanimir Varbanov
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanimir Varbanov
     
  • If an anonymous mapping is not allowed to fault thp memory and then
    madvise(MADV_HUGEPAGE) is used after fault, khugepaged will never
    collapse this memory into thp memory.

    This occurs because the madvise(2) handler for thp, hugepage_madvise(),
    clears VM_NOHUGEPAGE on the stack and it isn't stored in vma->vm_flags
    until the final action of madvise_behavior(). This causes the
    khugepaged_enter_vma_merge() to be a no-op in hugepage_madvise() when
    the vma had previously had VM_NOHUGEPAGE set.

    Fix this by passing the correct vma flags to the khugepaged mm slot
    handler. There's no chance khugepaged can run on this vma until after
    madvise_behavior() returns since we hold mm->mmap_sem.

    It would be possible to clear VM_NOHUGEPAGE directly from vma->vm_flags
    in hugepage_madvise(), but I didn't want to introduce special case
    behavior into madvise_behavior(). I think it's best to just let it
    always set vma->vm_flags itself.
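
    The ordering problem described above can be shown with a small model
    in which the new flags live in a local copy until the very end. This
    is a sketch under stated assumptions: the DEMO_VM_* bit values,
    struct demo_vma and both functions are hypothetical, and the
    pass_new_flags switch exists only to compare the two behaviors.

```c
#include <stdbool.h>

/* Assumed flag bits for the demo; not the kernel's values. */
#define DEMO_VM_HUGEPAGE   0x1UL
#define DEMO_VM_NOHUGEPAGE 0x2UL

struct demo_vma {
    unsigned long vm_flags;
};

/* Stand-in for the khugepaged entry check: refuses NOHUGEPAGE areas. */
static bool demo_khugepaged_would_enter(unsigned long vm_flags)
{
    return !(vm_flags & DEMO_VM_NOHUGEPAGE);
}

/*
 * Models madvise(MADV_HUGEPAGE): the updated flags are computed in a
 * local copy and only written back to the vma as the final action.
 * Returns whether the khugepaged check accepted the area.
 */
bool demo_madvise_hugepage(struct demo_vma *vma, bool pass_new_flags)
{
    unsigned long new_flags = vma->vm_flags;
    bool entered;

    new_flags &= ~DEMO_VM_NOHUGEPAGE;
    new_flags |= DEMO_VM_HUGEPAGE;

    /* Consulting vma->vm_flags here still sees the stale NOHUGEPAGE bit. */
    entered = demo_khugepaged_would_enter(pass_new_flags ? new_flags
                                                         : vma->vm_flags);

    vma->vm_flags = new_flags;      /* final action: store the flags */
    return entered;
}
```

    For an area that starts with DEMO_VM_NOHUGEPAGE set, the check is
    refused when it reads the stale vma flags and accepted when it is
    handed the freshly computed ones.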

    Signed-off-by: David Rientjes
    Reported-by: Suleiman Souhlal
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    A driver calling of_reserved_mem_device_init() might be interested in
    whether the initialization was successful, so add support for returning
    an error code.

    This fixes a build warning caused by commit 7bfa5ab6fa1b ("drivers:
    dma-coherent: add initialization from device tree"), which was
    merged without this change and without fixing the function return value.

    Fixes: 7bfa5ab6fa1b1 ("drivers: dma-coherent: add initialization from device tree")
    Signed-off-by: Marek Szyprowski
    Acked-by: Arnd Bergmann
    Cc: Michal Nazarewicz
    Cc: Grant Likely
    Cc: Laura Abbott
    Cc: Josh Cartwright
    Cc: Joonsoo Kim
    Cc: Kyungmin Park
    Cc: Russell King
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marek Szyprowski
     
    A compound page should be freed by put_page() or free_pages() with the
    correct order. Not doing so will cause tail pages to be leaked.

    The compound order can be obtained by compound_order() or, in our case,
    by using HPAGE_PMD_ORDER. Some people would argue the latter is
    faster but I prefer the former which is more general.

    This bug was observed not just on our servers (the worst case we saw is
    11G leaked on a 48G machine) but also on our workstations running an
    Ubuntu-based distro.

    $ cat /proc/vmstat | grep thp_zero_page_alloc
    thp_zero_page_alloc 55
    thp_zero_page_alloc_failed 0

    This means there is (thp_zero_page_alloc - 1) * (2M - 4K) memory leaked.
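
    The leak estimate above is easy to turn into a helper. This is a
    minimal sketch: demo_thp_zero_page_leak_bytes() is a hypothetical
    name, and the 2M huge page and 4K base page sizes are taken from the
    formula in the text rather than queried from the system.

```c
#include <stdint.h>

/* Sizes assumed from the formula above: 2M huge page, 4K base page. */
#define DEMO_HPAGE_BYTES (2ULL * 1024 * 1024)
#define DEMO_PAGE_BYTES  (4ULL * 1024)

/*
 * Estimates leaked bytes from the thp_zero_page_alloc counter:
 * every allocation except the current one leaked its tail pages.
 */
uint64_t demo_thp_zero_page_leak_bytes(uint64_t thp_zero_page_alloc)
{
    if (thp_zero_page_alloc == 0)
        return 0;
    return (thp_zero_page_alloc - 1) * (DEMO_HPAGE_BYTES - DEMO_PAGE_BYTES);
}
```

    For the counter value 55 shown above, this evaluates to 113025024
    bytes, or about 108 MiB.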

    Fixes: 97ae17497e99 ("thp: implement refcounting for huge zero page")
    Signed-off-by: Yu Zhao
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Bob Liu
    Cc: [3.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yu Zhao
     
  • Following up the arm testing of gcov, turns out gcov on ARM64 works fine
    as well. Only change needed is adding ARM64 to Kconfig depends.

    Tested with qemu and mach-virt

    Signed-off-by: Riku Voipio
    Acked-by: Peter Oberparleiter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Riku Voipio
     
  • During file system stress testing on 3.10 and 3.12 based kernels, the
    umount command occasionally hung in fsnotify_unmount_inodes in the
    section of code:

    spin_lock(&inode->i_lock);
    if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
            spin_unlock(&inode->i_lock);
            continue;
    }

    As this section of code holds the global inode_sb_list_lock, eventually
    the system hangs trying to acquire the lock.

    Multiple crash dumps showed inode->i_state == 0x60 and i_count == 0,
    with i_sb_list pointing back at itself. As this is not the value of
    list upon entry to the function, the kernel never exits the loop.

    To help narrow down the problem, the call to list_del_init in
    inode_sb_list_del was changed to list_del. This poisons the pointers in
    the i_sb_list and causes the kernel to panic if it traverses a freed
    inode.

    Subsequent stress testing panicked in fsnotify_unmount_inodes at the
    bottom of the list_for_each_entry_safe loop, showing that next_i had
    been freed.

    We believe the root cause of the problem is that next_i is being freed
    during the window of time that the list_for_each_entry_safe loop
    temporarily releases inode_sb_list_lock to call fsnotify and
    fsnotify_inode_delete.

    The code in fsnotify_unmount_inodes attempts to prevent the freeing of
    inode and next_i by calling __iget. However, the code doesn't do the
    __iget call on next_i in two cases:

    if i_count == 0 or
    if i_state & (I_FREEING | I_WILL_FREE)

    The patch addresses this issue by advancing next_i in the above two cases
    until we either find a next_i which we can __iget or we reach the end of
    the list. This makes the handling of next_i more closely match the
    handling of the variable "inode."
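
    The skip-ahead described above can be sketched in userspace as follows.
    This is a simplified model, not the kernel patch: the sk_ types, flags
    and functions are hypothetical stand-ins, the list is singly linked,
    and no locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the I_FREEING and I_WILL_FREE state bits. */
#define SK_FREEING   0x1u
#define SK_WILL_FREE 0x2u

struct sk_inode {
	unsigned int count;     /* stands in for i_count */
	unsigned int state;     /* stands in for i_state */
	struct sk_inode *next;  /* NULL marks the end of the list */
};

/* An entry can be pinned only if it is referenced and not being freed. */
int sk_can_pin(const struct sk_inode *inode)
{
	return inode->count > 0 &&
	       !(inode->state & (SK_FREEING | SK_WILL_FREE));
}

/*
 * Starting from the entry after cur, skip entries that cannot be pinned.
 * Returns the first pinnable entry with its count raised, or NULL when
 * the end of the list is reached.
 */
struct sk_inode *sk_pin_next(struct sk_inode *cur)
{
	struct sk_inode *next;

	assert(cur != NULL);
	next = cur->next;
	while (next && !sk_can_pin(next))
		next = next->next;
	if (next)
		next->count++;  /* plays the role of __iget() in this model */
	return next;
}
```

    With a list a -> b -> c -> d where b has a zero count and c is marked
    SK_FREEING, sk_pin_next(&a) skips b and c and returns d with its count
    raised by one.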

    The time to reproduce the hang is highly variable (from hours to days). We
    ran the stress test on a 3.10 kernel with the proposed patch for a week
    without failure.

    During list_for_each_entry_safe, next_i can be freed, causing
    the loop to never terminate. Advance next_i in those cases where
    __iget is not done.

    Signed-off-by: Jerry Hoemann
    Cc: Jeff Kirsher
    Cc: Ken Helias
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerry Hoemann
     
  • Commit edc2ca612496 ("mm, compaction: move pageblock checks up from
    isolate_migratepages_range()") commonizes the isolate_migratepages
    variants and makes them use isolate_migratepages_block().

    isolate_migratepages_block() can stop early once enough pages are
    isolated, but isolate_migratepages_range() has no code to handle this
    case. As a result, even if isolate_migratepages_block() returns
    prematurely without checking all pages in the range,
    isolate_migratepages_block() is called again on the following
    pageblock, and some pages in the previous range are never checked.
    CMA then fails frequently because of this.

    To fix this problem, this patch lets isolate_migratepages_range() know
    when enough pages have been isolated and stops the isolation in that
    case.

    Note that isolate_migratepages() has no such problem, because it always
    stops the isolation after a single call to isolate_migratepages_block().
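
    The control flow of the fix can be sketched with a toy model. This is
    not the kernel code: every page is treated as isolatable, pfns are
    plain integers, and the sk_ names, the pageblock size and the batch
    threshold are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGEBLOCK_PAGES 4  /* tiny pageblock, for illustration only */
#define SK_ISOLATE_BATCH   3  /* hypothetical "enough pages" threshold */

/*
 * Scan [pfn, end) and count each page in *nr_isolated. Stops early once
 * the batch is full and returns the first pfn that was not scanned.
 */
size_t sk_isolate_block(size_t pfn, size_t end, size_t *nr_isolated)
{
	assert(nr_isolated != NULL);
	while (pfn < end) {
		pfn++;
		if (++*nr_isolated >= SK_ISOLATE_BATCH)
			break;
	}
	return pfn;
}

/*
 * Walk [start, end) one pageblock at a time and return the first pfn
 * that was not scanned.
 */
size_t sk_isolate_range(size_t start, size_t end, size_t *nr_isolated)
{
	size_t pfn = start;

	while (pfn < end) {
		size_t block_end = pfn - pfn % SK_PAGEBLOCK_PAGES +
				   SK_PAGEBLOCK_PAGES;

		if (block_end > end)
			block_end = end;
		pfn = sk_isolate_block(pfn, block_end, nr_isolated);
		/* Stop the whole range walk once enough pages are isolated. */
		if (*nr_isolated >= SK_ISOLATE_BATCH)
			break;
	}
	return pfn;
}
```

    In this model the caller learns exactly where scanning stopped, so a
    later call can resume from the returned pfn instead of skipping the
    rest of the pageblock.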

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Minchan Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Commit ff7ee93f4715 ("cgroup/kmemleak: Annotate alloc_page() for cgroup
    allocations") introduces kmemleak_alloc() for alloc_page_cgroup(), but
    the corresponding kmemleak_free() is missing, which causes kmemleak to
    be wrongly disabled after memory offlining. The log is pasted at the
    end of this commit message.

    This patch adds kmemleak_free() to free_page_cgroup(). During page
    offlining, this patch removes the corresponding entries from the
    kmemleak rbtree. After that, the freed memory can be allocated again by
    other subsystems without killing kmemleak.
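
    Why the missing free matters can be illustrated with a toy tracker. It
    is a sketch under loose assumptions, not kmemleak itself: a flat array
    replaces the rbtree, and every sk_ name is hypothetical. The one
    behaviour it copies from the log below is that an overlapping insert
    disables the tracker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_OBJECTS 16

struct sk_tracker {
	uintptr_t start[SK_MAX_OBJECTS];
	size_t size[SK_MAX_OBJECTS];
	size_t nr;
	int disabled;  /* set once an overlapping insert is seen */
};

/* Record [addr, addr + size). An overlap with a live object disables us. */
void sk_track_alloc(struct sk_tracker *t, uintptr_t addr, size_t size)
{
	if (t->disabled)
		return;
	for (size_t i = 0; i < t->nr; i++) {
		if (addr < t->start[i] + t->size[i] &&
		    t->start[i] < addr + size) {
			t->disabled = 1;
			return;
		}
	}
	assert(t->nr < SK_MAX_OBJECTS);
	t->start[t->nr] = addr;
	t->size[t->nr] = size;
	t->nr++;
}

/* Forget the object that starts at addr, if there is one. */
void sk_track_free(struct sk_tracker *t, uintptr_t addr)
{
	if (t->disabled)
		return;
	for (size_t i = 0; i < t->nr; i++) {
		if (t->start[i] == addr) {
			/* Move the last entry into the freed slot. */
			t->nr--;
			t->start[i] = t->start[t->nr];
			t->size[i] = t->size[t->nr];
			return;
		}
	}
}
```

    If a region is tracked, released without sk_track_free(), and part of
    it is then handed out again, the second sk_track_alloc() overlaps the
    stale entry and the tracker disables itself. Pairing each alloc with a
    free avoids that.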

    bash # for x in 1 2 3 4; do echo offline > /sys/devices/system/memory/memory$x/state ; sleep 1; done ; dmesg | grep leak

    Offlined Pages 32768
    kmemleak: Cannot insert 0xffff880016969000 into the object search tree (overlaps existing)
    CPU: 0 PID: 412 Comm: sleep Not tainted 3.17.0-rc5+ #86
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x46/0x58
    create_object+0x266/0x2c0
    kmemleak_alloc+0x26/0x50
    kmem_cache_alloc+0xd3/0x160
    __sigqueue_alloc+0x49/0xd0
    __send_signal+0xcb/0x410
    send_signal+0x45/0x90
    __group_send_sig_info+0x13/0x20
    do_notify_parent+0x1bb/0x260
    do_exit+0x767/0xa40
    do_group_exit+0x44/0xa0
    SyS_exit_group+0x17/0x20
    system_call_fastpath+0x16/0x1b

    kmemleak: Kernel memory leak detector disabled
    kmemleak: Object 0xffff880016900000 (size 524288):
    kmemleak: comm "swapper/0", pid 0, jiffies 4294667296
    kmemleak: min_count = 0
    kmemleak: count = 0
    kmemleak: flags = 0x1
    kmemleak: checksum = 0
    kmemleak: backtrace:
    log_early+0x63/0x77
    kmemleak_alloc+0x4b/0x50
    init_section_page_cgroup+0x7f/0xf5
    page_cgroup_init+0xc5/0xd0
    start_kernel+0x333/0x408
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0xf5/0xfc

    Fixes: ff7ee93f4715 (cgroup/kmemleak: Annotate alloc_page() for cgroup allocations)
    Signed-off-by: Wang Nan
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Steven Rostedt
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Nan
     
  • Pull block layer fixes from Jens Axboe:
    "A small collection of fixes for the current kernel. This contains:

    - Two error handling fixes from Jan Kara. One for null_blk on
    failure to add a device, and the other for the block/scsi_ioctl
    SCSI_IOCTL_SEND_COMMAND fixing up the error jump point.

    - A commit added in the merge window for the bio integrity bits
    unfortunately disabled merging for all requests if
    CONFIG_BLK_DEV_INTEGRITY wasn't set. Reverse the logic, so that
    integrity checking won't disallow merges when not enabled.

    - A fix from Ming Lei for merging and generating too many segments.
    This caused a BUG in virtio_blk.

    - Two error handling printk() fixups from Robert Elliott, improving
    the information given when we rate limit.

    - Error handling fixup on elevator_init() failure from Sudip
    Mukherjee.

    - A fix from Tony Battersby, fixing up a memory leak in the
    scatterlist handling with scsi-mq"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: Fix merge logic when CONFIG_BLK_DEV_INTEGRITY is not defined
    lib/scatterlist: fix memory leak with scsi-mq
    block: fix wrong error return in elevator_init()
    scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND
    null_blk: Cleanup error recovery in null_add_dev()
    blk-merge: recaculate segment if it isn't less than max segments
    fs: clarify rate limit suppressed buffer I/O errors
    fs: merge I/O error prints into one line

    Linus Torvalds
     
  • Pull HID fixes from Jiri Kosina:
    - workarounds for a couple of misbehaving Elan Touchscreens, by Adel
    Gadllah
    - fix for TransducerSerialNumber field implementation, by Jason Gerecke
    - a couple of new HID usages (added by HUT), by Olivier Gay

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
    HID: input: Fix TransducerSerialNumber implementation
    HID: add keyboard input assist hid usages
    HID: usbhid: enable always-poll quirk for Elan Touchscreen 016f
    HID: usbhid: enable always-poll quirk for Elan Touchscreen 009b

    Linus Torvalds