30 Dec, 2013

1 commit

  • Pull ACPI and power management fixes and new device IDs from Rafael Wysocki:

    - Fix for a cpufreq regression causing stale sysfs files to be left
    behind during system resume if cpufreq_add_dev() fails for one or
    more CPUs from Viresh Kumar.

    - Fix for a bug in cpufreq causing CONFIG_CPU_FREQ_DEFAULT_* to be
    ignored when the intel_pstate driver is used from Jason Baron.

    - System suspend fix for a memory leak in pm_vt_switch_unregister()
    that forgot to release objects after removing them from
    pm_vt_switch_list. From Masami Ichikawa.

    - Intel Valley View device ID and energy unit encoding update for the
    (recently added) Intel RAPL (Running Average Power Limit) driver from
    Jacob Pan.

    - Intel Bay Trail SoC GPIO and ACPI device IDs for the Low Power
    Subsystem (LPSS) ACPI driver from Paul Drews.

    * tag 'pm+acpi-3.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    powercap / RAPL: add support for ValleyView Soc
    PM / sleep: Fix memory leak in pm_vt_switch_unregister().
    cpufreq: Use CONFIG_CPU_FREQ_DEFAULT_* to set initial policy for setpolicy drivers
    cpufreq: remove sysfs files for CPUs which failed to come back after resume
    ACPI: Add BayTrail SoC GPIO and LPSS ACPI IDs

    Linus Torvalds
     

27 Dec, 2013

1 commit


25 Dec, 2013

2 commits

  • Pull cgroup fixes from Tejun Heo:
    "Two fixes. One fixes a bug in the error path of cgroup_create(). The
    other changes cgrp->id lifetime rule so that the id doesn't get
    recycled before all controller states are destroyed. This premature
    id recycling made memcg malfunction"

    * 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: don't recycle cgroup id until all csses' have been destroyed
    cgroup: fix cgroup_create() error handling path

    Linus Torvalds
     
  • Pull libata fixes from Tejun Heo:
    "There's one interesting commit - "libata, freezer: avoid block device
    removal while system is frozen". It's an ugly hack working around a
    deadlock condition between driver core resume and block layer device
    removal paths through freezer which was made more reproducible by
    writeback being converted to workqueue some releases ago. The bug has
    nothing to do with libata but it's just a workaround which is easy to
    backport. After discussion, Rafael and I seem to agree that we don't
    really need kernel freezables - both kthread and workqueue. There are
    a few specific workqueues which constitute PM operations and require
    freezing, which will be converted to use workqueue_set_max_active()
    instead. All other kernel freezer uses are planned to be removed,
    followed by the removal of kthread and workqueue freezer support,
    hopefully.

    Others are device-specific fixes. The most notable is the addition of
    NO_NCQ_TRIM which is used to disable queued TRIM commands to Micron
    M500 SSDs which otherwise suffer data corruption"

    * 'for-3.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
    libata, freezer: avoid block device removal while system is frozen
    libata: implement ATA_HORKAGE_NO_NCQ_TRIM and apply it to Micro M500 SSDs
    libata: disable a disk via libata.force params
    ahci: bail out on ICH6 before using AHCI BAR
    ahci: imx: Explicitly clear IMX6Q_GPR13_SATA_MPLL_CLK_EN
    libata: add ATA_HORKAGE_BROKEN_FPDMA_AA quirk for Seagate Momentus SpinPoint M8

    Linus Torvalds
     

22 Dec, 2013

1 commit

  • kmemleak reported a memory leak as below.

    unreferenced object 0xffff880118f14700 (size 32):
    comm "swapper/0", pid 1, jiffies 4294877401 (age 123.283s)
    hex dump (first 32 bytes):
    00 01 10 00 00 00 ad de 00 02 20 00 00 00 ad de .......... .....
    00 d4 d2 18 01 88 ff ff 01 00 00 00 00 04 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x4e/0xb0
    [] kmem_cache_alloc_trace+0x1ec/0x260
    [] pm_vt_switch_required+0x76/0xb0
    [] register_framebuffer+0x195/0x320
    [] efifb_probe+0x718/0x780
    [] platform_drv_probe+0x45/0xb0
    [] driver_probe_device+0x87/0x3a0
    [] __driver_attach+0x93/0xa0
    [] bus_for_each_dev+0x63/0xa0
    [] driver_attach+0x1e/0x20
    [] bus_add_driver+0x180/0x250
    [] driver_register+0x64/0xf0
    [] __platform_driver_register+0x4a/0x50
    [] efifb_driver_init+0x12/0x14
    [] do_one_initcall+0xfa/0x1b0
    [] kernel_init_freeable+0x17b/0x201

    In pm_vt_switch_required(), the "entry" object is allocated via kmalloc().
    So pm_vt_switch_unregister() needs to call kfree() when the object
    is deleted from the list.

    Signed-off-by: Masami Ichikawa
    Reviewed-by: Pavel Machek
    Signed-off-by: Rafael J. Wysocki

    Masami Ichikawa
     

21 Dec, 2013

2 commits

  • In struct page we have enough space to fit a long-size page->ptl there,
    but we use a dynamically-allocated page->ptl if sizeof(spinlock_t) is
    larger than sizeof(int).

    It hurts 64-bit architectures with CONFIG_GENERIC_LOCKBREAK, where
    sizeof(spinlock_t) == 8, but it easily fits into struct page.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
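
    The sizing rule described in the commit message above can be sketched
    as a toy model. This is not kernel code: the function name and the
    byte sizes (4-byte int, 8-byte long, as on a typical LP64 build) are
    assumptions made for illustration only.

```python
# Toy model of the sizing rule; the byte sizes below are assumptions
# for a typical 64-bit (LP64) build, not values read from a kernel.
SIZEOF_INT = 4
SIZEOF_LONG = 8


def needs_dynamic_ptl(spinlock_size: int, embed_limit: int) -> bool:
    """Return True when the lock is too large to embed in struct page."""
    return spinlock_size > embed_limit


if __name__ == "__main__":
    lockbreak_spinlock = 8  # sizeof(spinlock_t) with CONFIG_GENERIC_LOCKBREAK
    # Old rule: compare against sizeof(int), so the 8-byte lock is allocated.
    print("old rule allocates:", needs_dynamic_ptl(lockbreak_spinlock, SIZEOF_INT))
    # New rule: compare against sizeof(long), so the 8-byte lock is embedded.
    print("new rule allocates:", needs_dynamic_ptl(lockbreak_spinlock, SIZEOF_LONG))
```

    With the limit raised from sizeof(int) to sizeof(long), the 8-byte
    lock no longer needs a separate allocation in this model.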
     
  • …it/rostedt/linux-trace

    Pull ftrace fix from Steven Rostedt:
    "This fixes a long standing bug in the ftrace profiler. The problem is
    that the profiler only initializes the online CPUs, and not possible
    CPUs. This causes issues if the user takes CPUs online or offline
    while the profiler is running.

    If we online a CPU after starting the profiler, we lose all the trace
    information on the CPU going online.

    If we offline a CPU after running a test and start a new test, it will
    not clear the old data from that CPU.

    This bug causes incorrect data to be reported to the user if they
    online or offline CPUs during the profiling"

    * tag 'trace-fixes-v3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ftrace: Initialize the ftrace profiler for each possible cpu

    Linus Torvalds
     

20 Dec, 2013

3 commits

  • Freezable kthreads and workqueues are fundamentally problematic in
    that they effectively introduce a big kernel lock widely used in the
    kernel and have already been the culprit of several deadlock
    scenarios. This is the latest occurrence.

    During resume, libata rescans all the ports and revalidates all
    pre-existing devices. If it determines that a device has gone
    missing, the device is removed from the system which involves
    invalidating block device and flushing bdi while holding driver core
    layer locks. Unfortunately, this can race with the rest of device
    resume. Because freezable kthreads and workqueues are thawed after
    device resume is complete and block device removal depends on
    freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
    progress, this can lead to deadlock - block device removal can't
    proceed because kthreads are frozen and kthreads can't be thawed
    because device resume is blocked behind block device removal.

    839a8e8660b6 ("writeback: replace custom worker pool implementation
    with unbound workqueue") made this particular deadlock scenario more
    visible but the underlying problem has always been there - the
    original forker task and jbd2 are freezable too. In fact, this is
    highly likely just one of many possible deadlock scenarios given that
    freezer behaves as a big kernel lock and we don't have any debug
    mechanism around it.

    I believe the right thing to do is getting rid of freezable kthreads
    and workqueues. This is something fundamentally broken. For now,
    implement a funny workaround in libata - just avoid doing block device
    hot[un]plug while the system is frozen. Kernel engineering at its
    finest. :(

    v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
    as a module.

    v3: Comment updated and polling interval changed to 10ms as suggested
    by Rafael.

    v4: Add #ifdef CONFIG_FREEZER around the hack as pm_freezing is not
    defined when FREEZER is not configured thus breaking build.
    Reported by kbuild test robot.

    Signed-off-by: Tejun Heo
    Reported-by: Tomaž Šolc
    Reviewed-by: "Rafael J. Wysocki"
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
    Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
    Cc: Greg Kroah-Hartman
    Cc: Len Brown
    Cc: Oleg Nesterov
    Cc: stable@vger.kernel.org
    Cc: kbuild test robot

    Tejun Heo
     
  • Pull scheduler fixes from Ingo Molnar:
    "An RT group-scheduling fix and the sched-domains topology setup fix
    from Mel"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities
    sched: Assign correct scheduling domain to 'sd_llc'

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "An ABI documentation fix, and a mixed-PMU perf-info-corruption fix"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf: Document the new transaction sample type
    perf: Disable all pmus on unthrottling and rescheduling

    Linus Torvalds
     

19 Dec, 2013

5 commits

  • Merge patches from Andrew Morton:
    "23 fixes and a MAINTAINERS update"

    * emailed patches from Andrew Morton : (24 commits)
    mm/hugetlb: check for pte NULL pointer in __page_check_address()
    fix build with make 3.80
    mm/mempolicy: fix !vma in new_vma_page()
    MAINTAINERS: add Davidlohr as GPT maintainer
    mm/memory-failure.c: recheck PageHuge() after hugetlb page migrate successfully
    mm/compaction: respect ignore_skip_hint in update_pageblock_skip
    mm/mempolicy: correct putback method for isolate pages if failed
    mm: add missing dependency in Kconfig
    sh: always link in helper functions extracted from libgcc
    mm: page_alloc: exclude unreclaimable allocations from zone fairness policy
    mm: numa: defer TLB flush for THP migration as long as possible
    mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates
    mm: fix TLB flush race between migration, and change_protection_range
    mm: numa: avoid unnecessary disruption of NUMA hinting during migration
    mm: numa: clear numa hinting information on mprotect
    sched: numa: skip inaccessible VMAs
    mm: numa: avoid unnecessary work on the failure path
    mm: numa: ensure anon_vma is locked to prevent parallel THP splits
    mm: numa: do not clear PTE for pte_numa update
    mm: numa: do not clear PMD during PTE update scan
    ...

    Linus Torvalds
     
  • There are a few subtle races, between change_protection_range (used by
    mprotect and change_prot_numa) on one side, and NUMA page migration and
    compaction on the other side.

    The basic race is that there is a time window between when the PTE gets
    made non-present (PROT_NONE or NUMA), and the TLB is flushed.

    During that time, a CPU may continue writing to the page.

    This is fine most of the time, however compaction or the NUMA migration
    code may come in, and migrate the page away.

    When that happens, the CPU may continue writing, through the cached
    translation, to what is no longer the current memory location of the
    process.

    This only affects x86, which has a somewhat optimistic pte_accessible.
    All other architectures appear to be safe, and will either always flush,
    or flush whenever there is a valid mapping, even with no permissions
    (SPARC).

    The basic race looks like this:

    CPU A                   CPU B                       CPU C

                                                        load TLB entry
    make entry PTE/PMD_NUMA
                            fault on entry
                                                        read/write old page
                            start migrating page
                            change PTE/PMD to new page
                                                        read/write old page [*]
    flush TLB
                                                        reload TLB from new entry
                                                        read/write new page
                                                        lose data

    [*] the old page may belong to a new user at this point!

    The obvious fix is to flush remote TLB entries, by making sure that
    pte_accessible is aware of the fact that PROT_NONE and PROT_NUMA memory
    may still be accessible if there is a TLB flush pending for the mm.

    This should fix both NUMA migration and compaction.

    [mgorman@suse.de: fix build]
    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
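
    The pte_accessible change described above can be sketched as a toy
    predicate. The ToyPte class and both function names are hypothetical,
    and the "before" variant (present bit only) is an assumption about
    what "somewhat optimistic" means here; the "after" variant follows the
    wording of the commit message.

```python
from dataclasses import dataclass


@dataclass
class ToyPte:
    """Hypothetical stand-in for a page table entry."""
    present: bool
    protnone_or_numa: bool  # entry was made PROT_NONE or NUMA-hinting


def pte_accessible_before(pte: ToyPte) -> bool:
    # Assumed optimistic rule: only present entries can be cached in a TLB.
    return pte.present


def pte_accessible_after(pte: ToyPte, tlb_flush_pending: bool) -> bool:
    if pte.present:
        return True
    # A non-present PROT_NONE/NUMA entry may still be cached by another CPU
    # until the pending TLB flush for this mm has completed.
    return pte.protnone_or_numa and tlb_flush_pending


if __name__ == "__main__":
    numa_pte = ToyPte(present=False, protnone_or_numa=True)
    print("before:", pte_accessible_before(numa_pte))
    print("after, flush pending:", pte_accessible_after(numa_pte, True))
    print("after, flush done:", pte_accessible_after(numa_pte, False))
```

    In this model the migration code would see the entry as still
    accessible while a flush is pending, and so would flush remote TLBs
    before moving the page.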
     
  • Inaccessible VMA should not be trapping NUMA hint faults. Skip them.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 1b3a5d02ee07 ("reboot: move arch/x86 reboot= handling to generic
    kernel") moved reboot= handling to generic code. In the process it also
    removed the code in native_machine_shutdown() which moved the reboot
    process to reboot_cpu/cpu0.

    I guess the thought must have been that all reboot paths call
    migrate_to_reboot_cpu(), so we don't need this special handling. But
    the kexec reboot path (kernel_kexec()) does not call
    migrate_to_reboot_cpu(), so the above change broke kexec. Now reboot can
    happen on a non-boot cpu, and when INIT is sent in the second kernel to
    bring up the BP, it brings down the machine.

    So start calling migrate_to_reboot_cpu() in kexec reboot path to avoid
    this problem.

    Bisected by WANG Chao.

    Reported-by: Matthew Whitehead
    Reported-by: Dave Young
    Signed-off-by: Vivek Goyal
    Tested-by: Baoquan He
    Tested-by: WANG Chao
    Acked-by: H. Peter Anvin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • Pull crypto key patches from David Howells:
    "There are four items:

    - A patch to fix X.509 certificate gathering. The problem was that I
    was coming up with a different path for signing_key.x509 in the
    build directory if it didn't exist to if it did exist. This meant
    that the X.509 cert container object file would be rebuilt on the
    second rebuild in a build directory and the kernel would get
    relinked.

    - Unconditionally remove files generated by SYSTEM_TRUSTED_KEYRING=y
    when doing make mrproper.

    - Actually initialise the persistent-keyring semaphore for
    init_user_ns. I have no idea why this works at all for users in
    the base user namespace unless it's something to do with systemd
    containerising the system.

    - Documentation for module signing"

    * 'keys-devel' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    Add Documentation/module-signing.txt file
    KEYS: fix uninitialized persistent_keyring_register_sem
    KEYS: Remove files generated when SYSTEM_TRUSTED_KEYRING=y
    X.509: Fix certificate gathering

    Linus Torvalds
     

18 Dec, 2013

1 commit


17 Dec, 2013

4 commits

  • This patch touches the RT group scheduling case.

    Functions inc_rt_prio_smp() and dec_rt_prio_smp() change the (global)
    rq's priority, while the rt_rq passed to them may not be the top-level
    rt_rq. This is wrong, because a priority change on a child level does
    not guarantee that the priority is the highest all over the rq. So, this
    leak makes RT balancing unusable.

    A short example: the task having the highest priority among all of the
    rq's RT tasks (no other task has the same priority) is waking on a
    throttled rt_rq. The rq's cpupri is set to the task's priority
    equivalent, but the real rq->rt.highest_prio.curr is lower.

    The patch below fixes the problem.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra
    CC: Steven Rostedt
    CC: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/49231385567953@web4m.yandex.ru
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • Commit 42eb088e (sched: Avoid NULL dereference on sd_busy) corrected a NULL
    dereference on sd_busy but the fix also altered what scheduling domain it
    used for the 'sd_llc' percpu variable.

    One impact of this is that a task selecting a runqueue may consider
    idle CPUs that are not cache siblings as candidates for running.
    Tasks are then running on CPUs that are not cache hot.

    This was found through bisection where ebizzy threads were not seeing equal
    performance and it looked like a scheduling fairness issue. This patch
    mitigates but does not completely fix the problem on all machines tested
    implying there may be an additional bug or a common root cause. Here are
    the average range of performance seen by individual ebizzy threads. It
    was tested on top of candidate patches related to x86 TLB range flushing.

    4-core machine
                         3.13.0-rc3           3.13.0-rc3
                            vanilla           fixsd-v3r3
    Mean  1     0.00 (   0.00%)     0.00 (    0.00%)
    Mean  2     0.34 (   0.00%)     0.10 (   70.59%)
    Mean  3     1.29 (   0.00%)     0.93 (   27.91%)
    Mean  4     7.08 (   0.00%)     0.77 (   89.12%)
    Mean  5   193.54 (   0.00%)     2.14 (   98.89%)
    Mean  6   151.12 (   0.00%)     2.06 (   98.64%)
    Mean  7   115.38 (   0.00%)     2.04 (   98.23%)
    Mean  8   108.65 (   0.00%)     1.92 (   98.23%)

    8-core machine
    Mean  1     0.00 (   0.00%)     0.00 (    0.00%)
    Mean  2     0.40 (   0.00%)     0.21 (   47.50%)
    Mean  3    23.73 (   0.00%)     0.89 (   96.25%)
    Mean  4    12.79 (   0.00%)     1.04 (   91.87%)
    Mean  5    13.08 (   0.00%)     2.42 (   81.50%)
    Mean  6    23.21 (   0.00%)    69.46 ( -199.27%)
    Mean  7    15.85 (   0.00%)   101.72 ( -541.77%)
    Mean  8   109.37 (   0.00%)    19.13 (   82.51%)
    Mean 12   124.84 (   0.00%)    28.62 (   77.07%)
    Mean 16   113.50 (   0.00%)    24.16 (   78.71%)

    It's eliminated for one machine and reduced for another.

    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra
    Cc: Alex Shi
    Cc: Andrew Morton
    Cc: Fengguang Wu
    Cc: H Peter Anvin
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20131217092124.GV11295@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
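
    The percentages in the ebizzy tables above appear to be the relative
    reduction of the per-thread range against the vanilla kernel. That
    reading is an inference from the numbers, not something the commit
    message states, and the function name below is hypothetical. A
    minimal sketch that recomputes a few rows under that assumption:

```python
def improvement_pct(vanilla: float, patched: float) -> float:
    """Relative reduction of the per-thread range, in percent.

    Positive means the patched kernel shows a smaller range than vanilla;
    negative means the range got larger.
    """
    if vanilla == 0:
        return 0.0  # the tables print 0.00% when the baseline is zero
    return round((vanilla - patched) / vanilla * 100, 2)


if __name__ == "__main__":
    # (threads, vanilla, patched) rows copied from the 8-core table
    rows = [(2, 0.40, 0.21), (6, 23.21, 69.46), (8, 109.37, 19.13)]
    for threads, vanilla, patched in rows:
        print(f"Mean {threads:<3} {improvement_pct(vanilla, patched):9.2f}%")
```

    The negative values for 6 and 7 threads on the 8-core machine are the
    rows where the range grew, which matches the note that the problem is
    reduced rather than eliminated on that machine.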
     
  • Currently, only one PMU in a context gets disabled during unthrottling
    and event_sched_{out,in}(), however, events in one context may belong to
    different pmus, which results in PMUs being reprogrammed while they are
    still enabled.

    This means that mixed PMU use [which is rare in itself] resulted in
    potentially completely unreliable results: corrupted events, bogus
    results, etc.

    This patch temporarily disables PMUs that correspond to
    each event in the context while these events are being modified.

    Signed-off-by: Alexander Shishkin
    Reviewed-by: Andi Kleen
    Signed-off-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Stephane Eranian
    Cc: Alexander Shishkin
    Link: http://lkml.kernel.org/r/1387196256-8030-1-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • Hugh reported this bug:

    > CONFIG_MEMCG_SWAP is broken in 3.13-rc. Try something like this:
    >
    > mkdir -p /tmp/tmpfs /tmp/memcg
    > mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
    > mount -t cgroup -o memory memcg /tmp/memcg
    > mkdir /tmp/memcg/old
    > echo 512M >/tmp/memcg/old/memory.limit_in_bytes
    > echo $$ >/tmp/memcg/old/tasks
    > cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
    > echo $$ >/tmp/memcg/tasks
    > rmdir /tmp/memcg/old
    > sleep 1 # let rmdir work complete
    > mkdir /tmp/memcg/new
    > umount /tmp/tmpfs
    > dmesg | grep WARNING
    > rmdir /tmp/memcg/new
    > umount /tmp/memcg
    >
    > Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
    > res_counter_uncharge_locked+0x1f/0x2f()
    >
    > Breakage comes from 34c00c319ce7 ("memcg: convert to use cgroup id").
    >
    > The lifetime of a cgroup id is different from the lifetime of the
    > css id it replaced: memsw's css_get()s do nothing to hold on to the
    > old cgroup id, it soon gets recycled to a new cgroup, which then
    > mysteriously inherits the old's swap, without any charge for it.

    Instead of removing cgroup id right after all the csses have been
    offlined, we should do that after csses have been destroyed.

    To make sure an invalid css pointer won't be returned after the css
    is destroyed, make sure css_from_id() returns NULL in this case.

    tj: Updated comment to note planned changes for cgrp->id.

    Reported-by: Hugh Dickins
    Signed-off-by: Li Zefan
    Reviewed-by: Michal Hocko
    Signed-off-by: Tejun Heo

    Li Zefan
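
    The id lifetime problem described above can be illustrated with a toy
    model. Nothing here is cgroup code: ToyIds is a hypothetical
    smallest-free-integer allocator, and the swap_owner dictionary stands
    in for swap records that remember which group was charged.

```python
class ToyIds:
    """Smallest-free-integer allocator, standing in for an id allocator."""

    def __init__(self):
        self.used = set()

    def alloc(self) -> int:
        i = 1
        while i in self.used:
            i += 1
        self.used.add(i)
        return i

    def free(self, i: int) -> None:
        self.used.discard(i)


def new_group_inherits_swap(free_id_at_offline: bool) -> bool:
    """Return True if a new group ends up owning the old group's swap."""
    ids = ToyIds()
    swap_owner = {}  # swap entry -> group id recorded at swap-out time

    old = ids.alloc()
    swap_owner["entry0"] = old  # the old group still has swapped-out pages

    # rmdir: the group goes offline, but the swap record still refers to it,
    # so its controller state has not been destroyed yet.
    if free_id_at_offline:
        ids.free(old)  # pre-fix behaviour: id released too early

    new = ids.alloc()  # mkdir of an unrelated group
    return swap_owner["entry0"] == new


if __name__ == "__main__":
    print("id freed at offline:", new_group_inherits_swap(True))
    print("id freed at destroy:", new_group_inherits_swap(False))
```

    Keeping the id allocated until the state is destroyed means the new
    group gets a different id in this model, so the stale swap record
    cannot be mistaken for one of its own.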
     

16 Dec, 2013

2 commits

  • Ftrace currently initializes only the online CPUs. This implementation has
    two problems:
    - If we online a CPU after we enable the function profile, and then run the
      test, we will lose the trace information on that CPU.
      Steps to reproduce:
      # echo 0 > /sys/devices/system/cpu/cpu1/online
      # cd /tracing/
      # echo >> set_ftrace_filter
      # echo 1 > function_profile_enabled
      # echo 1 > /sys/devices/system/cpu/cpu1/online
      # run test
    - If we offline a CPU before we enable the function profile, we will not
      clear the trace information for that CPU when we enable the function
      profile. Stale data then shows up in the output and confuses users.
      Steps to reproduce:
      # cd /tracing/
      # echo >> set_ftrace_filter
      # echo 1 > function_profile_enabled
      # run test
      # cat trace_stat/function*
      # echo 0 > /sys/devices/system/cpu/cpu1/online
      # echo 0 > function_profile_enabled
      # echo 1 > function_profile_enabled
      # cat trace_stat/function*
      # run test
      # cat trace_stat/function*

    So it is better that we initialize the ftrace profiler for each possible cpu
    every time we enable the function profile instead of just the online ones.

    Link: http://lkml.kernel.org/r/1387178401-10619-1-git-send-email-miaox@cn.fujitsu.com

    Cc: stable@vger.kernel.org # 2.6.31+
    Signed-off-by: Miao Xie
    Signed-off-by: Steven Rostedt

    Miao Xie
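
    Both failure modes listed above can be reproduced in a toy model of a
    per-CPU profiler. ToyProfiler and the two helper functions are
    hypothetical and only mirror the described behaviour: the buggy
    variant resets counters for online CPUs, the fixed variant for every
    possible CPU.

```python
class ToyProfiler:
    """Toy per-CPU hit counter (not the kernel's profiler)."""

    def __init__(self, possible_cpus, init_all_possible):
        self.possible = set(possible_cpus)
        self.init_all_possible = init_all_possible
        self.stats = {}  # cpu -> hit count; survives disable/enable cycles

    def enable(self, online_cpus):
        # Buggy variant resets only CPUs that are online right now;
        # fixed variant resets every CPU that could ever come online.
        targets = self.possible if self.init_all_possible else set(online_cpus)
        for cpu in targets:
            self.stats[cpu] = 0

    def hit(self, cpu):
        # A CPU that was never initialised has nowhere to record the hit.
        if cpu in self.stats:
            self.stats[cpu] += 1


def lost_after_online(init_all_possible):
    p = ToyProfiler([0, 1], init_all_possible)
    p.enable(online_cpus=[0])       # cpu1 is offline when profiling starts
    p.hit(1)                        # cpu1 came online and ran the test
    return p.stats.get(1, 0) == 0   # True means the hit was lost


def stale_after_offline(init_all_possible):
    p = ToyProfiler([0, 1], init_all_possible)
    p.enable(online_cpus=[0, 1])
    p.hit(1)
    p.enable(online_cpus=[0])       # cpu1 went offline, profiler re-enabled
    return p.stats.get(1, 0) != 0   # True means old data was not cleared


if __name__ == "__main__":
    for fixed in (False, True):
        print("init all possible:", fixed,
              "| lost:", lost_after_online(fixed),
              "| stale:", stale_after_offline(fixed))
```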
     
  • Pull PCI updates from Bjorn Helgaas:
    "PCI device hotplug
    - Move device_del() from pci_stop_dev() to pci_destroy_dev() (Rafael
    Wysocki)

    Host bridge drivers
    - Update maintainers for DesignWare, i.MX6, Armada, R-Car (Bjorn
    Helgaas)
    - mvebu: Return 'unsupported' for Interrupt Line and Interrupt Pin
    (Jason Gunthorpe)

    Miscellaneous
    - Avoid unnecessary CPU switch when calling .probe() (Alexander
    Duyck)
    - Revert "workqueue: allow work_on_cpu() to be called recursively"
    (Bjorn Helgaas)
    - Disable Bus Master only on kexec reboot (Khalid Aziz)
    - Omit PCI ID macro strings to shorten quirk names for LTO (Michal
    Marek)"

    * tag 'pci-v3.13-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
    MAINTAINERS: Add DesignWare, i.MX6, Armada, R-Car PCI host maintainers
    PCI: Disable Bus Master only on kexec reboot
    PCI: mvebu: Return 'unsupported' for Interrupt Line and Interrupt Pin
    PCI: Omit PCI ID macro strings to shorten quirk names
    PCI: Move device_del() from pci_stop_dev() to pci_destroy_dev()
    Revert "workqueue: allow work_on_cpu() to be called recursively"
    PCI: Avoid unnecessary CPU switch when calling driver .probe() method

    Linus Torvalds
     

13 Dec, 2013

6 commits

  • We run into this bug:
    [ 2736.063245] Unable to handle kernel paging request for data at address 0x00000000
    [ 2736.063293] Faulting instruction address: 0xc00000000037efb0
    [ 2736.063300] Oops: Kernel access of bad area, sig: 11 [#1]
    [ 2736.063303] SMP NR_CPUS=2048 NUMA pSeries
    [ 2736.063310] Modules linked in: sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6table_security ip6table_raw ip6t_REJECT iptable_nat nf_nat_ipv4 iptable_mangle iptable_security iptable_raw ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ebtable_filter ebtables ip6table_filter iptable_filter ip_tables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 nf_nat nf_conntrack ip6_tables ibmveth pseries_rng nx_crypto nfsd auth_rpcgss nfs_acl lockd sunrpc binfmt_misc xfs libcrc32c dm_service_time sd_mod crc_t10dif crct10dif_common ibmvfc scsi_transport_fc scsi_tgt dm_mirror dm_region_hash dm_log dm_multipath dm_mod
    [ 2736.063383] CPU: 1 PID: 7128 Comm: ssh Not tainted 3.10.0-48.el7.ppc64 #1
    [ 2736.063389] task: c000000131930120 ti: c0000001319a0000 task.ti: c0000001319a0000
    [ 2736.063394] NIP: c00000000037efb0 LR: c0000000006c40f8 CTR: 0000000000000000
    [ 2736.063399] REGS: c0000001319a3870 TRAP: 0300 Not tainted (3.10.0-48.el7.ppc64)
    [ 2736.063403] MSR: 8000000000009032 CR: 28824242 XER: 20000000
    [ 2736.063415] SOFTE: 0
    [ 2736.063418] CFAR: c00000000000908c
    [ 2736.063421] DAR: 0000000000000000, DSISR: 40000000
    [ 2736.063425]
    GPR00: c0000000006c40f8 c0000001319a3af0 c000000001074788 c0000001319a3bf0
    GPR04: 0000000000000000 0000000000000000 0000000000000020 000000000000000a
    GPR08: fffffffe00000002 00000000ffff0000 0000000080000001 c000000000924888
    GPR12: 0000000028824248 c000000007e00400 00001fffffa0f998 0000000000000000
    GPR16: 0000000000000022 00001fffffa0f998 0000010022e92470 0000000000000000
    GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    GPR24: 0000000000000000 c000000000f4a828 00003ffffe527108 0000000000000000
    GPR28: c000000000f4a730 c000000000f4a828 0000000000000000 c0000001319a3bf0
    [ 2736.063498] NIP [c00000000037efb0] .__list_add+0x30/0x110
    [ 2736.063504] LR [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264
    [ 2736.063508] PACATMSCRATCH [800000000280f032]
    [ 2736.063511] Call Trace:
    [ 2736.063516] [c0000001319a3af0] [c0000001319a3b80] 0xc0000001319a3b80 (unreliable)
    [ 2736.063523] [c0000001319a3b80] [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264
    [ 2736.063530] [c0000001319a3c50] [c0000000006c1bb0] .down_write+0x70/0x78
    [ 2736.063536] [c0000001319a3cd0] [c0000000002e5ffc] .keyctl_get_persistent+0x20c/0x320
    [ 2736.063542] [c0000001319a3dc0] [c0000000002e2388] .SyS_keyctl+0x238/0x260
    [ 2736.063548] [c0000001319a3e30] [c000000000009e7c] syscall_exit+0x0/0x7c
    [ 2736.063553] Instruction dump:
    [ 2736.063556] 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7cbd2b78 7c9e2378 7c7f1b78 f8010010
    [ 2736.063566] f821ff71 e8a50008 7fa52040 40de00c0 7fbd2840 40de0094 7fbff040
    [ 2736.063579] ---[ end trace 2708241785538296 ]---

    It's caused by uninitialized persistent_keyring_register_sem.

    The bug was introduced by commit f36f8c75, which contains two typos:
    CONFIG_KEYS_KERBEROS_CACHE should be CONFIG_PERSISTENT_KEYRINGS and
    krb_cache_register_sem should be persistent_keyring_register_sem.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: David Howells

    Xiao Guangrong
     
  • Always remove generated SYSTEM_TRUSTED_KEYRING files while doing make mrproper.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: David Howells

    Kirill Tkhai
     
  • Fix the gathering of certificates from both the source tree and the build tree
    to correctly calculate the pathnames of all the certificates.

    The problem was that if the default generated cert, signing_key.x509, didn't
    exist then it would not have a path attached and if it did, it would have a
    path attached.

    This means that the contents of kernel/.x509.list would change between the
    first compilation in a directory and the second. After the second it would
    remain stable because the signing_key.x509 file exists.

    The consequence was that the kernel would get relinked unconditionally on the
    second recompilation. The second recompilation would also show something like
    this:

    X.509 certificate list changed
    CERTS kernel/x509_certificate_list
    - Including cert /home/torvalds/v2.6/linux/signing_key.x509
    AS kernel/system_certificates.o
    LD kernel/built-in.o

    which is why the relink would happen.

    Unfortunately, it isn't a simple matter of just sticking a path on the front
    of the filename of the certificate in the build directory as make can't then
    work out how to build it.

    So the path has to be prepended to the name for sorting and duplicate
    elimination and then removed for the make rule if it is in the build tree.

    Reported-by: Linus Torvalds
    Signed-off-by: David Howells

    David Howells
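
    The path handling described above can be sketched as follows. This is
    not the kernel Makefile logic: the function names and the /tmp/build
    directory are hypothetical, source-tree certificates are left out, and
    the model only shows why the list differs between the first and second
    build before the fix and stays stable after it.

```python
import posixpath

GENERATED = "signing_key.x509"  # default generated cert named in the text


def cert_list_before_fix(build_dir, files_on_disk):
    """Files found on disk carry a path; a missing default cert does not."""
    certs = {posixpath.join(build_dir, name) for name in files_on_disk}
    if GENERATED not in files_on_disk:
        certs.add(GENERATED)  # bare name: nothing on disk to take a path from
    return sorted(certs)


def cert_list_after_fix(build_dir, files_on_disk):
    """Attach the path for sorting and de-duplication, then strip it again
    for build-tree entries so make can still find a rule to build them."""
    names = set(files_on_disk) | {GENERATED}
    with_path = sorted({posixpath.join(build_dir, name) for name in names})
    prefix = build_dir.rstrip("/") + "/"
    return [p[len(prefix):] if p.startswith(prefix) else p for p in with_path]


if __name__ == "__main__":
    build = "/tmp/build"  # hypothetical build directory
    first, second = [], [GENERATED]  # cert absent, then present
    print("before fix:", cert_list_before_fix(build, first),
          cert_list_before_fix(build, second))
    print("after fix: ", cert_list_after_fix(build, first),
          cert_list_after_fix(build, second))
```

    Because the list is identical on both runs after the fix, nothing in
    this model would trigger the "X.509 certificate list changed" rebuild
    on the second compilation.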
     
  • Pull misc keyrings fixes from David Howells:
    "These break down into five sets:

    - A patch to error handling in the big_key type for huge payloads.
    If the payload is larger than the "low limit" and the backing store
    allocation fails, then big_key_instantiate() doesn't clear the
    payload pointers in the key, assuming them to have been previously
    cleared - but only one of them is.

    Unfortunately, the garbage collector still calls big_key_destroy()
    when it sees one of the pointers with a weird value in it (and not
    NULL) which it then tries to clean up.

    - Three patches to fix the keyring type:

    * A patch to fix the hash function to correctly divide keyrings off
    from keys in the topology of the tree inside the associative
    array. This is only a problem if searching through nested
    keyrings - and only if the hash function incorrectly puts a
    keyring outside of the 0 branch of the root node.

    * A patch to fix keyrings' use of the associative array. The
    __key_link_begin() function initially passes a NULL key pointer
    to assoc_array_insert() on the basis that it's holding a place in
    the tree whilst it does more allocation and stuff.

    This is only a problem when a node contains 16 keys that match at
    that level and we want to add an also matching 17th. This should
    easily be manufactured with a keyring full of keyrings (without
    chucking any other sort of key into the mix) - except for (a)
    above which makes it on average adding the 65th keyring.

    * A patch to fix searching down through nested keyrings, where any
    keyring in the set has more than 16 keyrings and none of the
    first keyrings we look through has a match (before the tree
    iteration needs to step to a more distal node).

    Test in keyutils test suite:

    http://git.kernel.org/cgit/linux/kernel/git/dhowells/keyutils.git/commit/?id=8b4ae963ed92523aea18dfbb8cab3f4979e13bd1

    - A patch to fix the big_key type's use of a shmem file as its
    backing store causing audit messages and LSM check failures. This
    is done by setting S_PRIVATE on the file to avoid LSM checks on the
    file (access to the shmem file goes through the keyctl() interface
    and so is gated by the LSM that way).

    This isn't normally a problem if a key is used by the context that
    generated it - and it's currently only used by libkrb5.

    Test in keyutils test suite:

    http://git.kernel.org/cgit/linux/kernel/git/dhowells/keyutils.git/commit/?id=d9a53cbab42c293962f2f78f7190253fc73bd32e

    - A patch to add a generated file to .gitignore.

    - A patch to fix the alignment of the system certificate data such
    that it works on s390. As I understand it, on the s390 arch,
    symbols must be 2-byte aligned because loading the address discards
    the least-significant bit"

    * tag 'keys-devel-20131210' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    KEYS: correct alignment of system_certificate_list content in assembly file
    Ignore generated file kernel/x509_certificate_list
    security: shmem: implement kernel private shmem inodes
    KEYS: Fix searching of nested keyrings
    KEYS: Fix multiple key add into associative array
    KEYS: Fix the keyring hash function
    KEYS: Pre-clear struct key on allocation

    Linus Torvalds
     
  • When debugging the read-only hugepage case, I was confused by the fact
    that get_futex_key() did an access_ok() only for the non-shared futex
    case, since the user address checking really isn't in any way specific
    to the private key handling.

    Now, it turns out that the shared key handling does effectively do the
    equivalent checks inside get_user_pages_fast() (it doesn't actually
    check the address range on x86, but does check the page protections for
    being a user page). So it wasn't actually a bug, but the fact that we
    treat the address differently for private and shared futexes threw me
    for a loop.

    Just move the check up, so that it gets done for both cases. Also, use
    the 'rw' parameter for the type, even if it doesn't actually matter any
    more (it's a historical artifact of the old racy i386 "page faults from
    kernel space don't check write protections").

    Cc: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The hugepage code had the exact same bug that regular pages had in
    commit 7485d0d3758e ("futexes: Remove rw parameter from
    get_futex_key()").

    The regular page case was fixed by commit 9ea71503a8ed ("futex: Fix
    regression with read only mappings"), but the transparent hugepage case
    (added in a5b338f2b0b1: "thp: update futex compound knowledge")
    remained broken.

    Found by Dave Jones and his trinity tool.

    Reported-and-tested-by: Dave Jones
    Cc: stable@kernel.org # v2.6.38+
    Acked-by: Thomas Gleixner
    Cc: Mel Gorman
    Cc: Darren Hart
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

11 Dec, 2013

4 commits

  • Christian suffers from a bad BIOS that wrecks his i5's TSC sync. This
    results in him occasionally seeing time going backwards - which
    crashes the scheduler ...

    Most of our time accounting can actually handle that, except the most
    common one: the tick time update of sched_fair.

    There is a further problem with that code; previously we assumed that
    because we get a tick every TICK_NSEC our time delta could never
    exceed 32bits and math was simpler.

    However, ever since Frederic managed to get NO_HZ_FULL merged, this is
    no longer the case, since a task can now run for a long time indeed
    without getting a tick. It takes only about 4.3 seconds (2^32 ns) to
    overflow our u32 in nanoseconds.

    This means we need not only to deal better with time going backwards,
    but also to be able to deal with large deltas.

    This patch reworks the entire code and uses mul_u64_u32_shr() as
    proposed by Andy a long while ago.

    We express our virtual time scale factor as a u32 multiplier and a
    right shift; the 32bit mul_u64_u32_shr() implementation reduces to a
    single 32x32->64 multiply if the time delta is still short (the common
    case).

    For 64bit a 64x64->128 multiply can be used if ARCH_SUPPORTS_INT128.

    Reported-and-Tested-by: Christian Engelmayer
    Signed-off-by: Peter Zijlstra
    Cc: fweisbec@gmail.com
    Cc: Paul Turner
    Cc: Stanislaw Gruszka
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20131118172706.GI3866@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
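
    The multiply-and-shift described above can be sketched in userspace C.
    This is a minimal illustration under stated assumptions, not the kernel
    source: the name mul_u64_u32_shr_sketch() is hypothetical, and the
    sketch assumes a shift of at most 32. It shows why a short delta needs
    only one 32x32->64 multiply, while a delta beyond 2^32 ns (roughly
    4.29 seconds) adds a second multiply for the high half.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace sketch of a (u64 * u32) >> shift helper.
 * The 64-bit operand is split into two 32-bit halves so that only
 * 32x32->64 multiplies are needed. Assumes shift <= 32; the result
 * is truncated to 64 bits.
 */
uint64_t mul_u64_u32_shr_sketch(uint64_t a, uint32_t mul, unsigned int shift)
{
    uint32_t ah = (uint32_t)(a >> 32);  /* high half of the delta */
    uint32_t al = (uint32_t)a;          /* low half of the delta */
    uint64_t ret;

    assert(shift <= 32);

    /* Short delta (common case): a single 32x32->64 multiply. */
    ret = ((uint64_t)al * mul) >> shift;

    /* Long delta: the high half contributes only when it is non-zero. */
    if (ah)
        ret += ((uint64_t)ah * mul) << (32 - shift);

    return ret;
}
```

    With a multiplier of 2^31 and a shift of 32 the helper halves its
    input, including for a delta of 5,000,000,000 ns that no longer fits
    in a u32.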
     
  • Yinghai reported that he saw a division by zero in sg_capacity on his
    EX parts. Make sure to always initialize power_orig now that we
    actually use it.

    Ideally build_sched_domains() -> init_sched_groups_power() would also
    initialize this; but for some yet unexplained reason some setups seem
    to miss updates there.

    Reported-by: Yinghai Lu
    Tested-by: Yinghai Lu
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-l8ng2m9uml6fhibln8wqpom7@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Apart from data-type specific alignment constraints, there are also
    architecture-specific alignment requirements.
    For example, on s390 symbols must be on even addresses implying a 2-byte
    alignment. If the system_certificate_list_end symbol is on an odd address
    and if this address is loaded, the least-significant bit is ignored. As a
    result, load_system_certificate_list() fails to load the certificates
    because of a wrong certificate length calculation.

    To be safe, align system_certificate_list on an 8-byte boundary. Also improve
    the length calculation of the system_certificate_list content. Introduce a
    system_certificate_list_size (8-byte aligned because of unsigned long) variable
    that stores the length. Let the linker calculate this size by introducing
    a start and end label for the certificate content.

    Signed-off-by: Hendrik Brueckner
    Signed-off-by: David Howells

    Hendrik Brueckner
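
    The length miscalculation described above can be modelled in a few
    lines of C. This is only a sketch under assumptions: the function
    names are hypothetical, and the address load that drops the
    least-significant bit is simulated with a mask rather than taken from
    the s390 assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of an address load that discards the least-significant bit. */
uintptr_t load_symbol_even_only(uintptr_t addr)
{
    return addr & ~(uintptr_t)1;
}

/* Length derived from start and end symbol addresses loaded at run time. */
size_t length_from_symbols(uintptr_t start, uintptr_t end)
{
    return (size_t)(load_symbol_even_only(end) - load_symbol_even_only(start));
}

/* Round an address up to the next multiple of align (a power of two). */
uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}
```

    In this model, content of odd length starting at an even address ends
    on an odd address, so the computed length comes out one byte short.
    Storing the size in a separate variable, as the commit does, avoids
    depending on the end address at all.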
     
  • $ git status
    # On branch pending-rebases
    # Untracked files:
    # (use "git add <file>..." to include in what will be committed)
    #
    # kernel/x509_certificate_list
    nothing added to commit but untracked files present (use "git add" to track)
    $

    Signed-off-by: Rusty Russell
    Signed-off-by: David Howells

    Rusty Russell
     

08 Dec, 2013

1 commit

  • Add a flag to tell the PCI subsystem that the kernel is shutting down in
    preparation for kexec'ing a new kernel. Add code in the PCI subsystem to
    use this flag to clear the Bus Master bit on PCI devices only in the case
    of a kexec reboot.

    This fixes a power-off problem on the Acer Aspire V5-573G and likely other
    machines, and avoids any other issues caused by clearing the Bus Master
    bit on PCI devices in the normal shutdown path. The problem was introduced
    by b566a22c2332 ("PCI: disable Bus Master on PCI device shutdown").

    This patch is based on discussion at
    http://marc.info/?l=linux-pci&m=138425645204355&w=2

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=63861
    Reported-by: Chang Liu
    Signed-off-by: Khalid Aziz
    Signed-off-by: Bjorn Helgaas
    Acked-by: Konstantin Khlebnikov
    Cc: stable@vger.kernel.org # v3.5+

    Khalid Aziz
     

07 Dec, 2013

2 commits

  • ae7f164a09 ("cgroup: move cgroup->subsys[] assignment to
    online_css()") moved cgroup->subsys[] assignments later in
    cgroup_create() but didn't update the error handling path accordingly,
    leading to the following oops and leaking later css's after an
    online_css() failure. The oops is from the cgroup destruction path
    being invoked on the partially constructed cgroup, which is not ready
    to handle empty slots in the cgrp->subsys[] array.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: [] cgroup_destroy_locked+0x118/0x2f0
    PGD a780a067 PUD aadbe067 PMD 0
    Oops: 0000 [#1] SMP
    Modules linked in:
    CPU: 6 PID: 7360 Comm: mkdir Not tainted 3.13.0-rc2+ #69
    Hardware name:
    task: ffff8800b9dbec00 ti: ffff8800a781a000 task.ti: ffff8800a781a000
    RIP: 0010:[] [] cgroup_destroy_locked+0x118/0x2f0
    RSP: 0018:ffff8800a781bd98 EFLAGS: 00010282
    RAX: ffff880586903878 RBX: ffff880586903800 RCX: ffff880586903820
    RDX: ffff880586903860 RSI: ffff8800a781bdb0 RDI: ffff880586903820
    RBP: ffff8800a781bde8 R08: ffff88060e0b8048 R09: ffffffff811d7bc1
    R10: 000000000000008c R11: 0000000000000001 R12: ffff8800a72286c0
    R13: 0000000000000000 R14: ffffffff81cf7a40 R15: 0000000000000001
    FS: 00007f60ecda57a0(0000) GS:ffff8806272c0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000008 CR3: 00000000a7a03000 CR4: 00000000000007e0
    Stack:
    ffff880586903860 ffff880586903910 ffff8800a72286c0 ffff880586903820
    ffffffff81cf7a40 ffff880586903800 ffff88060e0b8018 ffffffff81cf7a40
    ffff8800b9dbec00 ffff8800b9dbf098 ffff8800a781bec8 ffffffff810ef5bf
    Call Trace:
    [] cgroup_mkdir+0x55f/0x5f0
    [] vfs_mkdir+0xee/0x140
    [] SyS_mkdirat+0x6e/0xf0
    [] SyS_mkdir+0x19/0x20
    [] system_call_fastpath+0x16/0x1b

    This patch moves reference bumping inside the online_css() loop, clears
    css_ar[] as css's are brought online successfully, and updates the
    err_destroy path so that either a css is fully online and destroyed by
    cgroup_destroy_locked() or the error path frees it. This creates
    duplicate css free logic in the error path but it will be cleaned up
    soon.

    v2: Li pointed out that cgroup_destroy_locked() would do NULL-deref if
    invoked with a cgroup which doesn't have all css's populated.
    Update cgroup_destroy_locked() so that it skips NULL css's.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Reported-by: Vladimir Davydov
    Cc: stable@vger.kernel.org # v3.12+

    Tejun Heo
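
    The failure mode described above, tearing down an object whose slot
    array is only partly filled, can be illustrated with a small userspace
    sketch. All names here (struct group, group_create(), group_destroy(),
    NR_SUBSYS) are hypothetical stand-ins, not the cgroup code; the point
    is only that the teardown loop must skip empty slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_SUBSYS 4  /* hypothetical number of controller slots */

struct state { int id; };

struct group {
    struct state *subsys[NR_SUBSYS];
};

/* Tear down a possibly half-built group; returns how many slots were freed. */
int group_destroy(struct group *g)
{
    int freed = 0;

    assert(g != NULL);
    for (size_t i = 0; i < NR_SUBSYS; i++) {
        if (!g->subsys[i])
            continue;  /* slot never came online: nothing to do */
        free(g->subsys[i]);
        g->subsys[i] = NULL;
        freed++;
    }
    return freed;
}

/* Build a group; fail_at simulates a failure at that slot (-1 for none). */
int group_create(struct group *g, int fail_at)
{
    for (size_t i = 0; i < NR_SUBSYS; i++)
        g->subsys[i] = NULL;

    for (int i = 0; i < NR_SUBSYS; i++) {
        struct state *s = malloc(sizeof(*s));

        if (!s || i == fail_at) {
            free(s);           /* the state that failed is freed here */
            group_destroy(g);  /* earlier, fully built states go here */
            return -1;
        }
        s->id = i;
        g->subsys[i] = s;
    }
    return 0;
}
```

    Each state has exactly one owner on the error path: either it was
    published in the array and the destroy routine frees it, or it was not
    and the creator frees it directly.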
     
  • …t/rostedt/linux-trace

    Pull tracing fix from Steven Rostedt:
    "A regression showed up in which there's a large delay when enabling
    all events. This was prevalent when FTRACE_SELFTEST was enabled, which
    enables all events several times, and caused the system bootup to
    pause for over a minute.

    This was tracked down to an addition of a synchronize_sched()
    performed when system call tracepoints are unregistered.

    The synchronize_sched() is needed between the unregistering of the
    system call tracepoint and a deletion of a tracing instance buffer.
    But placing the synchronize_sched() in the unreg of *every* system
    call tracepoint is a bit overboard. A single synchronize_sched()
    before the deletion of the instance is sufficient"

    * tag 'trace-fixes-3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Only run synchronize_sched() at instance deletion time

    Linus Torvalds
     

06 Dec, 2013

1 commit

  • It has been reported that boot up with FTRACE_SELFTEST enabled can take a
    very long time. There can be stalls of over a minute.

    This was tracked down to the synchronize_sched() called when a system call
    event is disabled. As the self tests enable and disable thousands of events,
    this makes the synchronize_sched() get called thousands of times.

    The synchronize_sched() was added with d562aff93bfb53 "tracing: Add support
    for SOFT_DISABLE to syscall events", which caused this regression (added
    in 3.13-rc1).

    The synchronize_sched() is there to protect against the events being
    accessed when a tracer instance is being deleted. When an instance is
    being deleted, all the events associated with it are unregistered. The
    synchronize_sched() makes sure that no more users are running when it
    finishes.

    Instead of calling synchronize_sched() for all syscall events, we only
    need to call it once, after the events are unregistered and before the
    instance is deleted. The event_mutex is held during this action to
    prevent new users from enabling events.

    Link: http://lkml.kernel.org/r/20131203124120.427b9661@gandalf.local.home

    Reported-by: Petr Mladek
    Acked-by: Tom Zanussi
    Acked-by: Petr Mladek
    Tested-by: Petr Mladek
    Signed-off-by: Steven Rostedt

    Steven Rostedt
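
    The change in shape described above, one wait per unregistered event
    versus a single wait before deletion, can be shown with a toy model.
    This sketch makes loud assumptions: wait_for_readers() is a
    hypothetical stand-in that only counts calls, and nothing here uses
    the tracing or RCU APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for an expensive grace-period wait; it only counts calls. */
static unsigned long sync_calls;

void wait_for_readers(void) { sync_calls++; }
unsigned long sync_call_count(void) { return sync_calls; }
void reset_sync_calls(void) { sync_calls = 0; }

struct event { int registered; };

/* Old shape: every unregistration pays for its own wait. */
void unregister_all_sync_each(struct event *ev, size_t n)
{
    assert(ev != NULL);
    for (size_t i = 0; i < n; i++) {
        ev[i].registered = 0;
        wait_for_readers();
    }
}

/* New shape: unregister everything, then wait once before deletion. */
void unregister_all_sync_once(struct event *ev, size_t n)
{
    assert(ev != NULL);
    for (size_t i = 0; i < n; i++)
        ev[i].registered = 0;
    wait_for_readers();
}
```

    For n events the first variant waits n times and the second waits
    once, which is why enabling and disabling thousands of events no
    longer stalls.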
     

05 Dec, 2013

1 commit

  • Pull timer fixes from Thomas Gleixner:

    - timekeeping: Cure a subtle drift issue on GENERIC_TIME_VSYSCALL_OLD

    - nohz: Make CONFIG_NO_HZ=n and the nohz=off command line option behave
    the same way. Fixes long-standing load accounting wreckage.

    - clocksource/ARM: Kconfig update to avoid ARM=n wreckage

    - clocksource/ARM: Fixlets for the AT91 and SH clocksource/clockevents

    - Trivial documentation update and kzalloc conversion from akpm's pile

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    nohz: Fix another inconsistency between CONFIG_NO_HZ=n and nohz=off
    time: Fix 1ns/tick drift w/ GENERIC_TIME_VSYSCALL_OLD
    clocksource: arm_arch_timer: Hide eventstream Kconfig on non-ARM
    clocksource: sh_tmu: Add clk_prepare/unprepare support
    clocksource: sh_tmu: Release clock when sh_tmu_register() fails
    clocksource: sh_mtu2: Add clk_prepare/unprepare support
    clocksource: sh_mtu2: Release clock when sh_mtu2_register() fails
    ARM: at91: rm9200: switch back to clockevents_config_and_register
    tick: Document tick_do_timer_cpu
    timer: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)
    NOHZ: Check for nohz active instead of nohz enabled

    Linus Torvalds
     

03 Dec, 2013

3 commits

  • Pull irq fixes from Thomas Gleixner:
    - Correction of fuzzy and fragile IRQ_RETVAL macro
    - IRQ related resume fix affecting only XEN
    - ARM/GIC fix for chained GIC controllers

    * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip: Gic: fix boot for chained gics
    irq: Enable all irqs unconditionally in irq_resume
    genirq: Correct fuzzy and fragile IRQ_RETVAL() definition

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar:
    "Various smaller fixlets, all over the place"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/doc: Fix generation of device-drivers
    sched: Expose preempt_schedule_irq()
    sched: Fix a trivial typo in comments
    sched: Remove unused variable in 'struct sched_domain'
    sched: Avoid NULL dereference on sd_busy
    sched: Check sched_domain before computing group power
    MAINTAINERS: Update file patterns in the lockdep and scheduler entries

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Misc kernel and tooling fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    tools lib traceevent: Fix conversion of pointer to integer of different size
    perf/trace: Properly use u64 to hold event_id
    perf: Remove fragile swevent hlist optimization
    ftrace, perf: Avoid infinite event generation loop
    tools lib traceevent: Fix use of multiple options in processing field
    perf header: Fix possible memory leaks in process_group_desc()
    perf header: Fix bogus group name
    perf tools: Tag thread comm as overriden

    Linus Torvalds