19 Sep, 2013

2 commits


16 Sep, 2013

1 commit

  • sched_info_depart seems to be only called from
    sched_info_switch(), so only on involuntary task switch.

    Fix the comment to match.

    Signed-off-by: Michael S. Tsirkin
    Cc: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: KOSAKI Motohiro
    Link: http://lkml.kernel.org/r/20130916083036.GA1113@redhat.com
    Signed-off-by: Ingo Molnar

    Michael S. Tsirkin
     

14 Sep, 2013

1 commit

  • Pull aio changes from Ben LaHaise:
    "First off, sorry for this pull request being late in the merge window.
    Al had raised a couple of concerns about 2 items in the series below.
    I addressed the first issue (the race introduced by Gu's use of
    mm_populate()), but he has not provided any further details on how he
    wants to rework the anon_inode.c changes (which were sent out months
    ago but have yet to be commented on).

    The bulk of the changes have been sitting in the -next tree for a few
    months, with all the issues raised being addressed"

    * git://git.kvack.org/~bcrl/aio-next: (22 commits)
    aio: rcu_read_lock protection for new rcu_dereference calls
    aio: fix race in ring buffer page lookup introduced by page migration support
    aio: fix rcu sparse warnings introduced by ioctx table lookup patch
    aio: remove unnecessary debugging from aio_free_ring()
    aio: table lookup: verify ctx pointer
    staging/lustre: kiocb->ki_left is removed
    aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
    aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
    aio: convert the ioctx list to table lookup v3
    aio: double aio_max_nr in calculations
    aio: Kill ki_dtor
    aio: Kill ki_users
    aio: Kill unneeded kiocb members
    aio: Kill aio_rw_vect_retry()
    aio: Don't use ctx->tail unnecessarily
    aio: io_cancel() no longer returns the io_event
    aio: percpu ioctx refcount
    aio: percpu reqs_available
    aio: reqs_active -> reqs_available
    aio: fix build when migration is disabled
    ...

    Linus Torvalds
     

13 Sep, 2013

12 commits

  • After the last architecture switched to generic hard irqs the config
    options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
    for !CONFIG_GENERIC_HARDIRQS can be removed.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • Merge more patches from Andrew Morton:
    "The rest of MM. Plus one misc cleanup"

    * emailed patches from Andrew Morton : (35 commits)
    mm/Kconfig: add MMU dependency for MIGRATION.
    kernel: replace strict_strto*() with kstrto*()
    mm, thp: count thp_fault_fallback anytime thp fault fails
    thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
    thp: do_huge_pmd_anonymous_page() cleanup
    thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
    mm: cleanup add_to_page_cache_locked()
    thp: account anon transparent huge pages into NR_ANON_PAGES
    truncate: drop 'oldsize' truncate_pagecache() parameter
    mm: make lru_add_drain_all() selective
    memcg: document cgroup dirty/writeback memory statistics
    memcg: add per cgroup writeback pages accounting
    memcg: check for proper lock held in mem_cgroup_update_page_stat
    memcg: remove MEMCG_NR_FILE_MAPPED
    memcg: reduce function dereference
    memcg: avoid overflow caused by PAGE_ALIGN
    memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
    memcg: correct RESOURCE_MAX to ULLONG_MAX
    mm: memcg: do not trap chargers with full callstack on OOM
    mm: memcg: rework and document OOM waiting and wakeup
    ...

    Linus Torvalds
     
  • The usage of strict_strto*() is not preferred, because strict_strto*() is
    obsolete. Thus, kstrto*() should be used.

    Signed-off-by: Jingoo Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jingoo Han
     
  • This function dereferences res far too often, so optimize it.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Jeff Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • PAGE_ALIGN rounds up to the next page boundary, so the aligned value
    can overflow, for example when writing the MAX value to
    *.limit_in_bytes.

    $ cat /cgroup/memory/memory.limit_in_bytes
    18446744073709551615

    # echo 18446744073709551615 > /cgroup/memory/memory.limit_in_bytes
    bash: echo: write error: Invalid argument

    Some user programs might depend on such behaviour (like libcg, which
    reads the value in a snapshot and later uses it to reset the cgroup),
    and that will cause confusion. So we need to fix it.
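
    The wrap-around described above can be reproduced in user space. This
    is only a sketch: the 4096-byte page size, the sketch_* names and the
    guard in sketch_page_align_safe() are assumptions for illustration, not
    the code of the actual fix.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL /* assumed page size */

/* Round up to the next page boundary, the way PAGE_ALIGN does. */
static uint64_t sketch_page_align(uint64_t v)
{
    return (v + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Hypothetical guard: if rounding up wrapped around, keep the original value. */
static uint64_t sketch_page_align_safe(uint64_t v)
{
    uint64_t aligned = sketch_page_align(v);

    return aligned < v ? v : aligned;
}
```

    With these helpers, aligning 18446744073709551615 wraps to 0, while the
    guarded variant returns the value unchanged.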

    Signed-off-by: Sha Zhengju
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Jeff Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • RESOURCE_MAX is far too general a name; change it to RES_COUNTER_MAX.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Jeff Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • Pull vfs pile 4 from Al Viro:
    "list_lru pile, mostly"

    This came out of Andrew's pile, Al ended up doing the merge work so that
    Andrew didn't have to.

    Additionally, a few fixes.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
    super: fix for destroy lrus
    list_lru: dynamically adjust node arrays
    shrinker: Kill old ->shrink API.
    shrinker: convert remaining shrinkers to count/scan API
    staging/lustre/libcfs: cleanup linux-mem.h
    staging/lustre/ptlrpc: convert to new shrinker API
    staging/lustre/obdclass: convert lu_object shrinker to count/scan API
    staging/lustre/ldlm: convert to shrinkers to count/scan API
    hugepage: convert huge zero page shrinker to new shrinker API
    i915: bail out earlier when shrinker cannot acquire mutex
    drivers: convert shrinkers to new count/scan API
    fs: convert fs shrinkers to new scan/count API
    xfs: fix dquot isolation hang
    xfs-convert-dquot-cache-lru-to-list_lru-fix
    xfs: convert dquot cache lru to list_lru
    xfs: rework buffer dispose list tracking
    xfs-convert-buftarg-lru-to-generic-code-fix
    xfs: convert buftarg LRU to generic code
    fs: convert inode and dentry shrinking to be node aware
    vmscan: per-node deferred work
    ...

    Linus Torvalds
     
  • Pull ACPI and power management fixes from Rafael Wysocki:
    "All of these commits are fixes that have emerged recently and some of
    them fix bugs introduced during this merge window.

    Specifics:

    1) ACPI-based PCI hotplug (ACPIPHP) fixes related to spurious events

    After the recent ACPIPHP changes we've seen some interesting
    breakage on a system that triggers device check notifications
    during boot for non-existing devices. Although those
    notifications are really spurious, we should be able to deal with
    them nevertheless and that shouldn't introduce too much overhead.
    Four commits to make that work properly.

    2) Memory hotplug and hibernation mutual exclusion rework

    This was meant to be a cleanup, but it happens to fix a classical
    ABBA deadlock between system suspend/hibernation and ACPI memory
    hotplug which is possible if they are started roughly at the same
    time. Three commits rework memory hotplug so that it doesn't
    acquire pm_mutex and make hibernation use device_hotplug_lock
    which prevents it from racing with memory hotplug.

    3) ACPI Intel LPSS (Low-Power Subsystem) driver crash fix

    The ACPI LPSS driver crashes during boot on Apple Macbook Air with
    Haswell that has slightly unusual BIOS configuration in which one
    of the LPSS device's _CRS method doesn't return all of the
    information expected by the driver. Fix from Mika Westerberg, for
    stable.

    4) ACPICA fix related to Store->ArgX operation

    AML interpreter fix for obscure breakage that causes AML to be
    executed incorrectly on some machines (observed in practice).
    From Bob Moore.

    5) ACPI core fix for PCI ACPI device objects lookup

    There still are cases in which there is more than one ACPI device
    object matching a given PCI device and we don't choose the one
    that the BIOS expects us to choose, so this makes the lookup take
    more criteria into account in those cases.

    6) Fix to prevent cpuidle from crashing in some rare cases

    If the result of cpuidle_get_driver() is NULL, which can happen on
    some systems, cpuidle_driver_ref() will crash trying to use that
    pointer and Daniel Fu's fix prevents that from happening.

    7) cpufreq fixes related to CPU hotplug

    Stephen Boyd reported a number of concurrency problems with
    cpufreq related to CPU hotplug which are addressed by a series of
    fixes from Srivatsa S Bhat and Viresh Kumar.

    8) cpufreq fix for time conversion in time_in_state attribute

    Time conversion carried out by cpufreq when user space attempts to
    read /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state
    won't work correctly if cputime_t doesn't map directly to jiffies.
    Fix from Andreas Schwab.

    9) Revert of a troublesome cpufreq commit

    Commit 7c30ed5 (cpufreq: make sure frequency transitions are
    serialized) was intended to address some known concurrency
    problems in cpufreq related to the ordering of transitions, but
    unfortunately it introduced several problems of its own, so I
    decided to revert it now and address the original problems later
    in a more robust way.

    10) Intel Haswell CPU models for intel_pstate from Nell Hardcastle.

    11) cpufreq fixes related to system suspend/resume

    The recent cpufreq changes that made it preserve CPU sysfs
    attributes over suspend/resume cycles introduced a possible NULL
    pointer dereference that caused it to crash during the second
    attempt to suspend. Three commits from Srivatsa S Bhat fix that
    problem and a couple of related issues.

    12) cpufreq locking fix

    cpufreq_policy_restore() should acquire the lock for reading, but
    it acquires it for writing. Fix from Lan Tianyu"

    * tag 'pm+acpi-fixes-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (25 commits)
    cpufreq: Acquire the lock in cpufreq_policy_restore() for reading
    cpufreq: Prevent problems in update_policy_cpu() if last_cpu == new_cpu
    cpufreq: Restructure if/else block to avoid unintended behavior
    cpufreq: Fix crash in cpufreq-stats during suspend/resume
    intel_pstate: Add Haswell CPU models
    Revert "cpufreq: make sure frequency transitions are serialized"
    cpufreq: Use signed type for 'ret' variable, to store negative error values
    cpufreq: Remove temporary fix for race between CPU hotplug and sysfs-writes
    cpufreq: Synchronize the cpufreq store_*() routines with CPU hotplug
    cpufreq: Invoke __cpufreq_remove_dev_finish() after releasing cpu_hotplug.lock
    cpufreq: Split __cpufreq_remove_dev() into two parts
    cpufreq: Fix wrong time unit conversion
    cpufreq: serialize calls to __cpufreq_governor()
    cpufreq: don't allow governor limits to be changed when it is disabled
    ACPI / bind: Prefer device objects with _STA to those without it
    ACPI / hotplug / PCI: Avoid parent bus rescans on spurious device checks
    ACPI / hotplug / PCI: Use _OST to notify firmware about notify status
    ACPI / hotplug / PCI: Avoid doing too much for spurious notifies
    ACPICA: Fix for a Store->ArgX when ArgX contains a reference to a field.
    ACPI / hotplug / PCI: Don't trim devices before scanning the namespace
    ...

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Various fixes.

    The -g perf report lockup you reported is only partially addressed,
    patches that fix the excessive runtime are still being worked on"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86: Fix uncore PCI fixed counter handling
    uprobes: Fix utask->depth accounting in handle_trampoline()
    perf/x86: Add constraint for IVB CYCLE_ACTIVITY:CYCLES_LDM_PENDING
    perf: Fix up MMAP2 buffer space reservation
    perf tools: Add attr->mmap2 support
    perf kvm: Fix sample_type manipulation
    perf evlist: Fix id pos in perf_evlist__open()
    perf trace: Handle perf.data files with no tracepoints
    perf session: Separate progress bar update when processing events
    perf trace: Check if MAP_32BIT is defined
    perf hists: Fix formatting of long symbol names
    perf evlist: Fix parsing with no sample_id_all bit set
    perf tools: Add test for parsing with no sample_id_all bit
    perf trace: Check control+C more often

    Linus Torvalds
     
  • Pull scheduler fix from Ingo Molnar:
    "Performance regression fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Fix load balancing performance regression in should_we_balance()

    Linus Torvalds
     
  • Emmanuel reported that /proc/sched_debug didn't report the right PIDs
    when using namespaces, cure this.

    Reported-by: Emmanuel Deloget
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130909110141.GM31370@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • There is a small race between copy_process() and cgroup_attach_task()
    where child->se.parent,cfs_rq points to invalid (old) ones.

    parent doing fork()            | someone moving the parent to another cgroup
    -------------------------------+---------------------------------------------
    copy_process()
      + dup_task_struct()
        -> parent->se is copied to child->se.
           se.parent,cfs_rq of them point to old ones.

                                     cgroup_attach_task()
                                       + cgroup_task_migrate()
                                         -> parent->cgroup is updated.
                                       + cpu_cgroup_attach()
                                         + sched_move_task()
                                           + task_move_group_fair()
                                             +- set_task_rq()
                                                -> se.parent,cfs_rq of parent
                                                   are updated.

      + cgroup_fork()
        -> parent->cgroup is copied to child->cgroup. (*1)
      + sched_fork()
        + task_fork_fair()
          -> se.parent,cfs_rq of child are accessed
             while they point to old ones. (*2)

    In the worst case, this bug can lead to "use-after-free" and cause a panic,
    because it's the new cgroup's refcount that is incremented at (*1),
    so the old cgroup (and related data) can be freed before (*2).

    In fact, a panic caused by this bug was originally caught in RHEL6.4.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] sched_slice+0x6e/0xa0
    [...]
    Call Trace:
    [] place_entity+0x75/0xa0
    [] task_fork_fair+0xaa/0x160
    [] sched_fork+0x6b/0x140
    [] copy_process+0x5b2/0x1450
    [] ? wake_up_new_task+0xd9/0x130
    [] do_fork+0x94/0x460
    [] ? sys_wait4+0xae/0x100
    [] sys_clone+0x28/0x30
    [] stub_clone+0x13/0x20
    [] ? system_call_fastpath+0x16/0x1b

    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Peter Zijlstra
    Cc:
    Link: http://lkml.kernel.org/r/039601ceae06$733d3130$59b79390$@mxp.nes.nec.co.jp
    Signed-off-by: Ingo Molnar

    Daisuke Nishimura
     

12 Sep, 2013

24 commits

  • Currently utask->depth is simply the number of allocated/pending
    return_instance's in uprobe_task->return_instances list.

    handle_trampoline() should decrement this counter every time we
    handle/free an instance, but due to a typo it does this only if
    ->chained == T. This means that in the likely case this counter
    is never decremented and the probed task can't report more than
    MAX_URETPROBE_DEPTH events.
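
    The accounting problem can be modelled in a few lines of user-space C.
    This is a simplified sketch, not the kernel code: struct
    sketch_instance and sketch_handle() are hypothetical names, and the
    "fixed" flag only selects between the described buggy and corrected
    behaviour.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a pending return instance. */
struct sketch_instance {
    bool chained;
};

/*
 * Handle instances up to and including the first non-chained one and
 * return the remaining depth. With fixed == false the counter is only
 * decremented for chained instances, which models the typo.
 */
static int sketch_handle(const struct sketch_instance *ri, size_t n,
                         int depth, bool fixed)
{
    for (size_t i = 0; i < n; i++) {
        bool chained = ri[i].chained;

        if (fixed || chained)
            depth--;
        if (!chained)
            break;
    }
    return depth;
}
```

    In the common case of a single non-chained instance, the buggy variant
    leaves the depth at 1 while the fixed variant brings it back to 0.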

    Reported-by: Mikhail Kulemin
    Reported-by: Hemant Kumar Shaw
    Signed-off-by: Oleg Nesterov
    Acked-by: Anton Arapov
    Cc: masami.hiramatsu.pt@hitachi.com
    Cc: srikar@linux.vnet.ibm.com
    Cc: systemtap@sourceware.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20130911154726.GA8093@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Gerlando Falauto reported that when HRTICK is enabled, it is
    possible to trigger system deadlocks. These were hard to
    reproduce, as HRTICK has been broken in the past, but seemed
    to be connected to the timekeeping_seq lock.

    Since seqlock/seqcount's aren't supported w/ lockdep, I added
    some extra spinlock based locking and triggered the following
    lockdep output:

    [ 15.849182] ntpd/4062 is trying to acquire lock:
    [ 15.849765] (&(&pool->lock)->rlock){..-...}, at: [] __queue_work+0x145/0x480
    [ 15.850051]
    [ 15.850051] but task is already holding lock:
    [ 15.850051] (timekeeper_lock){-.-.-.}, at: [] do_adjtimex+0x7f/0x100

    [ 15.850051] Chain exists of: &(&pool->lock)->rlock --> &p->pi_lock --> timekeeper_lock
    [ 15.850051] Possible unsafe locking scenario:
    [ 15.850051]
    [ 15.850051]        CPU0                    CPU1
    [ 15.850051]        ----                    ----
    [ 15.850051]   lock(timekeeper_lock);
    [ 15.850051]                                lock(&p->pi_lock);
    [ 15.850051]                                lock(timekeeper_lock);
    [ 15.850051]   lock(&(&pool->lock)->rlock);
    [ 15.850051]
    [ 15.850051] *** DEADLOCK ***

    The deadlock was introduced by 06c017fdd4dc48451a ("timekeeping:
    Hold timekeepering locks in do_adjtimex and hardpps") in 3.10

    This patch avoids this deadlock, by moving the call to
    schedule_delayed_work() outside of the timekeeper lock
    critical section.
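
    The general pattern (decide under the lock, queue the work after
    dropping it) can be sketched as follows. Everything here is a
    user-space model with hypothetical sketch_* names; the "lock" is just a
    flag so that the sketch can record whether work was ever queued while
    it was held.

```c
#include <stdbool.h>

static bool sketch_lock_held;
static int sketch_work_queued;       /* how often work was queued */
static int sketch_queued_under_lock; /* how often that happened under the lock */

static void sketch_lock(void)   { sketch_lock_held = true; }
static void sketch_unlock(void) { sketch_lock_held = false; }

/* Stand-in for queueing deferred work, which may take other locks. */
static void sketch_schedule_work(void)
{
    if (sketch_lock_held)
        sketch_queued_under_lock++;
    sketch_work_queued++;
}

static void sketch_adjust_time(bool need_sync)
{
    bool queue_work;

    sketch_lock();
    /* ... update state protected by the lock ... */
    queue_work = need_sync; /* only remember the decision here */
    sketch_unlock();

    if (queue_work)         /* act on it outside the critical section */
        sketch_schedule_work();
}
```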

    Reported-by: Gerlando Falauto
    Tested-by: Lin Ming
    Signed-off-by: John Stultz
    Cc: Mathieu Desnoyers
    Cc: stable #3.11, 3.10
    Link: http://lkml.kernel.org/r/1378943457-27314-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar

    John Stultz
     
  • Since the panic handlers may produce additional information (via printk)
    for the kernel log, it should be reported as part of the panic output
    saved by kmsg_dump(). Without this re-ordering, nothing that adds
    information to a panic will show up in pstore's view when kmsg_dump runs,
    and is therefore not visible to crash reporting tools that examine pstore
    output.

    Signed-off-by: Kees Cook
    Cc: Anton Vorontsov
    Cc: Colin Cross
    Acked-by: Tony Luck
    Cc: Stephen Boyd
    Cc: Vikram Mulukutla
    Cc: Peter Zijlstra
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Execution can never reach this point, so remove the unnecessary return.

    Signed-off-by: Xishi Qiu
    Suggested-by: Zhang Yanfei
    Reviewed-by: Simon Horman
    Reviewed-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • __ptrace_may_access() checks get_dumpable/ptrace_has_cap/etc if task !=
    current, this can lead to surprising results.

    For example, a sub-thread can't readlink("/proc/self/exe") if the
    executable is not readable. setup_new_exec()->would_dump() notices that
    inode_permission(MAY_READ) fails and then it does
    set_dumpable(suid_dumpable). After that get_dumpable() fails.

    (It is not clear why proc_pid_readlink() checks get_dumpable(), perhaps we
    could add PTRACE_MODE_NODUMPABLE)

    Change __ptrace_may_access() to use same_thread_group() instead of "task
    == current". Any security check is pointless when the tasks share the
    same ->mm.

    Signed-off-by: Mark Grondona
    Signed-off-by: Ben Woodard
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Grondona
     
  • The current two insn slot caches both use module_alloc/module_free to
    allocate and free insn slot cache pages.

    For s390 this is not sufficient since there is the need to allocate insn
    slots that are either within the vmalloc module area or within dma memory.

    Therefore add a mechanism which allows specifying a custom allocator
    for a private insn slot cache.
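
    One way to picture such a mechanism is a cache descriptor that carries
    its own alloc/free callbacks. The sketch below is a user-space
    illustration under stated assumptions: struct sketch_insn_cache and all
    sketch_* functions are hypothetical, malloc() stands in for the default
    page allocator, and a static buffer stands in for a special memory
    area.

```c
#include <stdlib.h>

/* Cache descriptor with pluggable page allocator callbacks. */
struct sketch_insn_cache {
    void *(*alloc)(void);
    void (*free)(void *page);
    int nr_pages;
};

/* Default allocator: plain heap memory. */
static void *sketch_default_alloc(void)      { return malloc(4096); }
static void sketch_default_free(void *page)  { free(page); }

/* Custom allocator: hands out one page from a fixed area. */
static char sketch_special_area[4096];
static void *sketch_special_alloc(void)      { return sketch_special_area; }
static void sketch_special_free(void *page)  { (void)page; }

/* Generic code only talks to the callbacks stored in the cache. */
static void *sketch_get_page(struct sketch_insn_cache *c)
{
    void *page = c->alloc();

    if (page)
        c->nr_pages++;
    return page;
}

static void sketch_put_page(struct sketch_insn_cache *c, void *page)
{
    c->free(page);
    c->nr_pages--;
}
```

    The generic get/put helpers stay unchanged no matter which allocator a
    given cache was set up with.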

    Signed-off-by: Heiko Carstens
    Acked-by: Masami Hiramatsu
    Cc: Ananth N Mavinakayanahalli
    Cc: Ingo Molnar
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • The current kprobes insn caches allocate memory areas for insn slots
    with module_alloc(). The assumption is that the kernel image and module
    area are both within the same +/- 2GB memory area.

    This however is not true for s390 where the kernel image resides within
    the first 2GB (DMA memory area), but the module area is far away in the
    vmalloc area, usually somewhere close below the 4TB area.

    For new pc relative instructions s390 needs insn slots that are within
    +/- 2GB of each area. That way we can patch displacements of
    pc-relative instructions within the insn slots just like x86 and
    powerpc.

    The module area works already with the normal insn slot allocator,
    however there is currently no way to get insn slots that are within the
    first 2GB on s390 (aka DMA area).

    Therefore this patch set modifies the kprobes insn slot cache code in
    order to allow specifying a custom allocator for the insn slot cache
    pages. In addition, architectures can now have private insn slot caches
    without the need to modify common code.

    Patch 1 unifies and simplifies the current insn and optinsn caches
    implementation. This is a preparation which allows adding more
    insn caches in a simple way.

    Patch 2 adds the possibility to specify a custom allocator.

    Patch 3 makes s390 use the new insn slot mechanisms and adds support for
    pc-relative instructions with long displacements.

    This patch (of 3):

    The two insn caches (insn and optinsn) each have their own mutex and
    alloc/free functions (get_[opt]insn_slot() / free_[opt]insn_slot()).

    Since there is the need for yet another insn cache which satisfies dma
    allocations on s390, unify and simplify the current implementation:

    - Move the per insn cache mutex into struct kprobe_insn_cache.
    - Move the alloc/free functions to kprobe.h so they are simply
    wrappers for the generic __get_insn_slot/__free_insn_slot functions.
    The implementation is done with a DEFINE_INSN_CACHE_OPS() macro
    which provides the alloc/free functions for each cache if needed.
    - move the struct kprobe_insn_cache to kprobe.h which allows to generate
    architecture specific insn slot caches outside of the core kprobes
    code.

    Signed-off-by: Heiko Carstens
    Cc: Masami Hiramatsu
    Cc: Ananth N Mavinakayanahalli
    Cc: Ingo Molnar
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • No functional changes, just comments.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Trivial. Remove the unnecessary "work = NULL" initialization and turn
    read_barrier_depends() into smp_read_barrier_depends() in
    task_work_cancel().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • As in commit f21afc25f9ed ("smp.h: Use local_irq_{save,restore}() in
    !SMP version of on_each_cpu()"), we don't want to enable irqs if they
    are not already enabled.

    I don't know of any bugs currently caused by this unconditional
    local_irq_enable(), but I want to use this function in MIPS/OCTEON early
    boot (when we have early_boot_irqs_disabled). This also makes this
    function have similar semantics to on_each_cpu() which is good in
    itself.
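
    The difference between unconditionally enabling irqs and restoring the
    previous state can be modelled with a plain flag. This is a user-space
    sketch only; sketch_irqs_enabled and the sketch_call_* helpers are
    hypothetical stand-ins, not kernel interfaces.

```c
#include <stdbool.h>

static bool sketch_irqs_enabled; /* models the CPU interrupt-enable state */

static void sketch_noop(void) { }

/* Old behaviour: irqs end up enabled, whatever the state was before. */
static void sketch_call_enable(void (*fn)(void))
{
    sketch_irqs_enabled = false;
    fn();
    sketch_irqs_enabled = true;
}

/* save/restore behaviour: the caller's state is preserved. */
static void sketch_call_restore(void (*fn)(void))
{
    bool flags = sketch_irqs_enabled;

    sketch_irqs_enabled = false;
    fn();
    sketch_irqs_enabled = flags;
}
```

    A caller that runs with irqs disabled (as in early boot) keeps them
    disabled after sketch_call_restore(), but not after
    sketch_call_enable().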

    Signed-off-by: David Daney
    Cc: Gilad Ben-Yossef
    Cc: Christoph Lameter
    Cc: Chris Metcalf
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Daney
     
  • At least on ARM no-MMU the extable is empty and so there is nothing to
    sort. So add a check for the table to be empty which effectively only
    changes that the misleading pr_notice is suppressed.

    Signed-off-by: Uwe Kleine-König
    Cc: Ingo Molnar
    Cc: David Daney
    Cc: "H. Peter Anvin"
    Cc: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
  • All of the other non-trivial !SMP versions of functions in smp.h are
    out-of-line in up.c. Move on_each_cpu() there as well.

    This allows us to get rid of the #include . The
    drawback is that this makes both the x86_64 and i386 defconfig !SMP
    kernels about 200 bytes larger each.

    Signed-off-by: David Daney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Daney
     
  • The SMP version of this function doesn't unconditionally enable irqs, so
    neither should this !SMP version. There are no known problems caused by
    this, but we make the change for consistency's sake.

    Signed-off-by: David Daney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Daney
     
  • As in commit f21afc25f9ed ("smp.h: Use local_irq_{save,restore}() in
    !SMP version of on_each_cpu()"), we don't want to enable irqs if they
    are not already enabled. There are currently no known problematical
    callers of these functions, but since it is a known failure pattern, we
    preemptively fix them.

    Since they are not trivial functions, make them non-inline by moving
    them to up.c. This also makes it so we don't have to fix #include
    dependencies for preempt_{disable,enable}.

    Signed-off-by: David Daney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Daney
     
  • When running with GENERIC_LOCKBREAK=y, the locking implementations emit
    calls to arch_{read,write,spin}_relax when spinning on a contended lock
    in order to allow architectures to favour the CPU owning the lock if
    possible.

    In reality, everybody apart from PowerPC and S390 just does cpu_relax()
    here, so make that the default behaviour and allow it to be overridden
    if required.
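
    An overridable default of this kind is commonly written with #ifndef.
    The following is a user-space sketch with hypothetical sketch_* names;
    a counter replaces the real relax instruction so the effect is
    observable.

```c
static int sketch_relax_calls;

/* Stand-in for cpu_relax(); just counts invocations. */
static void sketch_cpu_relax(void)
{
    sketch_relax_calls++;
}

/*
 * An architecture header could define sketch_arch_spin_relax before this
 * point to override the default.
 */
#ifndef sketch_arch_spin_relax
# define sketch_arch_spin_relax(lock) sketch_cpu_relax()
#endif

/* Bounded stand-in for spinning on a contended lock. */
static void sketch_spin_until(volatile int *lock, int spins)
{
    while (*lock && spins-- > 0)
        sketch_arch_spin_relax(lock);
}
```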

    Signed-off-by: Will Deacon
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     
  • When a failure occurs in hotplug_cfd(), the related resources need to
    be released, otherwise memory is leaked.

    Signed-off-by: Chen Gang
    Acked-by: Wang YanQing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • const has to use __initconst, not __initdata

    Signed-off-by: Andi Kleen
    Acked-by: David Howells
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • I found the following pattern that leads to interesting findings:

    grep -r "ret.*|=.*__put_user" *
    grep -r "ret.*|=.*__get_user" *
    grep -r "ret.*|=.*__copy" *

    The __put_user() calls are in compat_ioctl.c, ptrace compat and signal
    compat. Since those appear in compat code, we could probably expect the
    kernel addresses not to be reachable in the lower 32-bit range, so I
    think they might not be exploitable.

    For the "__get_user" cases, I don't think those are exploitable: the worst
    that can happen is that the kernel will copy kernel memory into in-kernel
    buffers, and will fail immediately afterward.

    The alpha csum_partial_copy_from_user() seems to be missing the
    access_ok() check entirely. The fix is inspired from x86. This could
    lead to an information leak on alpha. I also noticed that many architectures
    map csum_partial_copy_from_user() to csum_partial_copy_generic(), but I
    wonder if the latter is performing the access checks on every
    architecture.

    Signed-off-by: Mathieu Desnoyers
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Jens Axboe
    Cc: Oleg Nesterov
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     
  • Now hugepage migration is enabled, although restricted on pmd-based
    hugepages for now (due to lack of testing.) So we should allocate
    migratable hugepages from ZONE_MOVABLE if possible.

    This patch makes GFP flags in hugepage allocation dependent on migration
    support, not only the value of hugepages_treat_as_movable. It does not
    change the behavior for architectures which do not support hugepage
    migration.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Cc: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Use "zone_end_pfn()" instead of "zone->zone_start_pfn + zone->spanned_pages".
    Simplify the code, no functional change.
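
    For illustration, the helper amounts to the following. The struct below
    is a reduced user-space stand-in (struct sketch_zone and the sketch_*
    functions are hypothetical names); only the two fields named above are
    modelled.

```c
/* Reduced stand-in for the zone fields involved. */
struct sketch_zone {
    unsigned long zone_start_pfn;
    unsigned long spanned_pages;
};

/* First pfn past the end of the zone. */
static unsigned long sketch_zone_end_pfn(const struct sketch_zone *z)
{
    return z->zone_start_pfn + z->spanned_pages;
}

/* Range checks read more naturally with the helper. */
static int sketch_zone_spans_pfn(const struct sketch_zone *z, unsigned long pfn)
{
    return z->zone_start_pfn <= pfn && pfn < sketch_zone_end_pfn(z);
}
```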

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Xishi Qiu
    Cc: Cody P Schafer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Simple cleanup. Every user of vma_set_policy() does the same work, this
    looks a bit annoying imho. Add a new trivial helper which does
    mpol_dup() + vma_set_policy() to simplify the callers.

    Signed-off-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • do_fork() denies CLONE_THREAD | CLONE_PARENT if NEWUSER | NEWPID.

    Then later copy_process() denies CLONE_SIGHAND if the new process will
    be in a different pid namespace (task_active_pid_ns() doesn't match
    current->nsproxy->pid_ns).

    This looks confusing and inconsistent. CLONE_NEWPID is very similar to
    the case when ->pid_ns was already unshared, we want the same
    restrictions so copy_process() should also nack CLONE_PARENT.

    And it would be better to deny CLONE_NEWUSER && CLONE_SIGHAND as well
    just for consistency.

    Kill the "CLONE_NEWUSER | CLONE_NEWPID" check in do_fork() and change
    copy_process() to do the same check along with ->pid_ns check we already
    have.

    Signed-off-by: Oleg Nesterov
    Acked-by: Andy Lutomirski
    Cc: "Eric W. Biederman"
    Cc: Colin Walters
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Commit 8382fcac1b81 ("pidns: Outlaw thread creation after
    unshare(CLONE_NEWPID)") nacks CLONE_NEWPID if the forking process
    unshared pid_ns. This is correct but unnecessary, copy_pid_ns() does
    the same check.

    Remove the CLONE_NEWPID check to cleanup the code and prepare for the
    next change.

    Test-case:

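    #define _GNU_SOURCE
    #include <assert.h>
    #include <errno.h>
    #include <sched.h>
    #include <signal.h>
    #include <unistd.h>

    static int child(void *arg)
    {
            return 0;
    }

    static char stack[16 * 1024];

    int main(void)
    {
            pid_t pid;

            assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);

            /* the stack grows down, so pass a pointer into the middle */
            pid = clone(child, stack + sizeof(stack) / 2,
                        CLONE_NEWPID | SIGCHLD, NULL);
            assert(pid < 0 && errno == EINVAL);

            return 0;
    }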

    clone(CLONE_NEWPID) correctly fails with or without this change.

    Signed-off-by: Oleg Nesterov
    Acked-by: Andy Lutomirski
    Cc: "Eric W. Biederman"
    Cc: Colin Walters
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Commit 8382fcac1b81 ("pidns: Outlaw thread creation after
    unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared
    pid_ns, this obviously breaks vfork:

    #define _GNU_SOURCE
    #include <assert.h>
    #include <sched.h>
    #include <unistd.h>

    int main(void)
    {
            assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);
            assert(vfork() >= 0);
            _exit(0);
            return 0;
    }

    fails without this patch.

    Change this check to use CLONE_SIGHAND instead. This also forbids
    CLONE_THREAD automatically, and this is what the comment implies.

    We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it
    would be safer to not do this. The current check denies CLONE_SIGHAND
    implicitly and there is no reason to change this.

    Eric said "CLONE_SIGHAND is fine. CLONE_THREAD would be even better.
    Having shared signal handling between two different pid namespaces is
    the case that we are fundamentally guarding against."

    Signed-off-by: Oleg Nesterov
    Reported-by: Colin Walters
    Acked-by: Andy Lutomirski
    Reviewed-by: "Eric W. Biederman"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov