17 Sep, 2013

2 commits


14 Sep, 2013

1 commit

  • Pull aio changes from Ben LaHaise:
    "First off, sorry for this pull request being late in the merge window.
    Al had raised a couple of concerns about 2 items in the series below.
    I addressed the first issue (the race introduced by Gu's use of
    mm_populate()), but he has not provided any further details on how he
    wants to rework the anon_inode.c changes (which were sent out months
    ago but have yet to be commented on).

    The bulk of the changes have been sitting in the -next tree for a few
    months, with all the issues raised being addressed"

    * git://git.kvack.org/~bcrl/aio-next: (22 commits)
    aio: rcu_read_lock protection for new rcu_dereference calls
    aio: fix race in ring buffer page lookup introduced by page migration support
    aio: fix rcu sparse warnings introduced by ioctx table lookup patch
    aio: remove unnecessary debugging from aio_free_ring()
    aio: table lookup: verify ctx pointer
    staging/lustre: kiocb->ki_left is removed
    aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
    aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
    aio: convert the ioctx list to table lookup v3
    aio: double aio_max_nr in calculations
    aio: Kill ki_dtor
    aio: Kill ki_users
    aio: Kill unneeded kiocb members
    aio: Kill aio_rw_vect_retry()
    aio: Don't use ctx->tail unnecessarily
    aio: io_cancel() no longer returns the io_event
    aio: percpu ioctx refcount
    aio: percpu reqs_available
    aio: reqs_active -> reqs_available
    aio: fix build when migration is disabled
    ...

    Linus Torvalds
     

13 Sep, 2013

10 commits

  • After the last architecture switched to generic hard irqs the config
    options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
    for !CONFIG_GENERIC_HARDIRQS can be removed.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • Merge more patches from Andrew Morton:
    "The rest of MM. Plus one misc cleanup"

    * emailed patches from Andrew Morton: (35 commits)
    mm/Kconfig: add MMU dependency for MIGRATION.
    kernel: replace strict_strto*() with kstrto*()
    mm, thp: count thp_fault_fallback anytime thp fault fails
    thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
    thp: do_huge_pmd_anonymous_page() cleanup
    thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
    mm: cleanup add_to_page_cache_locked()
    thp: account anon transparent huge pages into NR_ANON_PAGES
    truncate: drop 'oldsize' truncate_pagecache() parameter
    mm: make lru_add_drain_all() selective
    memcg: document cgroup dirty/writeback memory statistics
    memcg: add per cgroup writeback pages accounting
    memcg: check for proper lock held in mem_cgroup_update_page_stat
    memcg: remove MEMCG_NR_FILE_MAPPED
    memcg: reduce function dereference
    memcg: avoid overflow caused by PAGE_ALIGN
    memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
    memcg: correct RESOURCE_MAX to ULLONG_MAX
    mm: memcg: do not trap chargers with full callstack on OOM
    mm: memcg: rework and document OOM waiting and wakeup
    ...

    Linus Torvalds
     
  • The usage of strict_strto*() is not preferred, because strict_strto*() is
    obsolete. Thus, kstrto*() should be used.

    Signed-off-by: Jingoo Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jingoo Han
     
  • This function dereferences res far too often, so optimize it.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Jeff Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • PAGE_ALIGN rounds up to the next page boundary, so the aligned value
    can overflow, for example when writing the MAX value to
    *.limit_in_bytes.

    $ cat /cgroup/memory/memory.limit_in_bytes
    18446744073709551615

    # echo 18446744073709551615 > /cgroup/memory/memory.limit_in_bytes
    bash: echo: write error: Invalid argument

    Some user programs might depend on being able to write back the value
    they read (libcg, for example, reads the value in a snapshot and later
    uses it to reset the cgroup), and the failure causes confusion. So we
    need to fix it.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Jeff Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
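
    The wrap-around described in this commit can be reproduced in user space.
    The following is a minimal sketch, assuming a 4096-byte page and a 64-bit
    unsigned long long. SKETCH_PAGE_SIZE, SKETCH_PAGE_ALIGN and
    page_align_checked() are hypothetical names for illustration, not the
    kernel's macros and not necessarily the fix this commit applies.

```c
#include <limits.h>

/* Assumed page size for illustration; the real value is per architecture. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Round x up to the next page boundary, as an align-up macro does. */
#define SKETCH_PAGE_ALIGN(x) \
	(((x) + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1))

/*
 * Hypothetical guard: if rounding up wrapped around (the result is smaller
 * than the input), keep the input instead of the wrapped value.
 */
static unsigned long long page_align_checked(unsigned long long val)
{
	unsigned long long aligned = SKETCH_PAGE_ALIGN(val);

	return aligned < val ? val : aligned;
}
```

    With these definitions SKETCH_PAGE_ALIGN(ULLONG_MAX) evaluates to 0,
    which is why writing the maximum value back was rejected, while
    page_align_checked(ULLONG_MAX) returns ULLONG_MAX unchanged.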
     
  • RESOURCE_MAX is far too general a name; change it to RES_COUNTER_MAX.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Jeff Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • Pull vfs pile 4 from Al Viro:
    "list_lru pile, mostly"

    This came out of Andrew's pile, Al ended up doing the merge work so that
    Andrew didn't have to.

    Additionally, a few fixes.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
    super: fix for destroy lrus
    list_lru: dynamically adjust node arrays
    shrinker: Kill old ->shrink API.
    shrinker: convert remaining shrinkers to count/scan API
    staging/lustre/libcfs: cleanup linux-mem.h
    staging/lustre/ptlrpc: convert to new shrinker API
    staging/lustre/obdclass: convert lu_object shrinker to count/scan API
    staging/lustre/ldlm: convert to shrinkers to count/scan API
    hugepage: convert huge zero page shrinker to new shrinker API
    i915: bail out earlier when shrinker cannot acquire mutex
    drivers: convert shrinkers to new count/scan API
    fs: convert fs shrinkers to new scan/count API
    xfs: fix dquot isolation hang
    xfs-convert-dquot-cache-lru-to-list_lru-fix
    xfs: convert dquot cache lru to list_lru
    xfs: rework buffer dispose list tracking
    xfs-convert-buftarg-lru-to-generic-code-fix
    xfs: convert buftarg LRU to generic code
    fs: convert inode and dentry shrinking to be node aware
    vmscan: per-node deferred work
    ...

    Linus Torvalds
     
  • Pull ACPI and power management fixes from Rafael Wysocki:
    "All of these commits are fixes that have emerged recently and some of
    them fix bugs introduced during this merge window.

    Specifics:

    1) ACPI-based PCI hotplug (ACPIPHP) fixes related to spurious events

    After the recent ACPIPHP changes we've seen some interesting
    breakage on a system that triggers device check notifications
    during boot for non-existing devices. Although those
    notifications are really spurious, we should be able to deal with
    them nevertheless and that shouldn't introduce too much overhead.
    Four commits to make that work properly.

    2) Memory hotplug and hibernation mutual exclusion rework

    This was meant to be a cleanup, but it happens to fix a classical
    ABBA deadlock between system suspend/hibernation and ACPI memory
    hotplug which is possible if they are started roughly at the same
    time. Three commits rework memory hotplug so that it doesn't
    acquire pm_mutex and make hibernation use device_hotplug_lock
    which prevents it from racing with memory hotplug.

    3) ACPI Intel LPSS (Low-Power Subsystem) driver crash fix

    The ACPI LPSS driver crashes during boot on Apple Macbook Air with
    Haswell that has slightly unusual BIOS configuration in which one
    of the LPSS device's _CRS method doesn't return all of the
    information expected by the driver. Fix from Mika Westerberg, for
    stable.

    4) ACPICA fix related to Store->ArgX operation

    AML interpreter fix for obscure breakage that causes AML to be
    executed incorrectly on some machines (observed in practice).
    From Bob Moore.

    5) ACPI core fix for PCI ACPI device objects lookup

    There still are cases in which there is more than one ACPI device
    object matching a given PCI device and we don't choose the one
    that the BIOS expects us to choose, so this makes the lookup take
    more criteria into account in those cases.

    6) Fix to prevent cpuidle from crashing in some rare cases

    If the result of cpuidle_get_driver() is NULL, which can happen on
    some systems, cpuidle_driver_ref() will crash trying to use that
    pointer and Daniel Fu's fix prevents that from happening.

    7) cpufreq fixes related to CPU hotplug

    Stephen Boyd reported a number of concurrency problems with
    cpufreq related to CPU hotplug which are addressed by a series of
    fixes from Srivatsa S Bhat and Viresh Kumar.

    8) cpufreq fix for time conversion in time_in_state attribute

    Time conversion carried out by cpufreq when user space attempts to
    read /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state
    won't work correctly if cputime_t doesn't map directly to jiffies.
    Fix from Andreas Schwab.

    9) Revert of a troublesome cpufreq commit

    Commit 7c30ed5 (cpufreq: make sure frequency transitions are
    serialized) was intended to address some known concurrency
    problems in cpufreq related to the ordering of transitions, but
    unfortunately it introduced several problems of its own, so I
    decided to revert it now and address the original problems later
    in a more robust way.

    10) Intel Haswell CPU models for intel_pstate from Nell Hardcastle.

    11) cpufreq fixes related to system suspend/resume

    The recent cpufreq changes that made it preserve CPU sysfs
    attributes over suspend/resume cycles introduced a possible NULL
    pointer dereference that caused it to crash during the second
    attempt to suspend. Three commits from Srivatsa S Bhat fix that
    problem and a couple of related issues.

    12) cpufreq locking fix

    cpufreq_policy_restore() should acquire the lock for reading, but
    it acquires it for writing. Fix from Lan Tianyu"

    * tag 'pm+acpi-fixes-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (25 commits)
    cpufreq: Acquire the lock in cpufreq_policy_restore() for reading
    cpufreq: Prevent problems in update_policy_cpu() if last_cpu == new_cpu
    cpufreq: Restructure if/else block to avoid unintended behavior
    cpufreq: Fix crash in cpufreq-stats during suspend/resume
    intel_pstate: Add Haswell CPU models
    Revert "cpufreq: make sure frequency transitions are serialized"
    cpufreq: Use signed type for 'ret' variable, to store negative error values
    cpufreq: Remove temporary fix for race between CPU hotplug and sysfs-writes
    cpufreq: Synchronize the cpufreq store_*() routines with CPU hotplug
    cpufreq: Invoke __cpufreq_remove_dev_finish() after releasing cpu_hotplug.lock
    cpufreq: Split __cpufreq_remove_dev() into two parts
    cpufreq: Fix wrong time unit conversion
    cpufreq: serialize calls to __cpufreq_governor()
    cpufreq: don't allow governor limits to be changed when it is disabled
    ACPI / bind: Prefer device objects with _STA to those without it
    ACPI / hotplug / PCI: Avoid parent bus rescans on spurious device checks
    ACPI / hotplug / PCI: Use _OST to notify firmware about notify status
    ACPI / hotplug / PCI: Avoid doing too much for spurious notifies
    ACPICA: Fix for a Store->ArgX when ArgX contains a reference to a field.
    ACPI / hotplug / PCI: Don't trim devices before scanning the namespace
    ...

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Various fixes.

    The -g perf report lockup you reported is only partially addressed,
    patches that fix the excessive runtime are still being worked on"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86: Fix uncore PCI fixed counter handling
    uprobes: Fix utask->depth accounting in handle_trampoline()
    perf/x86: Add constraint for IVB CYCLE_ACTIVITY:CYCLES_LDM_PENDING
    perf: Fix up MMAP2 buffer space reservation
    perf tools: Add attr->mmap2 support
    perf kvm: Fix sample_type manipulation
    perf evlist: Fix id pos in perf_evlist__open()
    perf trace: Handle perf.data files with no tracepoints
    perf session: Separate progress bar update when processing events
    perf trace: Check if MAP_32BIT is defined
    perf hists: Fix formatting of long symbol names
    perf evlist: Fix parsing with no sample_id_all bit set
    perf tools: Add test for parsing with no sample_id_all bit
    perf trace: Check control+C more often

    Linus Torvalds
     
  • Pull scheduler fix from Ingo Molnar:
    "Performance regression fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Fix load balancing performance regression in should_we_balance()

    Linus Torvalds
     

12 Sep, 2013

23 commits

  • Currently utask->depth is simply the number of allocated/pending
    return_instance's in uprobe_task->return_instances list.

    handle_trampoline() should decrement this counter every time we
    handle/free an instance, but due to typo it does this only if
    ->chained == T. This means that in the likely case this counter
    is never decremented and the probed task can't report more than
    MAX_URETPROBE_DEPTH events.

    Reported-by: Mikhail Kulemin
    Reported-by: Hemant Kumar Shaw
    Signed-off-by: Oleg Nesterov
    Acked-by: Anton Arapov
    Cc: masami.hiramatsu.pt@hitachi.com
    Cc: srikar@linux.vnet.ibm.com
    Cc: systemtap@sourceware.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20130911154726.GA8093@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
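
    The accounting problem can be modelled in a few lines. This is a
    simplified sketch, assuming the typo amounts to decrementing the counter
    only after the early exit for unchained instances; the commit message
    does not show the actual handle_trampoline() code. struct ret_instance,
    consume_buggy() and consume_fixed() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a pending return instance; not the kernel structure. */
struct ret_instance {
	bool chained;
	struct ret_instance *next;
};

/* Buggy ordering: the counter is only touched when the instance is chained. */
static int consume_buggy(const struct ret_instance *ri, int depth)
{
	while (ri) {
		bool chained = ri->chained;

		ri = ri->next;
		if (!chained)
			break;
		depth--;
	}
	return depth;
}

/* Fixed ordering: every consumed instance decrements the counter. */
static int consume_fixed(const struct ret_instance *ri, int depth)
{
	while (ri) {
		bool chained = ri->chained;

		ri = ri->next;
		depth--;
		if (!chained)
			break;
	}
	return depth;
}
```

    For a single unchained instance, which is the likely case, the buggy
    variant returns the depth unchanged, so the counter only ever grows
    until it reaches the limit.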
     
  • Since the panic handlers may produce additional information (via printk)
    for the kernel log, it should be reported as part of the panic output
    saved by kmsg_dump(). Without this re-ordering, nothing that adds
    information to a panic will show up in pstore's view when kmsg_dump runs,
    and is therefore not visible to crash reporting tools that examine pstore
    output.

    Signed-off-by: Kees Cook
    Cc: Anton Vorontsov
    Cc: Colin Cross
    Acked-by: Tony Luck
    Cc: Stephen Boyd
    Cc: Vikram Mulukutla
    Cc: Peter Zijlstra
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Code can not run here forever, so remove the unnecessary return.

    Signed-off-by: Xishi Qiu
    Suggested-by: Zhang Yanfei
    Reviewed-by: Simon Horman
    Reviewed-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • __ptrace_may_access() checks get_dumpable/ptrace_has_cap/etc if task !=
    current, this can lead to surprising results.

    For example, a sub-thread can't readlink("/proc/self/exe") if the
    executable is not readable. setup_new_exec()->would_dump() notices that
    inode_permission(MAY_READ) fails and then it does
    set_dumpable(suid_dumpable). After that get_dumpable() fails.

    (It is not clear why proc_pid_readlink() checks get_dumpable(), perhaps we
    could add PTRACE_MODE_NODUMPABLE)

    Change __ptrace_may_access() to use same_thread_group() instead of "task
    == current". Any security check is pointless when the tasks share the
    same ->mm.

    Signed-off-by: Mark Grondona
    Signed-off-by: Ben Woodard
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Grondona
     
  • The current two insn slot caches both use module_alloc/module_free to
    allocate and free insn slot cache pages.

    For s390 this is not sufficient since there is the need to allocate insn
    slots that are either within the vmalloc module area or within dma memory.

    Therefore add a mechanism that allows specifying a custom allocator for
    each insn slot cache.

    Signed-off-by: Heiko Carstens
    Acked-by: Masami Hiramatsu
    Cc: Ananth N Mavinakayanahalli
    Cc: Ingo Molnar
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • The current kprobes insn caches allocate memory areas for insn slots
    with module_alloc(). The assumption is that the kernel image and module
    area are both within the same +/- 2GB memory area.

    This however is not true for s390 where the kernel image resides within
    the first 2GB (DMA memory area), but the module area is far away in the
    vmalloc area, usually somewhere close below the 4TB area.

    For new pc relative instructions s390 needs insn slots that are within
    +/- 2GB of each area. That way we can patch displacements of
    pc-relative instructions within the insn slots just like x86 and
    powerpc.

    The module area works already with the normal insn slot allocator,
    however there is currently no way to get insn slots that are within the
    first 2GB on s390 (aka DMA area).

    Therefore this patch set modifies the kprobes insn slot cache code to
    allow specifying a custom allocator for the insn slot cache pages. In
    addition, architectures can now have private insn slot caches without
    the need to modify common code.

    Patch 1 unifies and simplifies the current insn and optinsn caches
    implementation. This is preparation that makes it simple to add more
    insn caches.

    Patch 2 adds the possibility to specify a custom allocator.

    Patch 3 makes s390 use the new insn slot mechanisms and adds support for
    pc-relative instructions with long displacements.

    This patch (of 3):

    The two insn caches (insn, and optinsn) each have their own mutex and
    alloc/free functions (get_[opt]insn_slot() / free_[opt]insn_slot()).

    Since there is a need for yet another insn cache which satisfies dma
    allocations on s390, unify and simplify the current implementation:

    - Move the per insn cache mutex into struct kprobe_insn_cache.
    - Move the alloc/free functions to kprobe.h so they are simply
      wrappers for the generic __get_insn_slot/__free_insn_slot functions.
      The implementation is done with a DEFINE_INSN_CACHE_OPS() macro
      which provides the alloc/free functions for each cache if needed.
    - Move the struct kprobe_insn_cache to kprobe.h, which allows generating
      architecture-specific insn slot caches outside of the core kprobes
      code.

    Signed-off-by: Heiko Carstens
    Cc: Masami Hiramatsu
    Cc: Ananth N Mavinakayanahalli
    Cc: Ingo Molnar
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
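
    The macro-generated wrapper idea can be sketched in user space. This is
    a toy model, assuming malloc()/free() as the pluggable allocator and
    omitting the per-cache mutex; struct slot_cache, generic_get_slot(),
    generic_free_slot() and DEFINE_SLOT_CACHE_OPS() are hypothetical names
    and do not reproduce the kernel's definitions.

```c
#include <stdlib.h>

#define SLOT_SIZE 64	/* arbitrary slot size for the sketch */

/* Toy per-cache descriptor with a pluggable allocator. */
struct slot_cache {
	void *(*alloc)(size_t size);	/* custom allocation routine */
	void (*release)(void *slot);	/* matching free routine */
	int nr_in_use;			/* slots currently handed out */
};

/* Generic helpers shared by every cache. */
static void *generic_get_slot(struct slot_cache *c)
{
	void *slot = c->alloc(SLOT_SIZE);

	if (slot)
		c->nr_in_use++;
	return slot;
}

static void generic_free_slot(struct slot_cache *c, void *slot)
{
	if (!slot)
		return;
	c->release(slot);
	c->nr_in_use--;
}

/* Generate a cache object plus thin get/free wrappers for it. */
#define DEFINE_SLOT_CACHE_OPS(name)					\
static struct slot_cache name##_cache = { malloc, free, 0 };		\
static void *get_##name##_slot(void)					\
{									\
	return generic_get_slot(&name##_cache);				\
}									\
static void free_##name##_slot(void *slot)				\
{									\
	generic_free_slot(&name##_cache, slot);				\
}

DEFINE_SLOT_CACHE_OPS(insn)
DEFINE_SLOT_CACHE_OPS(optinsn)
```

    Adding a further cache is then a single macro invocation, and a cache
    with different placement requirements would only need different alloc
    and release pointers in its descriptor.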
     
  • No functional changes, just comments.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Trivial. Remove the unnecessary "work = NULL" initialization and turn
    read_barrier_depends() into smp_read_barrier_depends() in
    task_work_cancel().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • As in commit f21afc25f9ed ("smp.h: Use local_irq_{save,restore}() in
    !SMP version of on_each_cpu()"), we don't want to enable irqs if they
    are not already enabled.

    I don't know of any bugs currently caused by this unconditional
    local_irq_enable(), but I want to use this function in MIPS/OCTEON early
    boot (when we have early_boot_irqs_disabled). This also makes this
    function have similar semantics to on_each_cpu() which is good in
    itself.

    Signed-off-by: David Daney
    Cc: Gilad Ben-Yossef
    Cc: Christoph Lameter
    Cc: Chris Metcalf
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Daney
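
    The difference between the two patterns can be shown with a user-space
    model. This is only a sketch: a plain boolean stands in for the CPU
    interrupt flag, and irqs_enabled, count_call(), call_disable_enable()
    and call_save_restore() are hypothetical names, not kernel functions.

```c
#include <stdbool.h>

/* Stand-in for the CPU interrupt-enable flag; purely illustrative. */
static bool irqs_enabled;

/* Sample callback: counts how often it was invoked. */
static void count_call(void *info)
{
	++*(int *)info;
}

/* Old pattern: disable, call, then enable unconditionally. */
static void call_disable_enable(void (*func)(void *), void *info)
{
	irqs_enabled = false;
	func(info);
	irqs_enabled = true;	/* wrong if the caller had irqs off */
}

/* Save/restore pattern: put the flag back the way it was found. */
static void call_save_restore(void (*func)(void *), void *info)
{
	bool saved = irqs_enabled;

	irqs_enabled = false;
	func(info);
	irqs_enabled = saved;
}
```

    A caller that enters with the flag cleared, as in early boot, leaves
    call_disable_enable() with it set, whereas call_save_restore() returns
    with it still cleared.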
     
  • At least on ARM no-MMU the extable is empty and so there is nothing to
    sort. So add a check for the table to be empty which effectively only
    changes that the misleading pr_notice is suppressed.

    Signed-off-by: Uwe Kleine-König
    Cc: Ingo Molnar
    Cc: David Daney
    Cc: "H. Peter Anvin"
    Cc: Borislav Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
  • All of the other non-trivial !SMP versions of functions in smp.h are
    out-of-line in up.c. Move on_each_cpu() there as well.

    This allows us to get rid of the #include . The
    drawback is that this makes both the x86_64 and i386 defconfig !SMP
    kernels about 200 bytes larger each.

    Signed-off-by: David Daney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Daney
     
  • The SMP version of this function doesn't unconditionally enable irqs, so
    neither should this !SMP version. There are no known problems caused by
    this, but we make the change for consistency's sake.

    Signed-off-by: David Daney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Daney
     
  • As in commit f21afc25f9ed ("smp.h: Use local_irq_{save,restore}() in
    !SMP version of on_each_cpu()"), we don't want to enable irqs if they
    are not already enabled. There are currently no known problematical
    callers of these functions, but since it is a known failure pattern, we
    preemptively fix them.

    Since they are not trivial functions, make them non-inline by moving
    them to up.c. This also makes it so we don't have to fix #include
    dependencies for preempt_{disable,enable}.

    Signed-off-by: David Daney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Daney
     
  • When running with GENERIC_LOCKBREAK=y, the locking implementations emit
    calls to arch_{read,write,spin}_relax when spinning on a contended lock
    in order to allow architectures to favour the CPU owning the lock if
    possible.

    In reality, everybody apart from PowerPC and S390 just does cpu_relax()
    here, so make that the default behaviour and allow it to be overridden
    if required.

    Signed-off-by: Will Deacon
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
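
    The "default unless overridden" arrangement is a common preprocessor
    pattern. A minimal user-space sketch follows, assuming a counter in
    place of the real per-architecture cpu_relax(); relax_calls and
    spin_until_clear() are hypothetical and exist only so the effect can be
    observed.

```c
/* Counter standing in for the CPU hint; the real cpu_relax() is per-arch. */
static int relax_calls;
#define cpu_relax() ((void)++relax_calls)

/* An architecture that wants special behaviour defines this first. */
#ifndef arch_spin_relax
# define arch_spin_relax(lock) cpu_relax()
#endif

/* Hypothetical helper: poll a flag up to max_spins times, relaxing between. */
static int spin_until_clear(const volatile int *locked, int max_spins)
{
	int spins = 0;

	while (*locked && spins < max_spins) {
		arch_spin_relax(locked);
		spins++;
	}
	return spins;
}
```

    Because the fallback sits behind #ifndef, an architecture only has to
    provide its own arch_spin_relax() definition ahead of this point to
    replace the default.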
     
  • When a failure occurs in hotplug_cfd(), the related resources need to be
    released, otherwise memory is leaked.

    Signed-off-by: Chen Gang
    Acked-by: Wang YanQing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • const has to use __initconst, not __initdata

    Signed-off-by: Andi Kleen
    Acked-by: David Howells
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • I found the following pattern that leads to interesting findings:

    grep -r "ret.*|=.*__put_user" *
    grep -r "ret.*|=.*__get_user" *
    grep -r "ret.*|=.*__copy" *

    The __put_user() calls are in compat_ioctl.c, ptrace compat and signal
    compat. Since those appear in compat code, we could probably expect the
    kernel addresses not to be reachable in the lower 32-bit range, so I
    think they might not be exploitable.

    For the "__get_user" cases, I don't think those are exploitable: the worst
    that can happen is that the kernel will copy kernel memory into in-kernel
    buffers, and will fail immediately afterward.

    The alpha csum_partial_copy_from_user() seems to be missing the
    access_ok() check entirely. The fix is inspired from x86. This could
    lead to information leak on alpha. I also noticed that many architectures
    map csum_partial_copy_from_user() to csum_partial_copy_generic(), but I
    wonder if the latter is performing the access checks on every
    architecture.

    Signed-off-by: Mathieu Desnoyers
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Jens Axboe
    Cc: Oleg Nesterov
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     
  • Now hugepage migration is enabled, although restricted on pmd-based
    hugepages for now (due to lack of testing.) So we should allocate
    migratable hugepages from ZONE_MOVABLE if possible.

    This patch makes GFP flags in hugepage allocation dependent on migration
    support, not only the value of hugepages_treat_as_movable. It provides no
    change on the behavior for architectures which do not support hugepage
    migration.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Cc: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Use "zone_end_pfn()" instead of "zone->zone_start_pfn + zone->spanned_pages".
    Simplify the code, no functional change.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Xishi Qiu
    Cc: Cody P Schafer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
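
    The helper is a one-line accessor for the expression the commit message
    quotes. The following is a sketch using a toy structure that carries
    only the two fields named above; struct zone_sketch,
    zone_end_pfn_sketch() and zone_spans_pfn_sketch() are hypothetical
    stand-ins, not the kernel's definitions.

```c
/* Toy stand-in carrying only the two fields the commit message names. */
struct zone_sketch {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
};

/* First page frame number past the end of the zone. */
static unsigned long zone_end_pfn_sketch(const struct zone_sketch *zone)
{
	return zone->zone_start_pfn + zone->spanned_pages;
}

/* Example caller: does pfn fall inside [start, end)? */
static int zone_spans_pfn_sketch(const struct zone_sketch *zone,
				 unsigned long pfn)
{
	return pfn >= zone->zone_start_pfn && pfn < zone_end_pfn_sketch(zone);
}
```

    Callers that previously spelled out the sum read more clearly with the
    accessor, and the half-open range convention is stated in one place.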
     
  • Simple cleanup. Every user of vma_set_policy() does the same work, which
    looks a bit annoying imho. Add the new trivial helper which does
    mpol_dup() + vma_set_policy() to simplify the callers.

    Signed-off-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • do_fork() denies CLONE_THREAD | CLONE_PARENT if NEWUSER | NEWPID.

    Then later copy_process() denies CLONE_SIGHAND if the new process will
    be in a different pid namespace (task_active_pid_ns() doesn't match
    current->nsproxy->pid_ns).

    This looks confusing and inconsistent. CLONE_NEWPID is very similar to
    the case when ->pid_ns was already unshared, we want the same
    restrictions so copy_process() should also nack CLONE_PARENT.

    And it would be better to deny CLONE_NEWUSER && CLONE_SIGHAND as well
    just for consistency.

    Kill the "CLONE_NEWUSER | CLONE_NEWPID" check in do_fork() and change
    copy_process() to do the same check along with ->pid_ns check we already
    have.

    Signed-off-by: Oleg Nesterov
    Acked-by: Andy Lutomirski
    Cc: "Eric W. Biederman"
    Cc: Colin Walters
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
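
    The combined check described in this commit can be written as a small
    predicate. This is a sketch of the logic only, assuming the check takes
    the form "reject SIGHAND or PARENT when a new user/pid namespace is
    requested or the pid namespace was already unshared". The SK_CLONE_*
    bit values and check_clone_flags() are hypothetical, and the boolean
    parameter stands in for the ->pid_ns comparison.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative flag bits for this sketch only. */
#define SK_CLONE_SIGHAND 0x1UL
#define SK_CLONE_PARENT  0x2UL
#define SK_CLONE_NEWUSER 0x4UL
#define SK_CLONE_NEWPID  0x8UL

/*
 * Return 0 if the flag combination is acceptable, -EINVAL otherwise.
 * pid_ns_unshared models "the new process would land in a different
 * pid namespace than the caller is running in".
 */
static int check_clone_flags(unsigned long flags, bool pid_ns_unshared)
{
	if (flags & (SK_CLONE_SIGHAND | SK_CLONE_PARENT)) {
		if ((flags & (SK_CLONE_NEWUSER | SK_CLONE_NEWPID)) ||
		    pid_ns_unshared)
			return -EINVAL;
	}
	return 0;
}
```

    Keeping both conditions in one place means a request for a new
    namespace and an already unshared namespace are restricted in the same
    way, which is the consistency the commit message asks for.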
     
  • Commit 8382fcac1b81 ("pidns: Outlaw thread creation after
    unshare(CLONE_NEWPID)") nacks CLONE_NEWPID if the forking process
    unshared pid_ns. This is correct but unnecessary, copy_pid_ns() does
    the same check.

    Remove the CLONE_NEWPID check to cleanup the code and prepare for the
    next change.

    Test-case:

    #define _GNU_SOURCE
    #include <assert.h>
    #include <errno.h>
    #include <sched.h>
    #include <signal.h>
    #include <unistd.h>

    static int child(void *arg)
    {
            return 0;
    }

    static char stack[16 * 1024];

    int main(void)
    {
            pid_t pid;

            assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);

            pid = clone(child, stack + sizeof(stack) / 2,
                        CLONE_NEWPID | SIGCHLD, NULL);
            assert(pid < 0 && errno == EINVAL);

            return 0;
    }

    clone(CLONE_NEWPID) correctly fails with or without this change.

    Signed-off-by: Oleg Nesterov
    Acked-by: Andy Lutomirski
    Cc: "Eric W. Biederman"
    Cc: Colin Walters
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Commit 8382fcac1b81 ("pidns: Outlaw thread creation after
    unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared
    pid_ns, this obviously breaks vfork:

    #define _GNU_SOURCE
    #include <assert.h>
    #include <sched.h>
    #include <unistd.h>

    int main(void)
    {
            assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0);
            assert(vfork() >= 0);
            _exit(0);
            return 0;
    }

    fails without this patch.

    Change this check to use CLONE_SIGHAND instead. This also forbids
    CLONE_THREAD automatically, and this is what the comment implies.

    We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it
    would be safer to not do this. The current check denies CLONE_SIGHAND
    implicitly and there is no reason to change this.

    Eric said "CLONE_SIGHAND is fine. CLONE_THREAD would be even better.
    Having shared signal handling between two different pid namespaces is
    the case that we are fundamentally guarding against."

    Signed-off-by: Oleg Nesterov
    Reported-by: Colin Walters
    Acked-by: Andy Lutomirski
    Reviewed-by: "Eric W. Biederman"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

11 Sep, 2013

4 commits

  • The ino_generation field was added in the PERF_RECORD_MMAP2 record in
    the 13d7a24 cset but no space for it was allocated, corrupting the
    PERF_FORMAT_{TIME,CPU,TID,etc} area (sample_type/sample_id_all), fix it.

    Detected with one of the regression tests done by 'perf test':

    [root@sandy ~]# perf test -v 7
    7: Validate PERF_RECORD_* events & perf_sample fields :
    --- start ---
    61315294449606 0 PERF_RECORD_SAMPLE
    61315294453161 0 PERF_RECORD_SAMPLE
    61315294454441 0 PERF_RECORD_SAMPLE
    61315294455709 0 PERF_RECORD_SAMPLE
    61315295600899 0 PERF_RECORD_COMM: sleep:6500
    27917287430500 342521613 PERF_RECORD_MMAP2 6500/6500: [0x400000(0x7000) @ 0 00:1d 311442 9016]: /usr/bin/sleep
    MMAP2 going backwards in time, prev=61315295600899, curr=27917287430500
    MMAP2 with unexpected cpu, expected 0, got 342521613
    MMAP2 with unexpected pid, expected 6500, got 1701606191
    MMAP2 with unexpected tid, expected 6500, got 28773
    27917287430500 342561333 PERF_RECORD_MMAP2 6500/6500: [0x3b7e000000(0x223000) @ 0 00:1d 309186 9016]: /usr/lib64/ld-2.16.so
    MMAP2 with unexpected cpu, expected 0, got 342561333
    MMAP2 with unexpected pid, expected 6500, got 1932408369
    MMAP2 with unexpected tid, expected 6500, got 111
    27917287430500 342600095 PERF_RECORD_MMAP2 6500/6500: [0x7fffbd7dc000(0x1000) @ 0x7fffbd7dc000 00:00 0 0]: [vdso]
    MMAP2 with unexpected cpu, expected 0, got 342600095
    MMAP2 with unexpected pid, expected 6500, got 1935963739
    MMAP2 with unexpected tid, expected 6500, got 23919
    27917287430500 342882834 PERF_RECORD_MMAP2 6500/6500: [0x3b7e400000(0x3b8000) @ 0 00:1d 309187 9016]: /usr/lib64/libc-2.16.so
    MMAP2 with unexpected cpu, expected 0, got 342882834
    MMAP2 with unexpected pid, expected 6500, got 909192754
    MMAP2 with unexpected tid, expected 6500, got 7303982
    61316297195411 0 PERF_RECORD_EXIT(6500:6500):(6500:6500)
    ---- end ----
    Validate PERF_RECORD_* events & perf_sample fields: FAILED!
    [root@sandy ~]#

    After this patch:

    [root@sandy ~]# perf test 7
    7: Validate PERF_RECORD_* events & perf_sample fields : Ok
    [root@sandy ~]#

    Acked-by: Peter Zijlstra
    Acked-by: Stephane Eranian
    Cc: Adrian Hunter
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Link: http://lkml.kernel.org/n/tip-heeuv986b8ha7whqg4o3he7c@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     
  • This series reworks our current object cache shrinking infrastructure in
    two main ways:

    * Noticing that a lot of users copy and paste their own version of LRU
    lists for objects, we put some effort in providing a generic version.
    It is modeled after the filesystem users: dentries, inodes, and xfs
    (for various tasks), but we expect that other users could benefit in
    the near future with little or no modification. Let us know if you
    have any issues.

    * The underlying list_lru being proposed automatically and
    transparently keeps the elements in per-node lists, and is able to
    manipulate the node lists individually. Given this infrastructure, we
    are able to modify the up-to-now hammer called shrink_slab to proceed
    with node-reclaim instead of always searching memory from all over like
    it has been doing.

    Per-node lru lists are also expected to lead to less contention in the lru
    locks on multi-node scans, since we are now no longer fighting for a
    global lock. The locks usually disappear from the profilers with this
    change.
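
    The per-node idea described above can be illustrated with a minimal
    userspace sketch. This is not the list_lru code from the series: the
    names sketch_lru, node_lru, list_node and MAX_NODES are hypothetical,
    and the per-node locks are only mentioned in comments. It shows items
    kept on one list per node and a shrink pass that touches a single
    node's list.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4 /* hypothetical node count for this sketch */

/* Minimal circular doubly linked list, in the style of the kernel's list_head. */
struct list_node {
    struct list_node *prev, *next;
};

/* One LRU per NUMA node; a real implementation would pair each with its own lock. */
struct node_lru {
    struct list_node head;
    long nr_items;
};

struct sketch_lru {
    struct node_lru node[MAX_NODES];
};

static void sketch_lru_init(struct sketch_lru *lru)
{
    for (int i = 0; i < MAX_NODES; i++) {
        lru->node[i].head.prev = &lru->node[i].head;
        lru->node[i].head.next = &lru->node[i].head;
        lru->node[i].nr_items = 0;
    }
}

/* Add an item at the tail (most recently used end) of its node's list. */
static void sketch_lru_add(struct sketch_lru *lru, struct list_node *item, int nid)
{
    assert(nid >= 0 && nid < MAX_NODES);
    struct node_lru *nl = &lru->node[nid];

    item->prev = nl->head.prev;
    item->next = &nl->head;
    nl->head.prev->next = item;
    nl->head.prev = item;
    nl->nr_items++;
}

/*
 * Remove up to nr_to_scan items from one node only, oldest first.
 * Other nodes' lists (and, in a real implementation, their locks)
 * are never touched. Returns the number of items removed.
 */
static long sketch_lru_shrink_node(struct sketch_lru *lru, int nid, long nr_to_scan)
{
    assert(nid >= 0 && nid < MAX_NODES);
    struct node_lru *nl = &lru->node[nid];
    long freed = 0;

    while (freed < nr_to_scan && nl->head.next != &nl->head) {
        struct list_node *victim = nl->head.next;

        victim->prev->next = victim->next;
        victim->next->prev = victim->prev;
        victim->next = victim->prev = victim;
        nl->nr_items--;
        freed++;
    }
    return freed;
}
```

    Because each node has its own list head and item count, a reclaim pass
    for one node does not walk or modify the lists of the others, which is
    the property the contention argument above relies on.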

    Although we have no official benchmarks for this version - be our guest to
    independently evaluate this - earlier versions of this series were
    performance tested (details at
    http://permalink.gmane.org/gmane.linux.kernel.mm/100537), showing no
    visible performance regressions and qualitatively better behavior on
    NUMA machines.

    With this infrastructure in place, we can use the list_lru entry point to
    provide memcg isolation and per-memcg targeted reclaim. Historically,
    those two pieces of work have been posted together. This version presents
    only the infrastructure work, deferring the memcg work for a later time,
    so we can focus on getting this part tested. You can see more about the
    history of such work at http://lwn.net/Articles/552769/

    Dave Chinner (18):
    dcache: convert dentry_stat.nr_unused to per-cpu counters
    dentry: move to per-sb LRU locks
    dcache: remove dentries from LRU before putting on dispose list
    mm: new shrinker API
    shrinker: convert superblock shrinkers to new API
    list: add a new LRU list type
    inode: convert inode lru list to generic lru list code.
    dcache: convert to use new lru list infrastructure
    list_lru: per-node list infrastructure
    shrinker: add node awareness
    fs: convert inode and dentry shrinking to be node aware
    xfs: convert buftarg LRU to generic code
    xfs: rework buffer dispose list tracking
    xfs: convert dquot cache lru to list_lru
    fs: convert fs shrinkers to new scan/count API
    drivers: convert shrinkers to new count/scan API
    shrinker: convert remaining shrinkers to count/scan API
    shrinker: Kill old ->shrink API.

    Glauber Costa (7):
    fs: bump inode and dentry counters to long
    super: fix calculation of shrinkable objects for small numbers
    list_lru: per-node API
    vmscan: per-node deferred work
    i915: bail out earlier when shrinker cannot acquire mutex
    hugepage: convert huge zero page shrinker to new shrinker API
    list_lru: dynamically adjust node arrays

    This patch:

    There are situations in very large machines in which we can have a large
    quantity of dirty inodes, unused dentries, etc. This is particularly true
    when umounting a filesystem, since every live object will eventually be
    discarded.

    Dave Chinner reported a problem with this while experimenting with the
    shrinker revamp patchset. So we believe it is time for a change. This
    patch just converts the inode and dentry counters from int to long.
    Machines where it matters should have a big long anyway.
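
    The reasoning behind the type change can be checked with a small
    standalone sketch. The helper names fits_in_int, fits_in_long and
    counter_add are hypothetical and do not appear in the patch, and the
    object count used in the note below is an illustrative figure, not one
    reported in this commit.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* True if `count` objects can be tracked by a counter of type int. */
static bool fits_in_int(unsigned long long count)
{
    return count <= (unsigned long long)INT_MAX;
}

/* True if `count` objects can be tracked by a counter of type long. */
static bool fits_in_long(unsigned long long count)
{
    return count <= (unsigned long long)LONG_MAX;
}

/*
 * Add `delta` objects to a long counter. Signed overflow is undefined
 * behavior in C, so the sketch refuses to overflow instead of wrapping.
 */
static long counter_add(long counter, long delta)
{
    assert(delta >= 0 && counter <= LONG_MAX - delta);
    return counter + delta;
}
```

    With a 32-bit int, fits_in_int(3000000000ULL) is false. Whether
    fits_in_long() accepts the same value depends on the width of long: it
    does on 64-bit Linux, where long is 64 bits, and it does not on 32-bit
    builds, where long is no wider than int.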

    Signed-off-by: Glauber Costa
    Cc: Dave Chinner
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: Dave Chinner
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Glauber Costa
     
  • * acpi-hotplug:
    PM / hibernate / memory hotplug: Rework mutual exclusion
    PM / hibernate: Create memory bitmaps after freezing user space
    ACPI / scan: Change ordering of locks for device hotplug

    Rafael J. Wysocki
     
  • Pull vfs pile 3 (of many) from Al Viro:
    "Waiman's conversion of d_path() and bits related to it,
    kern_path_mountpoint(), several cleanups and fixes (exportfs
    one is -stable fodder, IMO).

    There definitely will be more... ;-/"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    split read_seqretry_or_unlock(), convert d_walk() to resulting primitives
    dcache: Translating dentry into pathname without taking rename_lock
    autofs4 - fix device ioctl mount lookup
    introduce kern_path_mountpoint()
    rename user_path_umountat() to user_path_mountpoint_at()
    take unlazy_walk() into umount_lookup_last()
    Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c
    prune_super(): sb->s_op is never NULL
    exportfs: don't assume that ->iterate() won't feed us too long entries
    afs: get rid of redundant ->d_name.len checks

    Linus Torvalds