22 Feb, 2013

2 commits

  • Pull misc ia64 bits from Tony Luck.

    * tag 'please-pull-misc-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
    MAINTAINERS: update SGI & ia64 Altix stuff
    sysctl: Enable IA64 "ignore-unaligned-usertrap" to be used cross-arch

    Linus Torvalds
     
  • Pull driver core patches from Greg Kroah-Hartman:
    "Here is the big driver core merge for 3.9-rc1

    There are two major series here, both of which touch lots of drivers
    all over the kernel, and will cause you some merge conflicts:

    - add a new function called devm_ioremap_resource() to properly be
    able to check return values.

    - remove CONFIG_EXPERIMENTAL

    Other than those patches, there's not much here, some minor fixes and
    updates"
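
As a rough illustration of why returning an encoded error makes return values easier to check, here is a userspace sketch of the kernel's error-pointer convention. The helpers mirror the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() in simplified form; map_resource_sketch() and probe_sketch() are hypothetical stand-ins, not the real devm_ioremap_resource().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stddef.h>

/* Userspace model of the kernel's error-pointer convention (simplified). */
#define MAX_ERRNO 4095

/* Encode a negative errno value in the top page of the address space. */
static inline void *ERR_PTR(long error)
{
    assert(error < 0 && error >= -MAX_ERRNO);
    return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for a mapping helper; not the kernel function. */
static char fake_registers[16];

static void *map_resource_sketch(const void *res)
{
    if (res == NULL)
        return ERR_PTR(-EINVAL);
    return fake_registers;
}

/* Probe-style caller: one IS_ERR() check covers every failure mode. */
static int probe_sketch(const void *res)
{
    void *base = map_resource_sketch(res);

    if (IS_ERR(base))
        return (int)PTR_ERR(base);
    return 0;
}
```

The caller can propagate the specific error code instead of guessing one after a bare NULL check.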

    Fix up trivial conflicts

    * tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits)
    base: memory: fix soft/hard_offline_page permissions
    drivercore: Fix ordering between deferred_probe and exiting initcalls
    backlight: fix class_find_device() arguments
    TTY: mark tty_get_device call with the proper const values
    driver-core: constify data for class_find_device()
    firmware: Ignore abort check when no user-helper is used
    firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER
    firmware: Make user-mode helper optional
    firmware: Refactoring for splitting user-mode helper code
    Driver core: treat unregistered bus_types as having no devices
    watchdog: Convert to devm_ioremap_resource()
    thermal: Convert to devm_ioremap_resource()
    spi: Convert to devm_ioremap_resource()
    power: Convert to devm_ioremap_resource()
    mtd: Convert to devm_ioremap_resource()
    mmc: Convert to devm_ioremap_resource()
    mfd: Convert to devm_ioremap_resource()
    media: Convert to devm_ioremap_resource()
    iommu: Convert to devm_ioremap_resource()
    drm: Convert to devm_ioremap_resource()
    ...

    Linus Torvalds
     

20 Feb, 2013

3 commits

  • Pull async changes from Tejun Heo:
    "These are followups for the earlier deadlock issue involving async
    ending up waiting for itself through block requesting module[1]. The
    following changes are made by these commits.

    - Instead of requesting default elevator on each request_queue init,
    block now requests it once early during boot.

    - Kmod triggers warning if invoked from an async worker.

    - Async synchronization implementation has been reimplemented. It's
    a lot simpler now."

    * 'for-3.9-async' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    async: initialise list heads to fix crash
    async: replace list of active domains with global list of pending items
    async: keep pending tasks on async_domain and remove async_pending
    async: use ULLONG_MAX for infinity cookie value
    async: bring sanity to the use of words domain and running
    async, kmod: warn on synchronous request_module() from async workers
    block: don't request module during elevator init
    init, block: try to load default elevator module early during boot

    Linus Torvalds
     
  • Pull scheduler changes from Ingo Molnar:
    "Main changes:

    - scheduler side full-dynticks (user-space execution is undisturbed
    and receives no timer IRQs) preparation changes that convert the
    cputime accounting code to be full-dynticks ready, from Frederic
    Weisbecker.

    - Initial sched.h split-up changes, by Clark Williams

    - select_idle_sibling() performance improvement by Mike Galbraith:

    " 1 tbench pair (worst case) in a 10 core + SMT package:

    pre 15.22 MB/sec 1 procs
    post 252.01 MB/sec 1 procs "

    - sched_rr_get_interval() ABI fix/change. We think this detail is not
    used by apps (so it's not an ABI in practice), but let's keep it
    under observation.

    - misc RT scheduling cleanups, optimizations"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    sched/rt: Add header to
    cputime: Remove irqsave from seqlock readers
    sched, powerpc: Fix sched.h split-up build failure
    cputime: Restore CPU_ACCOUNTING config defaults for PPC64
    sched/rt: Move rt specific bits into new header file
    sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
    sched: Move sched.h sysctl bits into separate header
    sched: Fix signedness bug in yield_to()
    sched: Fix select_idle_sibling() bouncing cow syndrome
    sched/rt: Further simplify pick_rt_task()
    sched/rt: Do not account zero delta_exec in update_curr_rt()
    cputime: Safely read cputime of full dynticks CPUs
    kvm: Prepare to add generic guest entry/exit callbacks
    cputime: Use accessors to read task cputime stats
    cputime: Allow dynamic switch between tick/virtual based cputime accounting
    cputime: Generic on-demand virtual cputime accounting
    cputime: Move default nsecs_to_cputime() to jiffies based cputime file
    cputime: Librarize per nsecs resolution cputime definitions
    cputime: Avoid multiplication overflow on utime scaling
    context_tracking: Export context state for generic vtime
    ...

    Fix up conflict in kernel/context_tracking.c due to comment additions.

    Linus Torvalds
     
  • Pull irq core changes from Ingo Molnar:
    "The biggest changes are the IRQ-work and printk changes from Frederic
    Weisbecker, which prepare the code for 'full dynticks' (the ability to
    stop or slow down the periodic tick arbitrarily, not just in idle time
    as today):

    - Don't stop tick with irq works pending. This fix is generally
    useful and concerns archs that can't raise self IPIs.

    - Flush irq works before CPU offlining.

    - Introduce "lazy" irq works that can wait for the next tick to be
    executed, unless it's stopped.

    - Implement klogd wake up using irq work. This removes the ad-hoc
    printk_tick()/printk_needs_cpu() hooks and makes it work even in
    dynticks mode.

    - Cleanups and fixes."

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: Export enable/disable_percpu_irq()
    arch Kconfig: Remove references to IRQ_PER_CPU
    irq_work: Remove return value from the irq_work_queue() function
    genirq: Avoid deadlock in spurious handling
    printk: Wake up klogd using irq_work
    irq_work: Make self-IPIs optable
    irq_work: Warn if there's still work on cpu_down
    irq_work: Flush work on CPU_DYING
    irq_work: Don't stop the tick with pending works
    nohz: Add API to check tick state
    irq_work: Remove CONFIG_HAVE_IRQ_WORK
    irq_work: Fix racy check on work pending flag
    irq_work: Fix racy IRQ_WORK_BUSY flag setting

    Linus Torvalds
     

08 Feb, 2013

3 commits

  • Commit abf917cd91cb ("cputime: Generic on-demand virtual cputime
    accounting") inadvertently changed the default CPU_ACCOUNTING
    config for PPC64. Repair that.

    Signed-off-by: Stephen Rothwell
    Acked-by: Frederic Weisbecker
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: ppc-dev
    Cc: Benjamin Herrenschmidt
    Link: http://lkml.kernel.org/r/20130208141938.f31b7b9e1acac5bbe769ee4c@canb.auug.org.au
    Signed-off-by: Ingo Molnar

    Stephen Rothwell
     
  • Move rt scheduler definitions out of include/linux/sched.h into
    new file include/linux/sched/rt.h

    Signed-off-by: Clark Williams
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan
    Signed-off-by: Ingo Molnar

    Clark Williams
     
  • Move the sysctl-related bits from include/linux/sched.h into
    a new file: include/linux/sched/sysctl.h. Then update source
    files requiring access to those bits by including the new
    header file.

    Signed-off-by: Clark Williams
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan
    Signed-off-by: Ingo Molnar

    Clark Williams
     

07 Feb, 2013

1 commit

  • No in-kernel user of class_find_device() really needs mutable
    data for its match callback.

    In two places (kernel/power/suspend_test.c, drivers/scsi/osd/osd_uld.c)
    this patch changes match callbacks to use const search data.

    The const is propagated to rtc_class_open() and power_supply_get_by_name()
    parameters.

    Note that there's a dev reference leak in suspend_test.c that's not
    touched in this patch.
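
A simplified userspace sketch of what a const-qualified match callback looks like. The types and the array-based lookup are hypothetical; the kernel's class_find_device() walks the devices of a class rather than an array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct device; only a name is modeled here. */
struct device_sketch {
    const char *name;
};

/* Match callbacks take const search data, as the patch describes. */
typedef int (*match_fn)(struct device_sketch *dev, const void *data);

/* Hypothetical lookup over an array of devices. */
static struct device_sketch *find_device_sketch(struct device_sketch *devs,
                                                size_t count,
                                                const void *data,
                                                match_fn match)
{
    assert(match != NULL);
    for (size_t i = 0; i < count; i++) {
        if (match(&devs[i], data))
            return &devs[i];
    }
    return NULL;
}

/* The callback only reads the name, so a string literal can be passed. */
static int match_by_name(struct device_sketch *dev, const void *data)
{
    const char *name = data;

    return strcmp(dev->name, name) == 0;
}
```

With const search data, callers can pass string literals or other read-only objects without a cast.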

    Signed-off-by: Michał Mirosław
    Acked-by: Grant Likely
    Signed-off-by: Greg Kroah-Hartman

    Michał Mirosław
     

05 Feb, 2013

3 commits

  • …x/kernel/git/frederic/linux-dynticks into sched/core

    Pull full-dynticks (user-space execution is undisturbed and
    receives no timer IRQs) preparation changes that convert the
    cputime accounting code to be full-dynticks ready,
    from Frederic Weisbecker:

    "This implements the cputime accounting on full dynticks CPUs.

    Typical cputime stats infrastructure relies on the timer tick and
    its periodic polling on the CPU to account the amount of time
    spent by the CPUs and the tasks per high level domains such as
    userspace, kernelspace, guest, ...

    Now we are preparing to implement full dynticks capability on
    Linux for Real Time and HPC users who want full CPU isolation.
    This feature requires a cputime accounting that doesn't depend
    on the timer tick.

    To implement it, this new cputime infrastructure plugs into
    kernel/user/guest boundaries to take snapshots of cputime and
    flush these to the stats when needed. This performs pretty
    much like CONFIG_VIRT_CPU_ACCOUNTING except that context location
    and cputime snapshots are synchronized between write and read
    side such that the latter can safely retrieve the pending tickless
    cputime of a task and add it to its latest cputime snapshot to
    return the correct result to the user."
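
The snapshot-and-flush idea can be sketched in userspace roughly as follows. All names here are hypothetical, timestamps are plain integers, and the write/read synchronization the message mentions is left out entirely; this only shows how a reader adds the pending delta to the last flushed value.

```c
#include <assert.h>
#include <stdint.h>

enum ctx_sketch { CTX_KERNEL, CTX_USER };

/* Hypothetical per-task accounting state; field names are illustrative. */
struct vtime_sketch {
    uint64_t utime;      /* user time already flushed to the stats */
    uint64_t stime;      /* kernel time already flushed to the stats */
    uint64_t snapshot;   /* timestamp of the last boundary crossing */
    enum ctx_sketch ctx; /* context the task has been in since snapshot */
};

/* Write side: called at each kernel/user boundary with the current time. */
static void vtime_switch(struct vtime_sketch *v, enum ctx_sketch next,
                         uint64_t now)
{
    assert(now >= v->snapshot);
    if (v->ctx == CTX_USER)
        v->utime += now - v->snapshot;
    else
        v->stime += now - v->snapshot;
    v->snapshot = now;
    v->ctx = next;
}

/* Read side: flushed total plus the pending, not-yet-flushed delta. */
static uint64_t vtime_read_utime(const struct vtime_sketch *v, uint64_t now)
{
    uint64_t pending = (v->ctx == CTX_USER) ? now - v->snapshot : 0;

    return v->utime + pending;
}
```

No periodic tick is involved: time is attributed only when a boundary is crossed or when a reader asks for the current value.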

    Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • Conflicts:
    kernel/irq_work.c

    Add support for printk in full dynticks CPU.

    * Don't stop tick with irq works pending. This
    fix is generally useful and concerns archs that
    can't raise self IPIs.

    * Flush irq works before CPU offlining.

    * Introduce "lazy" irq works that can wait for the
    next tick to be executed, unless it's stopped.

    * Implement klogd wake up using irq work. This
    removes the ad-hoc printk_tick()/printk_needs_cpu()
    hooks and makes it work even in dynticks mode.

    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • …/linux-rcu into core/rcu

    Pull RCU updates from Paul E. McKenney:

    1. Changes to rcutorture and to RCU documentation. Posted to LKML at
    https://lkml.org/lkml/2013/1/26/188.

    2. Enhancements to uniprocessor handling in tiny RCU. Posted to LKML
    at https://lkml.org/lkml/2013/1/27/2.

    3. Tag RCU callbacks with grace-period number to simplify callback
    advancement. Posted to LKML at https://lkml.org/lkml/2013/1/26/203.

    4. Miscellaneous fixes. Posted to LKML at https://lkml.org/lkml/2013/1/26/204.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

31 Jan, 2013

2 commits

  • Pull x86 fixes from Peter Anvin:
    "This is a collection of miscellaneous fixes, the most important one is
    the fix for the Samsung laptop bricking issue (auto-blacklisting the
    samsung-laptop driver); the efi_enabled() changes you see below are
    prerequisites for that fix.

    The other issues fixed are booting on OLPC XO-1.5, a UV fix, NMI
    debugging, and requiring CAP_SYS_RAWIO for MSR references, just as
    with I/O port references."

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    samsung-laptop: Disable on EFI hardware
    efi: Make 'efi_enabled' a function to query EFI facilities
    smp: Fix SMP function call empty cpu mask race
    x86/msr: Add capabilities check
    x86/dma-debug: Bump PREALLOC_DMA_DEBUG_ENTRIES
    x86/olpc: Fix olpc-xo1-sci.c build errors
    arch/x86/platform/uv: Fix incorrect tlb flush all issue
    x86-64: Fix unwind annotations in recent NMI changes
    x86-32: Start out cr0 clean, disable paging before modifying cr3/4

    Linus Torvalds
     
  • Originally 'efi_enabled' indicated whether a kernel was booted from
    EFI firmware. Over time its semantics have changed, and it now
    indicates whether or not we are booted on an EFI machine with
    bit-native firmware, e.g. 64-bit kernel with 64-bit firmware.

    The immediate motivation for this patch is the bug report at,

    https://bugs.launchpad.net/ubuntu-cdimage/+bug/1040557

    which details how running a platform driver on an EFI machine that is
    designed to run under BIOS can cause the machine to become
    bricked. Also, the following report,

    https://bugzilla.kernel.org/show_bug.cgi?id=47121

    details how running said driver can also cause Machine Check
    Exceptions. Drivers need a new means of detecting whether they're
    running on an EFI machine, as sadly the expression,

    if (!efi_enabled)

    hasn't been a sufficient condition for quite some time.

    Users actually want to query 'efi_enabled' for different reasons -
    what they really want access to is the list of available EFI
    facilities.

    For instance, the x86 reboot code needs to know whether it can invoke
    the ResetSystem() function provided by the EFI runtime services, while
    the ACPI OSL code wants to know whether the EFI config tables were
    mapped successfully. There are also checks in some of the platform
    driver code to simply see if they're running on an EFI machine (which
    would make it a bad idea to do BIOS-y things).
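
A minimal userspace sketch of the facility-query idea: a bitmask of facilities replaces the single flag, and each caller asks for the one it cares about. The enum names, values, and helper functions below are assumptions made for illustration, not the definitions from the patch.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative facility bits; the names and values are assumptions. */
enum efi_facility_sketch {
    EFI_BOOT_SKETCH,             /* booted from EFI firmware */
    EFI_CONFIG_TABLES_SKETCH,    /* config tables mapped successfully */
    EFI_RUNTIME_SERVICES_SKETCH  /* runtime services, e.g. ResetSystem() */
};

static unsigned long efi_facilities_sketch;

static void efi_set_facility_sketch(enum efi_facility_sketch f)
{
    assert((unsigned)f < sizeof(unsigned long) * CHAR_BIT);
    efi_facilities_sketch |= 1UL << f;
}

/* Function-style query replacing a single global flag. */
static int efi_enabled_sketch(enum efi_facility_sketch f)
{
    return (int)((efi_facilities_sketch >> f) & 1UL);
}

/* A BIOS-only platform driver would bail out when booted via EFI. */
static int bios_driver_may_load_sketch(void)
{
    return !efi_enabled_sketch(EFI_BOOT_SKETCH);
}
```

The reboot code and a BIOS-only driver can then test different bits, so "booted via EFI" no longer implies "runtime services are usable".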

    This patch is a prereq for the samsung-laptop fix patch.

    Cc: David Airlie
    Cc: Corentin Chary
    Cc: Matthew Garrett
    Cc: Dave Jiang
    Cc: Olof Johansson
    Cc: Peter Jones
    Cc: Colin Ian King
    Cc: Steve Langasek
    Cc: Tony Luck
    Cc: Konrad Rzeszutek Wilk
    Cc: Rafael J. Wysocki
    Cc:
    Signed-off-by: Matt Fleming
    Signed-off-by: H. Peter Anvin

    Matt Fleming
     

29 Jan, 2013

2 commits

  • TINY_PREEMPT_RCU is complex and does not provide much memory savings,
    so TREE_PREEMPT_RCU should be used instead. The difference between
    TINY_PREEMPT_RCU and TREE_PREEMPT_RCU is quite small compared to the
    memory footprint of CONFIG_PREEMPT.

    This commit therefore takes a first step towards eliminating
    TINY_PREEMPT_RCU by allowing TREE_PREEMPT_RCU to be configured on !SMP
    systems.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Tiny RCU has historically omitted RCU CPU stall warnings in order to
    reduce memory requirements. However, the lack of these warnings caused
    Thomas Gleixner some debugging pain recently. Therefore, this commit
    adds RCU CPU stall warnings to tiny RCU if RCU_TRACE=y. This keeps
    the memory footprint small, while still enabling CPU stall warnings
    in kernels built to enable them.

    Updated to include Josh Triplett's suggested use of RCU_STALL_COMMON
    config variable to simplify #if expressions.

    Reported-by: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

28 Jan, 2013

1 commit

  • If we want to stop the tick outside idle, we need to be
    able to account the cputime without using the tick.

    Virtual based cputime accounting solves that problem by
    hooking into kernel/user boundaries.

    However, implementing CONFIG_VIRT_CPU_ACCOUNTING requires
    low level hooks and involves more overhead. But we already
    have a generic context tracking subsystem that is required
    for RCU needs by archs which plan to shut down the tick
    outside idle.

    This patch implements a generic virtual based cputime
    accounting that relies on these generic kernel/user hooks.

    There are some upsides of doing this:

    - This requires no arch code to implement CONFIG_VIRT_CPU_ACCOUNTING
    if context tracking is already built (already necessary for RCU in full
    tickless mode).

    - We can rely on the generic context tracking subsystem to dynamically
    (de)activate the hooks, so that we can switch anytime between virtual
    and tick based accounting. This way we don't have the overhead
    of the virtual accounting when the tick is running periodically.

    And one downside:

    - There is probably more overhead than a native virtual based cputime
    accounting. But this relies on hooks that are already set anyway.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

26 Jan, 2013

1 commit


24 Jan, 2013

2 commits


21 Jan, 2013

1 commit

  • Pull misc syscall fixes from Al Viro:

    - compat syscall fixes (discussed back in December)

    - a couple of "make life easier for sigaltstack stuff by reducing
    inter-tree dependencies"

    - fix up compiler/asmlinkage calling convention disagreement of
    sys_clone()

    - misc

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    sys_clone() needs asmlinkage_protect
    make sure that /linuxrc has std{in,out,err}
    x32: fix sigtimedwait
    x32: fix waitid()
    switch compat_sys_wait4() and compat_sys_waitid() to COMPAT_SYSCALL_DEFINE
    switch compat_sys_sigaltstack() to COMPAT_SYSCALL_DEFINE
    CONFIG_GENERIC_SIGALTSTACK build breakage with asm-generic/syscalls.h
    Ensure that kernel_init_freeable() is not inlined into non __init code

    Linus Torvalds
     

20 Jan, 2013

1 commit


19 Jan, 2013

1 commit

  • This patch adds default module loading and uses it to load the default
    block elevator. During boot, it's called right after initramfs or
    initrd is made available and right before control is passed to
    userland. This ensures that as long as the modules are available in
    the usual places in initramfs, initrd or the root filesystem, the
    default modules are loaded as soon as possible.

    This will replace the on-demand elevator module loading from elevator
    init path.

    v2: Fixed build breakage when !CONFIG_BLOCK. Reported by kbuild test
    robot.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Arjan van de Ven
    Cc: Linus Torvalds
    Cc: Alex Riesen
    Cc: Fengguang Wu

    Tejun Heo
     

18 Jan, 2013

1 commit


17 Jan, 2013

1 commit

  • In commit 281dc5c5ec0f ("Give up on pushing CC_OPTIMIZE_FOR_SIZE") we
    already changed the actual default value, but the help-text still
    suggested 'y'. Fix the help text too, for all the same reasons.

    Sadly, -Os keeps on generating some very suboptimal code for certain
    cases, to the point where any I$ miss upside is swamped by the downside.
    The main ones are:

    - using "rep movsb" for memcpy, even on CPUs where that is
    horrendously bad for performance.

    - not honoring branch prediction information, so any I$ footprint you
    win from smaller code, you lose from less code density in the I$.

    - using divide instructions when that is very expensive.

    Signed-off-by: Kirill Smelkov
    Signed-off-by: Linus Torvalds

    Kirill Smelkov
     

12 Jan, 2013

2 commits

  • The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
    while now and is almost always enabled by default. As agreed during the
    Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

    CC: "Eric W. Biederman"
    CC: Serge Hallyn
    CC: "Paul E. McKenney"
    CC: Andrew Morton
    CC: Frederic Weisbecker
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn

    Kees Cook
     
  • This config item has not carried much meaning for a while now and is
    almost always enabled by default (especially in distro builds). As agreed
    during the Linux kernel summit, it should be removed. As a first step,
    remove it from being listed, and default it to on. Once it has been
    removed from all subsystem Kconfigs, it will be dropped entirely.

    For items that really are experimental, maintainers should use "default
    n", optionally include "(EXPERIMENTAL)" in the title, and add language to
    the help text indicating why the item should be considered experimental.

    For items that are dangerously experimental, the maintainer is encouraged
    to follow the above title recommendation, add stronger language to the
    help text, and optionally use (depending on the extent of the danger,
    from least to most dangerous): printk(), add_taint(TAINT_WARN),
    add_taint(TAINT_CRAP), WARN_ON(1), and CONFIG_BROKEN.

    CC: Greg KH
    CC: "Eric W. Biederman"
    CC: Serge Hallyn
    CC: "Paul E. McKenney"
    CC: Andrew Morton
    CC: Frederic Weisbecker
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Greg Kroah-Hartman
    Reviewed-by: Paul E. McKenney

    Kees Cook
     

10 Jan, 2013

1 commit

  • IA64 defines /proc/sys/kernel/ignore-unaligned-usertrap to control
    verbose warnings on unaligned access emulation.

    Although the exact mechanics of what to do with sysctl (ignore/shout)
    are arch specific, this change enables the sysctl to be usable cross-arch.

    Signed-off-by: Vineet Gupta
    Cc: Fenghua Yu
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn
    Signed-off-by: Tony Luck

    Vineet Gupta
     

26 Dec, 2012

1 commit

  • Commit d6b2123802d "make sure that we always have a return path from
    kernel_execve()" reshuffled kernel_init()/init_post() to ensure that
    kernel_execve() has a caller to return to.

    It removed the __init annotation from kernel_init() and introduced a
    new routine, kernel_init_freeable(), which kernel_init() calls. The
    latter, however, is inlined by any reasonable compiler (ARC gcc 4.4 in
    this case), causing slight code bloat.

    This patch marks kernel_init_freeable() noinline, reducing .text:

    bloat-o-meter vmlinux vmlinux_new
    add/remove: 1/0 grow/shrink: 0/1 up/down: 374/-334 (40)
    function                  old     new   delta
    kernel_init_freeable        -     374    +374   (.init.text)
    kernel_init               628     294    -334   (.text)

    Signed-off-by: Al Viro

    Vineet Gupta
     

21 Dec, 2012

1 commit

  • Pull signal handling cleanups from Al Viro:
    "sigaltstack infrastructure + conversion for x86, alpha and um,
    COMPAT_SYSCALL_DEFINE infrastructure.

    Note that there are several conflicts between "unify
    SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
    resolution is trivial - just remove definitions of SS_ONSTACK and
    SS_DISABLE from arch/*/uapi/asm/signal.h; they are all identical and
    include/uapi/linux/signal.h contains the unified variant."

    Fixed up conflicts as per Al.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    alpha: switch to generic sigaltstack
    new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
    generic compat_sys_sigaltstack()
    introduce generic sys_sigaltstack(), switch x86 and um to it
    new helper: compat_user_stack_pointer()
    new helper: restore_altstack()
    unify SS_ONSTACK/SS_DISABLE definitions
    new helper: current_user_stack_pointer()
    missing user_stack_pointer() instances
    Bury the conditionals from kernel_thread/kernel_execve series
    COMPAT_SYSCALL_DEFINE: infrastructure

    Linus Torvalds
     

20 Dec, 2012

1 commit

  • All architectures have
    CONFIG_GENERIC_KERNEL_THREAD
    CONFIG_GENERIC_KERNEL_EXECVE
    __ARCH_WANT_SYS_EXECVE
    None of them have __ARCH_WANT_KERNEL_EXECVE and there are only two callers
    of kernel_execve() (which is a trivial wrapper for do_execve() now) left.
    Kill the conditionals and make both callers use do_execve().

    Signed-off-by: Al Viro

    Al Viro
     

19 Dec, 2012

2 commits

  • The page allocator is able to bind a page to a memcg when it is
    allocated. But for the caches, we'd like to have as many objects as
    possible in a page belonging to the same cache.

    This is done in this patch by calling memcg_kmem_get_cache in the
    beginning of every allocation function. This function is patched out by
    static branches when kernel memory controller is not being used.

    It assumes that the task allocating, which determines the memcg in the
    page allocator, belongs to the same cgroup throughout the whole process.
    Misaccounting can happen if the task calls memcg_kmem_get_cache() while
    belonging to a cgroup, and later on changes. This is considered
    acceptable, and should only happen upon task migration.

    Before the cache is created by the memcg core, there is also a possible
    imbalance: the task belongs to a memcg, but the cache being allocated from
    is the global cache, since the child cache is not yet guaranteed to be
    ready. This case is also fine, since in this case the GFP_KMEMCG will not
    be passed and the page allocator will not attempt any cgroup accounting.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • Add the basic infrastructure for the accounting of kernel memory. To
    control that, the following files are created:

    * memory.kmem.usage_in_bytes
    * memory.kmem.limit_in_bytes
    * memory.kmem.failcnt
    * memory.kmem.max_usage_in_bytes

    They have the same meaning as their user memory counterparts. They
    reflect the state of the "kmem" res_counter.

    Per cgroup kmem memory accounting is not enabled until a limit is set for
    the group. Once the limit is set the accounting cannot be disabled for
    that group. This means that after the patch is applied, no behavioral
    change exists for whoever is still using memcg to control their memory
    usage, until memory.kmem.limit_in_bytes is set for the first time.

    We always account to both user and kernel resource_counters. This
    effectively means that an independent kernel limit is in place when the
    limit is set to a lower value than the user memory. An equal or higher
    value means that the user limit will always hit first, meaning that kmem
    is effectively unlimited.

    People who want to track kernel memory but not limit it can set this
    limit to a very high number that no one will ever hit (like
    RESOURCE_MAX - 1 page), or make it equal to the user memory limit.
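
The interaction between the two limits can be sketched as below. This is a simplified userspace model with hypothetical types and names; the kernel uses res_counter structures and reports failures differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counter pair; the kernel uses res_counter structures. */
struct counter_sketch {
    uint64_t usage;
    uint64_t limit;
};

struct memcg_sketch {
    struct counter_sketch user; /* all memory charged to the group */
    struct counter_sketch kmem; /* kernel memory only */
};

/* Charge kernel memory to both counters; 0 on success, -1 on failure. */
static int charge_kmem_sketch(struct memcg_sketch *cg, uint64_t bytes)
{
    assert(cg->kmem.usage <= cg->kmem.limit);
    assert(cg->user.usage <= cg->user.limit);
    if (bytes > cg->kmem.limit - cg->kmem.usage)
        return -1; /* the independent kernel limit was hit */
    if (bytes > cg->user.limit - cg->user.usage)
        return -1; /* the user limit was hit first */
    cg->kmem.usage += bytes;
    cg->user.usage += bytes;
    return 0;
}
```

When the kmem limit is below the user limit it acts as a separate cap; when it is equal or higher, the user limit is always the one that rejects the charge.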

    [akpm@linux-foundation.org: MEMCG_KMEM only works with slab and slub]
    Signed-off-by: Glauber Costa
    Acked-by: Kamezawa Hiroyuki
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: JoonSoo Kim
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     

18 Dec, 2012

2 commits

  • Pull user namespace changes from Eric Biederman:
    "While small this set of changes is very significant with respect to
    containers in general and user namespaces in particular. The user
    space interface is now complete.

    This set of changes adds support for unprivileged users to create user
    namespaces and as a user namespace root to create other namespaces.
    The tyranny of supporting suid root preventing unprivileged users from
    using cool new kernel features is broken.

    This set of changes completes the work on setns, adding support for
    the pid, user, mount namespaces.

    This set of changes includes a bunch of basic pid namespace
    cleanups/simplifications. Of particular significance is the rework of
    the pid namespace cleanup so it no longer requires sending out
    tendrils into all kinds of unexpected cleanup paths for operation. At
    least one case of broken error handling is fixed by this cleanup.

    The files under /proc/<pid>/ns/ have been converted from regular files
    to magic symlinks, which prevents incorrect caching by the VFS,
    ensuring the files always refer to the namespace the process is
    currently using and ensuring that the ptrace_may_access permission
    checks are always applied.

    The files under /proc/<pid>/ns/ have been given stable inode numbers
    so it is now possible to see if different processes share the same
    namespaces.
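
One way a userspace program might use the stable inode numbers is to compare the device and inode of two namespace files. The helper below is a generic sketch (same_inode() is a hypothetical name) that works on any pair of paths.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Returns 1 if both paths resolve to the same inode on the same device,
 * 0 if they differ, and -1 if either path cannot be examined.
 */
static int same_inode(const char *path_a, const char *path_b)
{
    struct stat a, b;

    assert(path_a != NULL && path_b != NULL);
    if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
        return -1;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

For example, same_inode("/proc/self/ns/pid", "/proc/1/ns/pid") would report whether the caller shares a pid namespace with process 1, assuming the caller is permitted to examine that process; otherwise the call returns -1.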

    Through David Miller's net tree come changes to relax many of the
    permission checks in the networking stack, allowing the user
    namespace root to usefully use the networking stack. Similar changes
    for the mount namespace and the pid namespace are coming through my
    tree.

    Two small changes to add user namespace support were committed here and
    in David Miller's -net tree so that I could complete the work on the
    /proc/<pid>/ns/ files in this tree.

    Work remains to make it safe to build user namespaces and 9p, afs,
    ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs, so the
    Kconfig guard remains in place, preventing user namespaces from
    being built when any of those filesystems are enabled.

    Future design work remains to allow root users outside of the initial
    user namespace to mount more than just /proc and /sys."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
    proc: Usable inode numbers for the namespace file descriptors.
    proc: Fix the namespace inode permission checks.
    proc: Generalize proc inode allocation
    userns: Allow unprivilged mounts of proc and sysfs
    userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
    procfs: Print task uids and gids in the userns that opened the proc file
    userns: Implement unshare of the user namespace
    userns: Implent proc namespace operations
    userns: Kill task_user_ns
    userns: Make create_new_namespaces take a user_ns parameter
    userns: Allow unprivileged use of setns.
    userns: Allow unprivileged users to create new namespaces
    userns: Allow setting a userns mapping to your current uid.
    userns: Allow chown and setgid preservation
    userns: Allow unprivileged users to create user namespaces.
    userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
    userns: fix return value on mntns_install() failure
    vfs: Allow unprivileged manipulation of the mount namespace.
    vfs: Only support slave subtrees across different user namespaces
    vfs: Add a user namespace reference from struct mnt_namespace
    ...

    Linus Torvalds
     
  • Pull block driver update from Jens Axboe:
    "Now that the core bits are in, here are the driver bits for 3.8. The
    branch contains:

    - A huge pile of drbd bits that were dumped from the 3.7 merge
    window. Following that, it was both made perfectly clear that
    there is going to be no more over-the-wall pulls and how the
    situation on individual pulls can be improved.

    - A few cleanups from Akinobu Mita for drbd and cciss.

    - Queue improvement for loop from Lukas. This grew into adding a
    generic interface for waiting/checking an event with a specific
    lock, allowing this to be pulled out of md, and now loop and drbd are
    also using it.

    - A few fixes for xen back/front block driver from Roger Pau Monne.

    - Partition improvements from Stephen Warren, allowing partition UUID
    to be used as an identifier."

    * 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
    drbd: update Kconfig to match current dependencies
    drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
    drbd: close race between drbd_set_role and drbd_connect
    drbd: respect no-md-barriers setting also when changed online via disk-options
    drbd: Remove obsolete check
    drbd: fixup after wait_even_lock_irq() addition to generic code
    loop: Limit the number of requests in the bio list
    wait: add wait_event_lock_irq() interface
    xen-blkfront: free allocated page
    xen-blkback: move free persistent grants code
    block: partition: msdos: provide UUIDs for partitions
    init: reduce PARTUUID min length to 1 from 36
    block: store partition_meta_info.uuid as a string
    cciss: use check_signature()
    cciss: cleanup bitops usage
    drbd: use copy_highpage
    drbd: if the replication link breaks during handshake, keep retrying
    drbd: check return of kmalloc in receive_uuids
    drbd: Broadcast sync progress no more often than once per second
    drbd: don't try to clear bits once the disk has failed
    ...

    Linus Torvalds
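
The "wait for an event under a specific lock" pattern mentioned in the pull message above can be illustrated with a userspace analogy. This is only a sketch in Python, not the kernel's wait_event_lock_irq() implementation: the BoundedQueue class, its limit parameter, and the demo() helper are hypothetical names chosen for illustration. The idea shown is that the condition is always checked with the lock held, and the lock is dropped only while sleeping.

```python
import threading
from collections import deque


class BoundedQueue:
    """Hypothetical bounded request list (illustration only, not driver code)."""

    def __init__(self, limit):
        self.limit = limit
        self.items = deque()
        # One lock protects the list; the condition sleeps and wakes on it.
        self.cond = threading.Condition(threading.Lock())

    def submit(self, item):
        with self.cond:
            # wait_for() releases the lock while sleeping and re-acquires it
            # before re-checking the predicate, so the check is never racy.
            self.cond.wait_for(lambda: len(self.items) < self.limit)
            self.items.append(item)
            self.cond.notify_all()

    def take(self):
        with self.cond:
            self.cond.wait_for(lambda: len(self.items) > 0)
            item = self.items.popleft()
            self.cond.notify_all()
            return item


def demo():
    """Push five items through a queue that holds at most two at a time."""
    q = BoundedQueue(limit=2)
    results = []
    consumer = threading.Thread(
        target=lambda: results.extend(q.take() for _ in range(5))
    )
    consumer.start()
    for i in range(5):
        q.submit(i)  # blocks whenever two items are already queued
    consumer.join()
    return results
```

With a single producer and a single consumer, demo() returns the items in submission order, and the producer sleeps whenever the list is full instead of growing it without bound.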
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

16 Dec, 2012

1 commit

  • This reverts commit bd52276fa1d4 ("x86-64/efi: Use EFI to deal with
    platform wall clock (again)"), and the two supporting commits:

    da5a108d05b4: "x86/kernel: remove tboot 1:1 page table creation code"

    185034e72d59: "x86, efi: 1:1 pagetable mapping for virtual EFI calls"

    as they all depend semantically on commit 53b87cf088e2 ("x86, mm:
    Include the entire kernel memory map in trampoline_pgd") that got
    reverted earlier due to the problems it caused.

    This was pointed out by Yinghai Lu, and verified by me on my Macbook Air
    that uses EFI.

    Pointed-out-by: Yinghai Lu
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Dec, 2012

1 commit

  • Pull x86 EFI update from Peter Anvin:
    "EFI tree, from Matt Fleming. Most of the patches are the new efivarfs
    filesystem by Matt Garrett & co. The balance are support for EFI
    wallclock in the absence of a hardware-specific driver, and various
    fixes and cleanups."

    * 'core-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    efivarfs: Make efivarfs_fill_super() static
    x86, efi: Check table header length in efi_bgrt_init()
    efivarfs: Use query_variable_info() to limit kmalloc()
    efivarfs: Fix return value of efivarfs_file_write()
    efivarfs: Return a consistent error when efivarfs_get_inode() fails
    efivarfs: Make 'datasize' unsigned long
    efivarfs: Add unique magic number
    efivarfs: Replace magic number with sizeof(attributes)
    efivarfs: Return an error if we fail to read a variable
    efi: Clarify GUID length calculations
    efivarfs: Implement exclusive access for {get,set}_variable
    efivarfs: efivarfs_fill_super() ensure we clean up correctly on error
    efivarfs: efivarfs_fill_super() ensure we free our temporary name
    efivarfs: efivarfs_fill_super() fix inode reference counts
    efivarfs: efivarfs_create() ensure we drop our reference on inode on error
    efivarfs: efivarfs_file_read ensure we free data in error paths
    x86-64/efi: Use EFI to deal with platform wall clock (again)
    x86/kernel: remove tboot 1:1 page table creation code
    x86, efi: 1:1 pagetable mapping for virtual EFI calls
    x86, mm: Include the entire kernel memory map in trampoline_pgd
    ...

    Linus Torvalds
     

13 Dec, 2012

1 commit

  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
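
The distinction drawn in the commit message above can be sketched as two set computations. This is a toy model in Python, not kernel code: the NODE_ZONES table, the zone names, and both helper functions are hypothetical, and the node whose memory sits only in a movable zone is an assumed example of memory that is neither "normal" nor "high".

```python
# Hypothetical 3-node machine: which kinds of memory each node has.
NODE_ZONES = {
    0: {"normal"},   # ordinary node
    1: {"movable"},  # assumed case: memory that is neither normal nor high
    2: set(),        # memoryless node
}


def nodes_with_normal_or_high_memory(node_zones):
    """Analogue of N_HIGH_MEMORY: nodes that have normal or high memory."""
    return {n for n, zones in node_zones.items() if zones & {"normal", "high"}}


def nodes_with_any_memory(node_zones):
    """Analogue of N_MEMORY: nodes that have any memory at all."""
    return {n for n, zones in node_zones.items() if zones}
```

In this model, code that iterates over the first set would skip node 1 even though it has memory, which is why code that means "nodes which have memory" should use the second.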
     

11 Dec, 2012

1 commit