13 Mar, 2013

1 commit


02 Mar, 2013

1 commit

  • Pull new ARC architecture from Vineet Gupta:
    "Initial ARC Linux port with some fixes on top for 3.9-rc1:

    I would like to introduce the Linux port to ARC Processors (from
    Synopsys) for 3.9-rc1. The patch-set has been discussed on the public
    lists since Nov and has received a fair bit of review, especially from
    Arnd, tglx, Al and other subsystem maintainers for DeviceTree, kgdb...

    The arch bits are in arch/arc, some asm-generic changes (acked by
    Arnd), a minor change to PARISC (acked by Helge).

    The series is a touch bigger for a new port for 2 main reasons:

    1. It enables a basic kernel in first sub-series and adds
    ptrace/kgdb/.. later

    2. Some of the fallout of review (DeviceTree support, multi-platform-image
    support) was added on top of the orig series, primarily to
    record the revision history.

    This updated pull request additionally contains

    - fixes due to our GNU tools catching up with the new syscall/ptrace
    ABI

    - some (minor) cross-arch Kconfig updates."

    * tag 'arc-v3.9-rc1-late' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: (82 commits)
    ARC: split elf.h into uapi and export it for userspace
    ARC: Fixup the current ABI version
    ARC: gdbserver using regset interface possibly broken
    ARC: Kconfig cleanup tracking cross-arch Kconfig pruning in merge window
    ARC: make a copy of flat DT
    ARC: [plat-arcfpga] DT arc-uart bindings change: "baud" => "current-speed"
    ARC: Ensure CONFIG_VIRT_TO_BUS is not enabled
    ARC: Fix pt_orig_r8 access
    ARC: [3.9] Fallout of hlist iterator update
    ARC: 64bit RTSC timestamp hardware issue
    ARC: Don't fiddle with non-existent caches
    ARC: Add self to MAINTAINERS
    ARC: Provide a default serial.h for uart drivers needing BASE_BAUD
    ARC: [plat-arcfpga] defconfig for fully loaded ARC Linux
    ARC: [Review] Multi-platform image #8: platform registers SMP callbacks
    ARC: [Review] Multi-platform image #7: SMP common code to use callbacks
    ARC: [Review] Multi-platform image #6: cpu-to-dma-addr optional
    ARC: [Review] Multi-platform image #5: NR_IRQS defined by ARC core
    ARC: [Review] Multi-platform image #4: Isolate platform headers
    ARC: [Review] Multi-platform image #3: switch to board callback
    ...

    Linus Torvalds
     

26 Feb, 2013

2 commits

  • Pull user namespace and namespace infrastructure changes from Eric W Biederman:
    "This set of changes starts with a few small enhancements to the user
    namespace: reboot support, allowing more arbitrary mappings, and
    support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
    user namespace root.

    I do my best to document that if you care about limiting your
    unprivileged users that when you have the user namespace support
    enabled you will need to enable memory control groups.

    There is a minor bug fix to prevent overflowing the stack if someone
    creates way too many user namespaces.

    The bulk of the changes are a continuation of the kuid/kgid push down
    work through the filesystems. These changes make using uids and gids
    typesafe which ensures that these filesystems are safe to use when
    multiple user namespaces are in use. The filesystems converted for
    3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs. The
    changes for these filesystems were a little more involved so I split
    the changes into smaller hopefully obviously correct changes.

    XFS is the only filesystem that remains. I was hoping I could get
    that in this release so that user namespace support would be enabled
    with an allyesconfig or an allmodconfig but it looks like the xfs
    changes need another couple of days before they are ready."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
    cifs: Enable building with user namespaces enabled.
    cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
    cifs: Convert struct cifs_sb_info to use kuids and kgids
    cifs: Modify struct smb_vol to use kuids and kgids
    cifs: Convert struct cifsFileInfo to use a kuid
    cifs: Convert struct cifs_fattr to use kuid and kgids
    cifs: Convert struct tcon_link to use a kuid.
    cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
    cifs: Convert from a kuid before printing current_fsuid
    cifs: Use kuids and kgids SID to uid/gid mapping
    cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
    cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
    cifs: Override unmappable incoming uids and gids
    nfsd: Enable building with user namespaces enabled.
    nfsd: Properly compare and initialize kuids and kgids
    nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
    nfsd: Modify nfsd4_cb_sec to use kuids and kgids
    nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
    nfsd: Convert nfsxdr to use kuids and kgids
    nfsd: Convert nfs3xdr to use kuids and kgids
    ...

    Linus Torvalds
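
The kuid/kgid push-down described in the pull message above can be illustrated with a small user-space model. This is only a sketch in Python, not kernel code: make_kuid() and from_kuid() mirror the names of the kernel helpers, while the Kuid, Kgid and UserNamespace classes and the (first, lower_first, count) extent layout are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Kuid:
    """Kernel-global user id, wrapped so it cannot be confused with a raw int."""
    val: int


@dataclass(frozen=True)
class Kgid:
    """Kernel-global group id; a distinct type from Kuid on purpose."""
    val: int


# One mapping extent: (first id inside the namespace,
#                      first kernel-global id, number of ids)
Extent = Tuple[int, int, int]


class UserNamespace:
    """Hypothetical stand-in for a user namespace with a uid map."""

    def __init__(self, uid_extents: List[Extent]):
        self.uid_extents = uid_extents


def make_kuid(ns: UserNamespace, uid: int) -> Optional[Kuid]:
    """Map a namespace-local uid to a Kuid; None if the uid is unmapped."""
    for first, lower_first, count in ns.uid_extents:
        if first <= uid < first + count:
            return Kuid(lower_first + (uid - first))
    return None


def from_kuid(ns: UserNamespace, kuid: Kuid) -> Optional[int]:
    """Map a Kuid back to the uid seen inside ns; None if unmapped."""
    if not isinstance(kuid, Kuid):
        # This is the "typesafe" part: raw ints and Kgids are rejected.
        raise TypeError("from_kuid() needs a Kuid")
    for first, lower_first, count in ns.uid_extents:
        if lower_first <= kuid.val < lower_first + count:
            return first + (kuid.val - lower_first)
    return None
```

With a namespace that maps uids 0..65535 onto 100000..165535, make_kuid(ns, 0) yields Kuid(100000), and passing a Kgid or a plain integer to from_kuid() raises instead of silently comparing ids from different namespaces.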
     
  • Pull module update from Rusty Russell:
    "The sweeping change is to make add_taint() explicitly indicate whether
    to disable lockdep, but it's a mechanical change."

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    MODSIGN: Add option to not sign modules during modules_install
    MODSIGN: Add -s option to sign-file
    MODSIGN: Specify the hash algorithm on sign-file command line
    MODSIGN: Simplify Makefile with a Kconfig helper
    module: clean up load_module a little more.
    modpost: Ignore ARC specific non-alloc sections
    module: constify within_module_*
    taint: add explicit flag to show whether lock dep is still OK.
    module: printk message when module signature fail taints kernel.

    Linus Torvalds
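
The add_taint() change mentioned above can be pictured with a short Python model. It is a sketch under assumptions: LOCKDEP_STILL_OK and LOCKDEP_NOW_UNRELIABLE follow the kernel's naming, but the KernelState class and the bit numbers chosen for the taint flags are illustrative only.

```python
from enum import Enum


class LockdepOk(Enum):
    LOCKDEP_STILL_OK = 0
    LOCKDEP_NOW_UNRELIABLE = 1


# Taint flag bit numbers; the values here are illustrative, not authoritative.
TAINT_WARN = 9
TAINT_CRAP = 10


class KernelState:
    """Hypothetical container for the taint mask and the lockdep switch."""

    def __init__(self):
        self.tainted_mask = 0
        self.lockdep_enabled = True

    def add_taint(self, flag: int, lockdep_ok: LockdepOk) -> None:
        # The caller now states explicitly whether lock debugging can still
        # be trusted, instead of the decision being implied by the flag.
        if lockdep_ok is LockdepOk.LOCKDEP_NOW_UNRELIABLE:
            self.lockdep_enabled = False
        self.tainted_mask |= 1 << flag

    def test_taint(self, flag: int) -> bool:
        return bool(self.tainted_mask & (1 << flag))
```

A warning-level taint can pass LOCKDEP_STILL_OK and leave lock debugging running, while a caller that may have corrupted state passes LOCKDEP_NOW_UNRELIABLE.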
     

22 Feb, 2013

2 commits

  • Pull misc ia64 bits from Tony Luck.

    * tag 'please-pull-misc-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
    MAINTAINERS: update SGI & ia64 Altix stuff
    sysctl: Enable IA64 "ignore-unaligned-usertrap" to be used cross-arch

    Linus Torvalds
     
  • Pull driver core patches from Greg Kroah-Hartman:
    "Here is the big driver core merge for 3.9-rc1

    There are two major series here, both of which touch lots of drivers
    all over the kernel, and will cause you some merge conflicts:

    - add a new function called devm_ioremap_resource() to properly be
    able to check return values.

    - remove CONFIG_EXPERIMENTAL

    Other than those patches, there's not much here, some minor fixes and
    updates"

    Fix up trivial conflicts

    * tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits)
    base: memory: fix soft/hard_offline_page permissions
    drivercore: Fix ordering between deferred_probe and exiting initcalls
    backlight: fix class_find_device() arguments
    TTY: mark tty_get_device call with the proper const values
    driver-core: constify data for class_find_device()
    firmware: Ignore abort check when no user-helper is used
    firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER
    firmware: Make user-mode helper optional
    firmware: Refactoring for splitting user-mode helper code
    Driver core: treat unregistered bus_types as having no devices
    watchdog: Convert to devm_ioremap_resource()
    thermal: Convert to devm_ioremap_resource()
    spi: Convert to devm_ioremap_resource()
    power: Convert to devm_ioremap_resource()
    mtd: Convert to devm_ioremap_resource()
    mmc: Convert to devm_ioremap_resource()
    mfd: Convert to devm_ioremap_resource()
    media: Convert to devm_ioremap_resource()
    iommu: Convert to devm_ioremap_resource()
    drm: Convert to devm_ioremap_resource()
    ...

    Linus Torvalds
     

20 Feb, 2013

2 commits

  • Pull scheduler changes from Ingo Molnar:
    "Main changes:

    - scheduler side full-dynticks (user-space execution is undisturbed
    and receives no timer IRQs) preparation changes that convert the
    cputime accounting code to be full-dynticks ready, from Frederic
    Weisbecker.

    - Initial sched.h split-up changes, by Clark Williams

    - select_idle_sibling() performance improvement by Mike Galbraith:

    " 1 tbench pair (worst case) in a 10 core + SMT package:

    pre 15.22 MB/sec 1 procs
    post 252.01 MB/sec 1 procs "

    - sched_rr_get_interval() ABI fix/change. We think this detail is not
    used by apps (so it's not an ABI in practice), but let's keep it
    under observation.

    - misc RT scheduling cleanups, optimizations"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    sched/rt: Add header to
    cputime: Remove irqsave from seqlock readers
    sched, powerpc: Fix sched.h split-up build failure
    cputime: Restore CPU_ACCOUNTING config defaults for PPC64
    sched/rt: Move rt specific bits into new header file
    sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
    sched: Move sched.h sysctl bits into separate header
    sched: Fix signedness bug in yield_to()
    sched: Fix select_idle_sibling() bouncing cow syndrome
    sched/rt: Further simplify pick_rt_task()
    sched/rt: Do not account zero delta_exec in update_curr_rt()
    cputime: Safely read cputime of full dynticks CPUs
    kvm: Prepare to add generic guest entry/exit callbacks
    cputime: Use accessors to read task cputime stats
    cputime: Allow dynamic switch between tick/virtual based cputime accounting
    cputime: Generic on-demand virtual cputime accounting
    cputime: Move default nsecs_to_cputime() to jiffies based cputime file
    cputime: Librarize per nsecs resolution cputime definitions
    cputime: Avoid multiplication overflow on utime scaling
    context_tracking: Export context state for generic vtime
    ...

    Fix up conflict in kernel/context_tracking.c due to comment additions.

    Linus Torvalds
     
  • Pull irq core changes from Ingo Molnar:
    "The biggest changes are the IRQ-work and printk changes from Frederic
    Weisbecker, which prepare the code for 'full dynticks' (the ability to
    stop or slow down the periodic tick arbitrarily, not just in idle time
    as today):

    - Don't stop tick with irq works pending. This fix is generally
    useful and concerns archs that can't raise self IPIs.

    - Flush irq works before CPU offlining.

    - Introduce "lazy" irq works that can wait for the next tick to be
    executed, unless it's stopped.

    - Implement klogd wake up using irq work. This removes the ad-hoc
    printk_tick()/printk_needs_cpu() hooks and makes it work even in
    dynticks mode.

    - Cleanups and fixes."

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: Export enable/disable_percpu_irq()
    arch Kconfig: Remove references to IRQ_PER_CPU
    irq_work: Remove return value from the irq_work_queue() function
    genirq: Avoid deadlock in spurious handling
    printk: Wake up klogd using irq_work
    irq_work: Make self-IPIs optable
    irq_work: Warn if there's still work on cpu_down
    irq_work: Flush work on CPU_DYING
    irq_work: Don't stop the tick with pending works
    nohz: Add API to check tick state
    irq_work: Remove CONFIG_HAVE_IRQ_WORK
    irq_work: Fix racy check on work pending flag
    irq_work: Fix racy IRQ_WORK_BUSY flag setting

    Linus Torvalds
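
The irq work rules listed in the pull message above (lazy works wait for the next tick unless it is stopped, and the tick must not stop while works are pending) can be sketched as a toy queue. This is a Python model with hypothetical names (IrqWorkQueue, raise_self_ipi, can_stop_tick); it is not the kernel's irq_work implementation.

```python
class IrqWorkQueue:
    """Toy per-CPU queue modelling lazy and non-lazy irq works."""

    def __init__(self):
        self.pending = []          # callbacks waiting to run
        self.tick_stopped = False  # True when the periodic tick is off
        self.self_ipis_raised = 0  # how often we had to interrupt ourselves

    def queue(self, work, lazy=False):
        self.pending.append(work)
        # A lazy work can wait for the next tick, unless the tick is
        # stopped, in which case nothing would ever run it.
        if not lazy or self.tick_stopped:
            self.raise_self_ipi()

    def raise_self_ipi(self):
        self.self_ipis_raised += 1
        self.run_pending()

    def tick(self):
        # The periodic tick drains whatever is still queued.
        self.run_pending()

    def run_pending(self):
        works, self.pending = self.pending, []
        for work in works:
            work()

    def can_stop_tick(self):
        # "Don't stop tick with irq works pending."
        return not self.pending
```

Queuing a lazy work while the tick is running leaves it pending until tick() is called; queuing the same work with tick_stopped set runs it immediately through the self-IPI path.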
     

16 Feb, 2013

1 commit

  • PARISC defines /proc/sys/kernel/unaligned-trap to runtime toggle
    unaligned access emulation.

    The exact mechanics of enabling/disabling are still arch specific, but we
    can make the sysctl usable by other arches.

    Signed-off-by: Vineet Gupta
    Acked-by: Helge Deller
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn

    Vineet Gupta
     

13 Feb, 2013

8 commits


12 Feb, 2013

2 commits


08 Feb, 2013

1 commit

  • Commit abf917cd91cb ("cputime: Generic on-demand virtual cputime
    accounting") inadvertently changed the default CPU_ACCOUNTING
    config for PPC64. Repair that.

    Signed-off-by: Stephen Rothwell
    Acked-by: Frederic Weisbecker
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: ppc-dev
    Cc: Benjamin Herrenschmidt
    Link: http://lkml.kernel.org/r/20130208141938.f31b7b9e1acac5bbe769ee4c@canb.auug.org.au
    Signed-off-by: Ingo Molnar

    Stephen Rothwell
     

05 Feb, 2013

2 commits

  • Conflicts:
    kernel/irq_work.c

    Add support for printk in full dynticks CPU.

    * Don't stop tick with irq works pending. This
    fix is generally useful and concerns archs that
    can't raise self IPIs.

    * Flush irq works before CPU offlining.

    * Introduce "lazy" irq works that can wait for the
    next tick to be executed, unless it's stopped.

    * Implement klogd wake up using irq work. This
    removes the ad-hoc printk_tick()/printk_needs_cpu()
    hooks and makes it work even in dynticks mode.

    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • …/linux-rcu into core/rcu

    Pull RCU updates from Paul E. McKenney:

    1. Changes to rcutorture and to RCU documentation. Posted to LKML at
    https://lkml.org/lkml/2013/1/26/188.

    2. Enhancements to uniprocessor handling in tiny RCU. Posted to LKML
    at https://lkml.org/lkml/2013/1/27/2.

    3. Tag RCU callbacks with grace-period number to simplify callback
    advancement. Posted to LKML at https://lkml.org/lkml/2013/1/26/203.

    4. Miscellaneous fixes. Posted to LKML at https://lkml.org/lkml/2013/1/26/204.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

29 Jan, 2013

2 commits

  • TINY_PREEMPT_RCU is complex and does not provide that much memory
    savings, so TREE_PREEMPT_RCU should be used instead. The difference
    between TINY_PREEMPT_RCU and TREE_PREEMPT_RCU is quite small compared to
    the memory footprint of CONFIG_PREEMPT.

    This commit therefore takes a first step towards eliminating
    TINY_PREEMPT_RCU by allowing TREE_PREEMPT_RCU to be configured on !SMP
    systems.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Tiny RCU has historically omitted RCU CPU stall warnings in order to
    reduce memory requirements. However, the lack of these warnings caused
    Thomas Gleixner some debugging pain recently. Therefore, this commit
    adds RCU CPU stall warnings to tiny RCU if RCU_TRACE=y. This keeps
    the memory footprint small, while still enabling CPU stall warnings
    in kernels built to enable them.

    Updated to include Josh Triplett's suggested use of RCU_STALL_COMMON
    config variable to simplify #if expressions.

    Reported-by: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

28 Jan, 2013

1 commit

  • If we want to stop the tick outside of idle, we need to be
    able to account the cputime without using the tick.

    Virtual based cputime accounting solves that problem by
    hooking into kernel/user boundaries.

    However, implementing CONFIG_VIRT_CPU_ACCOUNTING requires
    low level hooks and involves more overhead. But we already
    have a generic context tracking subsystem that is required
    for RCU needs by archs which plan to shut down the tick
    outside idle.

    This patch implements a generic virtual based cputime
    accounting that relies on these generic kernel/user hooks.

    There are some upsides of doing this:

    - This requires no arch code to implement CONFIG_VIRT_CPU_ACCOUNTING
    if context tracking is already built (already necessary for RCU in full
    tickless mode).

    - We can rely on the generic context tracking subsystem to dynamically
    (de)activate the hooks, so that we can switch anytime between virtual
    and tick based accounting. This way we don't have the overhead
    of the virtual accounting when the tick is running periodically.

    And one downside:

    - There is probably more overhead than a native virtual based cputime
    accounting. But this relies on hooks that are already set anyway.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
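
The idea of accounting cputime from the kernel/user boundary hooks rather than from the tick can be sketched as follows. This is a minimal Python model: user_enter() and user_exit() borrow the names of the context tracking hooks, while the VtimeAccounting class, the injectable clock and the time units are assumptions for illustration.

```python
class VtimeAccounting:
    """Accumulate user and system time at kernel/user transitions."""

    def __init__(self, clock):
        self.clock = clock        # callable returning the current time
        self.utime = 0            # time spent in user space
        self.stime = 0            # time spent in the kernel
        self.in_user = False      # tasks start out in kernel context here
        self.snapshot = clock()   # time of the last transition

    def _account(self):
        # Charge the time since the last transition to whichever
        # context we were in, with no periodic tick involved.
        now = self.clock()
        delta = now - self.snapshot
        self.snapshot = now
        if self.in_user:
            self.utime += delta
        else:
            self.stime += delta

    def user_enter(self):
        self._account()           # everything up to now was kernel time
        self.in_user = True

    def user_exit(self):
        self._account()           # everything up to now was user time
        self.in_user = False
```

With a fake clock that reads 5 at the first user_enter(), 12 at user_exit() and 15 at the next user_enter(), the model ends up with 7 units of user time and 8 units of system time.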
     

27 Jan, 2013

1 commit


25 Jan, 2013

2 commits


24 Jan, 2013

1 commit


18 Jan, 2013

1 commit


17 Jan, 2013

1 commit

  • In commit 281dc5c5ec0f ("Give up on pushing CC_OPTIMIZE_FOR_SIZE") we
    already changed the actual default value, but the help-text still
    suggested 'y'. Fix the help text too, for all the same reasons.

    Sadly, -Os keeps on generating some very suboptimal code for certain
    cases, to the point where any I$ miss upside is swamped by the downside.
    The main ones are:

    - using "rep movsb" for memcpy, even on CPUs where that is
    horrendously bad for performance.

    - not honoring branch prediction information, so any I$ footprint you
    win from smaller code, you lose from less code density in the I$.

    - using divide instructions when that is very expensive.

    Signed-off-by: Kirill Smelkov
    Signed-off-by: Linus Torvalds

    Kirill Smelkov
     

12 Jan, 2013

2 commits

  • The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
    while now and is almost always enabled by default. As agreed during the
    Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

    CC: "Eric W. Biederman"
    CC: Serge Hallyn
    CC: "Paul E. McKenney"
    CC: Andrew Morton
    CC: Frederic Weisbecker
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn

    Kees Cook
     
  • This config item has not carried much meaning for a while now and is
    almost always enabled by default (especially in distro builds). As agreed
    during the Linux kernel summit, it should be removed. As a first step,
    remove it from being listed, and default it to on. Once it has been
    removed from all subsystem Kconfigs, it will be dropped entirely.

    For items that really are experimental, maintainers should use "default
    n", optionally include "(EXPERIMENTAL)" in the title, and add language to
    the help text indicating why the item should be considered experimental.

    For items that are dangerously experimental, the maintainer is encouraged
    to follow the above title recommendation, add stronger language to the
    help text, and optionally use (depending on the extent of the danger,
    from least to most dangerous): printk(), add_taint(TAINT_WARN),
    add_taint(TAINT_CRAP), WARN_ON(1), and CONFIG_BROKEN.

    CC: Greg KH
    CC: "Eric W. Biederman"
    CC: Serge Hallyn
    CC: "Paul E. McKenney"
    CC: Andrew Morton
    CC: Frederic Weisbecker
    Signed-off-by: Kees Cook
    Acked-by: Serge Hallyn
    Acked-by: Greg Kroah-Hartman
    Reviewed-by: Paul E. McKenney

    Kees Cook
     

10 Jan, 2013

1 commit

  • IA64 defines /proc/sys/kernel/ignore-unaligned-usertrap to control
    verbose warnings on unaligned access emulation.

    Although the exact mechanics of what to do with sysctl (ignore/shout)
    are arch specific, this change enables the sysctl to be usable cross-arch.

    Signed-off-by: Vineet Gupta
    Cc: Fenghua Yu
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn
    Signed-off-by: Tony Luck

    Vineet Gupta
     

19 Dec, 2012

2 commits

  • The page allocator is able to bind a page to a memcg when it is
    allocated. But for the caches, we'd like to have as many objects as
    possible in a page belonging to the same cache.

    This is done in this patch by calling memcg_kmem_get_cache in the
    beginning of every allocation function. This function is patched out by
    static branches when kernel memory controller is not being used.

    It assumes that the task allocating, which determines the memcg in the
    page allocator, belongs to the same cgroup throughout the whole process.
    Misaccounting can happen if the task calls memcg_kmem_get_cache() while
    belonging to a cgroup, and later on changes. This is considered
    acceptable, and should only happen upon task migration.

    Before the cache is created by the memcg core, there is also a possible
    imbalance: the task belongs to a memcg, but the cache being allocated from
    is the global cache, since the child cache is not yet guaranteed to be
    ready. This case is also fine, since in this case the GFP_KMEMCG will not
    be passed and the page allocator will not attempt any cgroup accounting.

    Signed-off-by: Glauber Costa
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • Add the basic infrastructure for the accounting of kernel memory. To
    control that, the following files are created:

    * memory.kmem.usage_in_bytes
    * memory.kmem.limit_in_bytes
    * memory.kmem.failcnt
    * memory.kmem.max_usage_in_bytes

    They have the same meaning as their user memory counterparts. They
    reflect the state of the "kmem" res_counter.

    Per cgroup kmem memory accounting is not enabled until a limit is set for
    the group. Once the limit is set the accounting cannot be disabled for
    that group. This means that after the patch is applied, no behavioral
    changes exist for whoever is still using memcg to control their memory
    usage, until memory.kmem.limit_in_bytes is set for the first time.

    We always account to both user and kernel resource_counters. This
    effectively means that an independent kernel limit is in place when the
    limit is set to a lower value than the user memory. An equal or higher
    value means that the user limit will always hit first, meaning that kmem
    is effectively unlimited.

    People who want to track kernel memory but not limit it can set this
    limit to a very high number that no one will ever hit (such as
    RESOURCE_MAX minus one page), or to a value equal to the user memory.

    [akpm@linux-foundation.org: MEMCG_KMEM only works with slab and slub]
    Signed-off-by: Glauber Costa
    Acked-by: Kamezawa Hiroyuki
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: JoonSoo Kim
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
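
The interaction between the kmem limit and the user memory limit described above can be made concrete with a small model. This is a Python sketch, not the memcg code: ResCounter and MemCgroup are hypothetical classes, and the fields only loosely mirror the usage_in_bytes, limit_in_bytes, failcnt and max_usage_in_bytes files.

```python
class ResCounter:
    """Simplified counter with usage, limit, failure count and high-water mark."""

    def __init__(self, limit):
        self.usage = 0
        self.limit = limit
        self.failcnt = 0
        self.max_usage = 0

    def charge(self, nbytes):
        if self.usage + nbytes > self.limit:
            self.failcnt += 1
            return False
        self.usage += nbytes
        self.max_usage = max(self.max_usage, self.usage)
        return True

    def uncharge(self, nbytes):
        self.usage -= nbytes


class MemCgroup:
    """Toy cgroup: kernel memory is charged to both counters."""

    def __init__(self, memory_limit, kmem_limit=None):
        self.memory = ResCounter(memory_limit)
        # kmem accounting stays off until a kmem limit has been set.
        self.kmem = ResCounter(kmem_limit) if kmem_limit is not None else None

    def charge_kmem(self, nbytes):
        if self.kmem is None:
            return True  # not accounted at all
        if not self.kmem.charge(nbytes):
            return False
        if not self.memory.charge(nbytes):
            self.kmem.uncharge(nbytes)  # roll back the partial charge
            return False
        return True
```

With a user limit of 100 and a kmem limit of 40, kernel allocations fail at the kmem counter first, so the kmem limit acts independently. With a kmem limit of 200 and the same user limit, the user counter always fails first, so kmem is tracked but effectively unlimited.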
     

18 Dec, 2012

1 commit

  • Pull user namespace changes from Eric Biederman:
    "While small this set of changes is very significant with respect to
    containers in general and user namespaces in particular. The user
    space interface is now complete.

    This set of changes adds support for unprivileged users to create user
    namespaces and as a user namespace root to create other namespaces.
    The tyranny of supporting suid root preventing unprivileged users from
    using cool new kernel features is broken.

    This set of changes completes the work on setns, adding support for
    the pid, user, mount namespaces.

    This set of changes includes a bunch of basic pid namespace
    cleanups/simplifications. Of particular significance is the rework of
    the pid namespace cleanup so it no longer requires sending out
    tendrils into all kinds of unexpected cleanup paths for operation. At
    least one case of broken error handling is fixed by this cleanup.

    The files under /proc//ns/ have been converted from regular files
    to magic symlinks which prevents incorrect caching by the VFS,
    ensuring the files always refer to the namespace the process is
    currently using and ensuring that the ptrace_may_access permission
    checks are always applied.

    The files under /proc//ns/ have been given stable inode numbers
    so it is now possible to see if different processes share the same
    namespaces.

    Through David Miller's net tree are changes to relax many of the
    permission checks in the networking stack to allow the user
    namespace root to usefully use the networking stack. Similar changes
    for the mount namespace and the pid namespace are coming through my
    tree.

    Two small changes to add user namespace support were committed here and
    in David Miller's -net tree so that I could complete the work on the
    /proc//ns/ files in this tree.

    Work remains to make it safe to build user namespaces and 9p, afs,
    ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
    Kconfig guard remains in place preventing user namespaces from
    being built when any of those filesystems are enabled.

    Future design work remains to allow root users outside of the initial
    user namespace to mount more than just /proc and /sys."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
    proc: Usable inode numbers for the namespace file descriptors.
    proc: Fix the namespace inode permission checks.
    proc: Generalize proc inode allocation
    userns: Allow unprivilged mounts of proc and sysfs
    userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
    procfs: Print task uids and gids in the userns that opened the proc file
    userns: Implement unshare of the user namespace
    userns: Implent proc namespace operations
    userns: Kill task_user_ns
    userns: Make create_new_namespaces take a user_ns parameter
    userns: Allow unprivileged use of setns.
    userns: Allow unprivileged users to create new namespaces
    userns: Allow setting a userns mapping to your current uid.
    userns: Allow chown and setgid preservation
    userns: Allow unprivileged users to create user namespaces.
    userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
    userns: fix return value on mntns_install() failure
    vfs: Allow unprivileged manipulation of the mount namespace.
    vfs: Only support slave subtrees across different user namespaces
    vfs: Add a user namespace reference from struct mnt_namespace
    ...

    Linus Torvalds
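
Because the namespace files now have stable inode numbers, user space can check whether two processes share a namespace by comparing what stat() reports for them. The Python sketch below assumes the per-process directory is /proc/<pid>/ns/ and that the caller has permission to stat both entries; the function names are hypothetical.

```python
import os


def same_file_identity(path_a, path_b):
    """True if both paths resolve to the same (device, inode) pair."""
    a = os.stat(path_a)  # follows the symlink to the namespace object
    b = os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def share_namespace(pid_a, pid_b, ns="pid"):
    """Compare one namespace type (e.g. "pid", "net", "mnt") of two processes.

    Assumes a Linux /proc layout; raises OSError if an entry is missing
    or not accessible.
    """
    return same_file_identity(
        f"/proc/{pid_a}/ns/{ns}",
        f"/proc/{pid_b}/ns/{ns}",
    )
```

For example, share_namespace(os.getpid(), 1, "net") would report whether the current process is in the same network namespace as pid 1, provided both entries are readable.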
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing: this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for these workloads driven by perf which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

11 Dec, 2012

2 commits

  • This patch adds Kconfig options and kernel parameters to allow the
    enabling and disabling of automatic NUMA balancing. The existence
    of such a switch was and is very important when debugging problems
    related to transparent hugepages and we should have the same for
    automatic NUMA placement.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Implement pte_numa and pmd_numa.

    We must atomically set the numa bit and clear the present bit to
    define a pte_numa or pmd_numa.

    Once a pte or pmd has been set as pte_numa or pmd_numa, the next time
    a thread touches a virtual address in the corresponding virtual range,
    a NUMA hinting page fault will trigger. The NUMA hinting page fault
    will clear the NUMA bit and set the present bit again to resolve the
    page fault.

    The expectation is that a NUMA hinting page fault is used as part
    of a placement policy that decides if a page should remain on the
    current node or be migrated to a different node.

    Acked-by: Rik van Riel
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman

    Andrea Arcangeli
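
The pte_numa life cycle described above can be modelled with plain bit operations. This is a Python sketch only: pte_mknuma(), pte_mknonnuma() and pte_numa() follow the kernel helper names, but the bit positions, the touch() function and the callback are assumptions, and the model does not show the atomicity that the real update requires.

```python
# Illustrative bit positions; real architectures define their own.
PAGE_PRESENT = 1 << 0
PAGE_NUMA = 1 << 1


def pte_mknuma(pte):
    """Set the NUMA bit and clear the present bit in one step."""
    return (pte | PAGE_NUMA) & ~PAGE_PRESENT


def pte_mknonnuma(pte):
    """Clear the NUMA bit and set the present bit again."""
    return (pte & ~PAGE_NUMA) | PAGE_PRESENT


def pte_numa(pte):
    """A pte is a NUMA hinting pte if NUMA is set and present is clear."""
    return (pte & (PAGE_NUMA | PAGE_PRESENT)) == PAGE_NUMA


def touch(pte, on_hinting_fault):
    """Model a thread touching the address mapped by pte.

    If the pte is marked pte_numa, a NUMA hinting fault fires (the
    callback stands in for the placement policy) and the pte is restored.
    """
    if pte_numa(pte):
        on_hinting_fault()
        pte = pte_mknonnuma(pte)
    return pte
```

Marking a present pte with pte_mknuma() and then calling touch() triggers the callback exactly once and returns the original pte value; a second touch() takes no fault.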