21 Dec, 2012

5 commits

  • Pull filesystem notification updates from Eric Paris:
    "This pull mostly is about locking changes in the fsnotify system. By
    switching the group lock from a spin_lock() to a mutex() we can now
    hold the lock across things like iput(). This fixes a problem
    involving unmounting a fs and having inodes be busy, first pointed out
    by FAT, but reproducible with tmpfs.

    This also restores signal driven I/O for inotify, which has been
    broken since about 2.6.32."

    Ugh. I *hate* the timing of this. It was rebased after the merge
    window opened, and then left to sit with the pull request coming the day
    before the merge window closes. That's just crap. But apparently the
    patches themselves have been around for over a year, just gathering
    dust, so now it's suddenly critical.

    Fixed up semantic conflict in fs/notify/fdinfo.c as per Stephen
    Rothwell's fixes from -next.

    * 'for-next' of git://git.infradead.org/users/eparis/notify:
    inotify: automatically restart syscalls
    inotify: dont skip removal of watch descriptor if creation of ignored event failed
    fanotify: dont merge permission events
    fsnotify: make fasync generic for both inotify and fanotify
    fsnotify: change locking order
    fsnotify: dont put marks on temporary list when clearing marks by group
    fsnotify: introduce locked versions of fsnotify_add_mark() and fsnotify_remove_mark()
    fsnotify: pass group to fsnotify_destroy_mark()
    fsnotify: use a mutex instead of a spinlock to protect a groups mark list
    fanotify: add an extra flag to mark_remove_from_mask that indicates wheather a mark should be destroyed
    fsnotify: take groups mark_lock before mark lock
    fsnotify: use reference counting for groups
    fsnotify: introduce fsnotify_get_group()
    inotify, fanotify: replace fsnotify_put_group() with fsnotify_destroy_group()

    Linus Torvalds
     
  • Merge the rest of Andrew's patches for -rc1:
    "A bunch of fixes and misc missed-out-on things.

    That'll do for -rc1. I still have a batch of IPC patches which still
    have a possible bug report which I'm chasing down."

    * emailed patches from Andrew Morton: (25 commits)
    keys: use keyring_alloc() to create module signing keyring
    keys: fix unreachable code
    sendfile: allows bypassing of notifier events
    SGI-XP: handle non-fatal traps
    fat: fix incorrect function comment
    Documentation: ABI: remove testing/sysfs-devices-node
    proc: fix inconsistent lock state
    linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
    memcg: don't register hotcpu notifier from ->css_alloc()
    checkpatch: warn on uapi #includes that #include
    mm: cma: WARN if freed memory is still in use
    exec: do not leave bprm->interp on stack
    ...

    Linus Torvalds
     
  • Pull signal handling cleanups from Al Viro:
    "sigaltstack infrastructure + conversion for x86, alpha and um,
    COMPAT_SYSCALL_DEFINE infrastructure.

    Note that there are several conflicts between "unify
    SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
    resolution is trivial - just remove definitions of SS_ONSTACK and
    SS_DISABLE from arch/*/uapi/asm/signal.h; they are all identical and
    include/uapi/linux/signal.h contains the unified variant."

    Fixed up conflicts as per Al.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    alpha: switch to generic sigaltstack
    new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
    generic compat_sys_sigaltstack()
    introduce generic sys_sigaltstack(), switch x86 and um to it
    new helper: compat_user_stack_pointer()
    new helper: restore_altstack()
    unify SS_ONSTACK/SS_DISABLE definitions
    new helper: current_user_stack_pointer()
    missing user_stack_pointer() instances
    Bury the conditionals from kernel_thread/kernel_execve series
    COMPAT_SYSCALL_DEFINE: infrastructure

    Linus Torvalds
     
  • Use keyring_alloc() to create special keyrings now that it has
    a permissions parameter rather than using key_alloc() +
    key_instantiate_and_link().

    Signed-off-by: David Howells
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • This makes it compile on s390. After all, ptrace_may_access()
    (which we use in this file) is declared in linux/ptrace.h.

    This is preparatory work to wire this syscall up on all archs.

    Signed-off-by: Cyrill Gorcunov
    Signed-off-by: Alexander Kartashov
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

20 Dec, 2012

8 commits

  • task_numa_placement() oopsed on NULL p->mm when task_numa_fault() got
    called in the handling of break_ksm() for ksmd. That might be a
    peculiar case, which perhaps KSM could take steps to avoid, but it's
    more robust if task_numa_placement() allows for such a possibility.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Pull random updates from Ted Ts'o:
    "A few /dev/random improvements for the v3.8 merge window."

    * tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
    random: Mix cputime from each thread that exits to the pool
    random: prime last_data value per fips requirements
    random: fix debug format strings
    random: make it possible to enable debugging without rebuild

    Linus Torvalds
     
  • note that they are relying on access_ok() already checked by caller.

    Signed-off-by: Al Viro

    Al Viro
     
  • Again, conditional on CONFIG_GENERIC_SIGALTSTACK

    Signed-off-by: Al Viro

    Al Viro
     
  • Conditional on CONFIG_GENERIC_SIGALTSTACK; architectures that do not
    select it are completely unaffected

    Signed-off-by: Al Viro

    Al Viro
     
  • to be used by rt_sigreturn instances

    Signed-off-by: Al Viro

    Al Viro
     
  • All architectures have
    CONFIG_GENERIC_KERNEL_THREAD
    CONFIG_GENERIC_KERNEL_EXECVE
    __ARCH_WANT_SYS_EXECVE
    None of them have __ARCH_WANT_KERNEL_EXECVE and there are only two callers
    of kernel_execve() (which is a trivial wrapper for do_execve() now) left.
    Kill the conditionals and make both callers use do_execve().

    Signed-off-by: Al Viro

    Al Viro
     
  • Commit 8d4516904b39 ("watchdog: Fix CPU hotplug regression") causes an
    oops or hard lockup when doing

    echo 0 > /proc/sys/kernel/nmi_watchdog
    echo 1 > /proc/sys/kernel/nmi_watchdog

    and the kernel is booted with nmi_watchdog=1 (default)

    Running laptop-mode-tools and disconnecting/connecting AC power will
    cause this to trigger, making it a common failure scenario on laptops.

    Instead of bailing out of watchdog_disable() when !watchdog_enabled we
    can initialize the hrtimer regardless of watchdog_enabled status. This
    makes it safe to call watchdog_disable() in the nmi_watchdog=0 case,
    without the negative effect on the enabled => disabled => enabled case.

    All these tests pass with this patch:
    - nmi_watchdog=1
      echo 0 > /proc/sys/kernel/nmi_watchdog
      echo 1 > /proc/sys/kernel/nmi_watchdog

    - nmi_watchdog=0
      echo 0 > /sys/devices/system/cpu/cpu1/online

    - nmi_watchdog=0
      echo mem > /sys/power/state

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=51661

    Cc: # v3.7
    Cc: Norbert Warmuth
    Cc: Joseph Salisbury
    Cc: Thomas Gleixner
    Signed-off-by: Bjørn Mork
    Signed-off-by: Linus Torvalds

    Bjørn Mork
     

19 Dec, 2012

7 commits

  • Pull module update from Rusty Russell:
    "Nothing all that exciting; a new module-from-fd syscall for those who
    want to verify the source of the module (ChromeOS) and/or use standard
    IMA on it or other security hooks."

    * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    MODSIGN: Fix kbuild output when using default extra_certificates
    MODSIGN: Avoid using .incbin in C source
    modules: don't hand 0 to vmalloc.
    module: Remove a extra null character at the top of module->strtab.
    ASN.1: Use the ASN1_LONG_TAG and ASN1_INDEFINITE_LENGTH constants
    ASN.1: Define indefinite length marker constant
    moduleparam: use __UNIQUE_ID()
    __UNIQUE_ID()
    MODSIGN: Add modules_sign make target
    powerpc: add finit_module syscall.
    ima: support new kernel module syscall
    add finit_module syscall to asm-generic
    ARM: add finit_module syscall to ARM
    security: introduce kernel_module_from_file hook
    module: add flags arg to sys_finit_module()
    module: add syscall to load module from fd

    Linus Torvalds
     
  • Merge patches from Andrew Morton:
    "Most of the rest of MM, plus a few dribs and drabs.

    I still have quite a few irritating patches left around: ones with
    dubious testing results, lack of review, ones which should have gone
    via maintainer trees but the maintainers are slack, etc.

    I need to be more activist in getting these things wrapped up outside
    the merge window, but they're such a PITA."

    * emailed patches from Andrew Morton: (48 commits)
    mm/vmscan.c: avoid possible deadlock caused by too_many_isolated()
    vmscan: comment too_many_isolated()
    mm/kmemleak.c: remove obsolete simple_strtoul
    mm/memory_hotplug.c: improve comments
    mm/hugetlb: create hugetlb cgroup file in hugetlb_init
    mm/mprotect.c: coding-style cleanups
    Documentation: ABI: /sys/devices/system/node/
    slub: drop mutex before deleting sysfs entry
    memcg: add comments clarifying aspects of cache attribute propagation
    kmem: add slab-specific documentation about the kmem controller
    slub: slub-specific propagation changes
    slab: propagate tunable values
    memcg: aggregate memcg cache values in slabinfo
    memcg/sl[au]b: shrink dead caches
    memcg/sl[au]b: track all the memcg children of a kmem_cache
    memcg: destroy memcg caches
    sl[au]b: allocate objects from memcg cache
    sl[au]b: always get the cache from its page in kmem_cache_free()
    memcg: skip memcg kmem allocations in specified code regions
    memcg: infrastructure to match an allocation to the right cache
    ...

    Linus Torvalds
     
  • Because those architectures will draw their stacks directly from the page
    allocator, rather than the slab cache, we can directly pass __GFP_KMEMCG
    flag, and issue the corresponding free_pages.

    This code path is taken when the architecture doesn't define
    CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
    THREAD_SIZE >= PAGE_SIZE. Luckily, most - if not all - of the remaining
    architectures fall in this category.

    This will guarantee that every stack page is accounted to the memcg the
    process currently lives on, and will cause allocations to fail if they
    go over the limit.

    For the time being, I am defining a new variant of THREADINFO_GFP, not to
    mess with the other path. Once the slab is also tracked by memcg, we can
    get rid of that flag.

    Tested to successfully protect against :(){ :|:& };:

    Signed-off-by: Glauber Costa
    Acked-by: Frederic Weisbecker
    Acked-by: Kamezawa Hiroyuki
    Reviewed-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: JoonSoo Kim
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • It is useful to know how many charges are still left after a call to
    res_counter_uncharge. While it is possible to issue a res_counter_read
    after uncharge, this can be racy.

    If we need, for instance, to take some action when the counters drop down
    to 0, only one of the callers should see it. These are the same semantics
    as the atomic variables in the kernel.

    Since the current return value is void, we don't need to worry about
    anything breaking due to this change: nobody relied on that, and only
    users appearing from now on will be checking this value.
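
    The "only one caller sees 0" property can be illustrated with a small
    userspace sketch. It uses C11 atomics and is not the kernel's
    res_counter code; struct demo_counter and demo_uncharge() are
    hypothetical names made up for this example.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a resource counter; not struct res_counter. */
struct demo_counter {
    atomic_ullong usage;
};

/*
 * Uncharge val and return the usage left afterwards. The subtraction and
 * the read of the old value happen as one atomic step, so when several
 * callers race, exactly one of them can get 0 back.
 */
static unsigned long long demo_uncharge(struct demo_counter *c,
                                        unsigned long long val)
{
    unsigned long long old = atomic_fetch_sub(&c->usage, val);

    assert(old >= val); /* uncharging more than was charged is a bug */
    return old - val;
}
```

    Reading the counter in a separate step after uncharging would lose this
    guarantee, because another uncharge could run between the two steps.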

    Signed-off-by: Glauber Costa
    Reviewed-by: Michal Hocko
    Acked-by: Kamezawa Hiroyuki
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Suleiman Souhlal
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Frederic Weisbecker
    Cc: Greg Thelen
    Cc: JoonSoo Kim
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • The array check is useless so remove it.

    [akpm@linux-foundation.org: remove comment, per David]
    Signed-off-by: Alan Cox
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • Pull minor tracing updates and fixes from Steven Rostedt:
    "It seems that one of my old pull requests has slipped through.

    The changes are contained to just the files that I maintain, and are
    changes from others that I said I would get into this merge window.

    They have already been in linux-next for several weeks, and should be
    well tested."

    * 'tip/perf/core-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Remove unnecessary WARN_ONCE's from tracing_buffers_splice_read
    tracing: Remove unneeded checks from the stack tracer
    tracing: Add a resize function to make one buffer equivalent to another buffer

    Linus Torvalds
     
  • Pull (again) user namespace infrastructure changes from Eric Biederman:
    "Those bugs, those darn embarrassing bugs just don't want to get
    fixed.

    Linus, I just updated my mirror of your kernel.org tree and it appears
    you successfully pulled everything except the last 4 commits that fix
    those embarrassing bugs.

    When you get a chance can you please repull my branch?"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Fix typo in description of the limitation of userns_install
    userns: Add a more complete capability subset test to commit_creds
    userns: Require CAP_SYS_ADMIN for most uses of setns.
    Fix cap_capable to only allow owners in the parent user namespace to have caps.

    Linus Torvalds
     

18 Dec, 2012

11 commits

  • Merge misc patches from Andrew Morton:
    "Incoming:

    - lots of misc stuff

    - backlight tree updates

    - lib/ updates

    - Oleg's percpu-rwsem changes

    - checkpatch

    - rtc

    - aoe

    - more checkpoint/restart support

    I still have a pile of MM stuff pending - Pekka should be merging
    later today after which that is good to go. A number of other things
    are twiddling thumbs awaiting maintainer merges."

    * emailed patches from Andrew Morton: (180 commits)
    scatterlist: don't BUG when we can trivially return a proper error.
    docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output
    fs, fanotify: add @mflags field to fanotify output
    docs: add documentation about /proc/<pid>/fdinfo/<fd> output
    fs, notify: add procfs fdinfo helper
    fs, exportfs: add exportfs_encode_inode_fh() helper
    fs, exportfs: escape nil dereference if no s_export_op present
    fs, epoll: add procfs fdinfo helper
    fs, eventfd: add procfs fdinfo helper
    procfs: add ability to plug in auxiliary fdinfo providers
    tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
    breakpoint selftests: print failure status instead of cause make error
    kcmp selftests: print fail status instead of cause make error
    kcmp selftests: make run_tests fix
    mem-hotplug selftests: print failure status instead of cause make error
    cpu-hotplug selftests: print failure status instead of cause make error
    mqueue selftests: print failure status instead of cause make error
    vm selftests: print failure status instead of cause make error
    ubifs: use prandom_bytes
    mtd: nandsim: use prandom_bytes
    ...

    Linus Torvalds
     
  • Since commit 1cdcbec1a337 ("CRED: Neuter sys_capset()")
    is_container_init() has no callers.

    Signed-off-by: Gao feng
    Cc: David Howells
    Acked-by: Serge Hallyn
    Cc: James Morris
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gao feng
     
  • Ptrace jailers want to be sure that the tracee can never escape
    from the control. However if the tracer dies unexpectedly the
    tracee continues to run in potentially unsafe mode.

    Add the new ptrace option PTRACE_O_EXITKILL. If the tracer exits
    it sends SIGKILL to every tracee which has this bit set.

    Note that the new option is not equal to last-option << 1, because
    currently all options have an event and the new one starts the eventless
    group. It uses bit 20, chosen arbitrarily, so there is room for 12 more
    events, and new eventless options can also be added below this one.
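
    The bit arithmetic behind "room for 12 more events" can be checked with
    a short sketch. The DEMO_ constants mirror the values in the kernel's
    uapi ptrace.h as of this commit (last event PTRACE_EVENT_SECCOMP = 7,
    PTRACE_O_EXITKILL = 1 << 20); they are redefined locally so the example
    does not depend on header versions, and demo_free_event_bits() is a
    hypothetical helper.

```c
#include <assert.h>

/* Local copies of the uapi values, assumed rather than included. */
#define DEMO_PTRACE_EVENT_SECCOMP   7
#define DEMO_PTRACE_O_TRACESECCOMP  (1u << DEMO_PTRACE_EVENT_SECCOMP)
#define DEMO_PTRACE_O_EXITKILL      (1u << 20)

/* Count the unused option bits between the last event option and EXITKILL. */
static unsigned int demo_free_event_bits(void)
{
    unsigned int last_event_bit = DEMO_PTRACE_EVENT_SECCOMP;
    unsigned int exitkill_bit = 20;

    assert(exitkill_bit > last_event_bit);
    return exitkill_bit - last_event_bit - 1; /* bits 8..19 */
}
```

    A tracer would typically pass the option via PTRACE_SETOPTIONS or
    PTRACE_SEIZE, alongside whatever event options it already uses.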

    Suggested by Amnon Shiloh.

    Signed-off-by: Oleg Nesterov
    Tested-by: Amnon Shiloh
    Cc: Denys Vlasenko
    Cc: Michael Kerrisk
    Cc: Serge Hallyn
    Cc: Chris Evans
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This function is used by sparc, powerpc, tile and arm64 for compat support.
    The patch adds a generic implementation with a wrapper for PowerPC to do
    the u32->int sign extension.

    The reason for a single patch covering powerpc, tile, sparc and arm64 is
    to keep it bisectable, otherwise kernel building may fail with mismatched
    function declarations.

    Signed-off-by: Catalin Marinas
    Acked-by: Chris Metcalf [for tile]
    Acked-by: David S. Miller
    Acked-by: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • Signed-off-by: Andy Shevchenko
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • The boot_delay parameter affects all printk() calls, even if the log
    level prevents visible output from the call. This results in needless
    delays that are longer than the user intended.

    This patch changes the behaviour of boot_delay to only delay output.

    Signed-off-by: Andrew Cooks
    Acked-by: Randy Dunlap
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Cooks
     
  • Currently, getting the sample period always goes through a
    calculation: get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5).

    We can store the sample period in a variable instead, and mark it
    __read_mostly.
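
    A minimal userspace sketch of the caching idea follows. All demo_ names
    are hypothetical, the threshold of 10 seconds is an assumed value, and
    the doubling inside demo_get_softlockup_thresh() is an assumption about
    what the real helper does.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

static int demo_watchdog_thresh = 10; /* assumed threshold, in seconds */
static uint64_t demo_sample_period;   /* cached; the kernel would mark
                                         this __read_mostly */

/* Stand-in for get_softlockup_thresh(); the doubling is an assumption. */
static int demo_get_softlockup_thresh(void)
{
    return demo_watchdog_thresh * 2;
}

/* Recompute the cached period; call only when the threshold changes. */
static void demo_set_sample_period(void)
{
    assert(demo_watchdog_thresh > 0);
    demo_sample_period =
        demo_get_softlockup_thresh() * (DEMO_NSEC_PER_SEC / 5);
}

/* Hot path: read the cached value instead of recalculating it. */
static uint64_t demo_get_sample_period(void)
{
    return demo_sample_period;
}
```

    The cost of the calculation moves to the rare path where the threshold
    is changed, and the frequently executed timer path becomes a plain load.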

    Signed-off-by: liu chuansheng
    Cc: Don Zickus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chuansheng Liu
     
  • But the kernel decided to call it "origin" instead. Fix most of the
    sites.

    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • In commit 9c0ece069b32 ("Get rid of Documentation/feature-removal.txt"),
    Linus removed feature-removal-schedule.txt from Documentation, but there
    are still some references to this file. Remove them.

    Signed-off-by: Tao Ma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tao Ma
     
  • Pull user namespace changes from Eric Biederman:
    "While small, this set of changes is very significant with respect to
    containers in general and user namespaces in particular. The user
    space interface is now complete.

    This set of changes adds support for unprivileged users to create user
    namespaces and as a user namespace root to create other namespaces.
    The tyranny of supporting suid root preventing unprivileged users from
    using cool new kernel features is broken.

    This set of changes completes the work on setns, adding support for
    the pid, user, mount namespaces.

    This set of changes includes a bunch of basic pid namespace
    cleanups/simplifications. Of particular significance is the rework of
    the pid namespace cleanup so it no longer requires sending out
    tendrils into all kinds of unexpected cleanup paths for operation. At
    least one case of broken error handling is fixed by this cleanup.

    The files under /proc/<pid>/ns/ have been converted from regular files
    to magic symlinks which prevents incorrect caching by the VFS,
    ensuring the files always refer to the namespace the process is
    currently using and ensuring that the ptrace_may_access permission
    checks are always applied.

    The files under /proc/<pid>/ns/ have been given stable inode numbers
    so it is now possible to see if different processes share the same
    namespaces.

    Coming through David Miller's net tree are changes that relax many of
    the permission checks in the networking stack, allowing the user
    namespace root to usefully use the networking stack. Similar changes
    for the mount namespace and the pid namespace are coming through my
    tree.

    Two small changes to add user namespace support were committed here and
    in David Miller's -net tree so that I could complete the work on the
    /proc/<pid>/ns/ files in this tree.

    Work remains to make it safe to build user namespaces and 9p, afs,
    ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
    Kconfig guard remains in place preventing user namespaces from
    being built when any of those filesystems are enabled.

    Future design work remains to allow root users outside of the initial
    user namespace to mount more than just /proc and /sys."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
    proc: Usable inode numbers for the namespace file descriptors.
    proc: Fix the namespace inode permission checks.
    proc: Generalize proc inode allocation
    userns: Allow unprivilged mounts of proc and sysfs
    userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
    procfs: Print task uids and gids in the userns that opened the proc file
    userns: Implement unshare of the user namespace
    userns: Implent proc namespace operations
    userns: Kill task_user_ns
    userns: Make create_new_namespaces take a user_ns parameter
    userns: Allow unprivileged use of setns.
    userns: Allow unprivileged users to create new namespaces
    userns: Allow setting a userns mapping to your current uid.
    userns: Allow chown and setgid preservation
    userns: Allow unprivileged users to create user namespaces.
    userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
    userns: fix return value on mntns_install() failure
    vfs: Allow unprivileged manipulation of the mount namespace.
    vfs: Only support slave subtrees across different user namespaces
    vfs: Add a user namespace reference from struct mnt_namespace
    ...

    Linus Torvalds
     
  • Michal Hocko reported that the following build error occurs if
    CONFIG_NUMA_BALANCING is set without THP support

    kernel/sched/fair.c: In function ‘task_numa_work’:
    kernel/sched/fair.c:932:55: error: call to ‘__build_bug_failed’ declared with attribute error: BUILD_BUG failed

    The problem is that HPAGE_PMD_SHIFT triggers a BUILD_BUG() on
    !CONFIG_TRANSPARENT_HUGEPAGE. This patch addresses the problem.

    Reported-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Dec, 2012

3 commits

  • When a thread exits, mix its cputime (userspace + kernelspace) into the entropy pool.

    We don't know how "random" this is, so we use add_device_randomness(), which
    doesn't touch the entropy count.

    Signed-off-by: Nick Kossifidis
    Signed-off-by: Theodore Ts'o

    Nick Kossifidis
     
  • Pull security subsystem updates from James Morris:
    "A quiet cycle for the security subsystem with just a few maintenance
    updates."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
    Smack: create a sysfs mount point for smackfs
    Smack: use select not depends in Kconfig
    Yama: remove locking from delete path
    Yama: add RCU to drop read locking
    drivers/char/tpm: remove tasklet and cleanup
    KEYS: Use keyring_alloc() to create special keyrings
    KEYS: Reduce initial permissions on keys
    KEYS: Make the session and process keyrings per-thread
    seccomp: Make syscall skipping and nr changes more consistent
    key: Fix resource leak
    keys: Fix unreachable code
    KEYS: Add payload preparsing opportunity prior to key instantiate or update

    Linus Torvalds
     
  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

16 Dec, 2012

1 commit

  • Pull crypto update from Herbert Xu:

    - Added aesni/avx/x86_64 implementations for camellia.

    - Optimised AVX code for cast5/serpent/twofish/cast6.

    - Fixed vmac bug with unaligned input.

    - Allow compression algorithms in FIPS mode.

    - Optimised crc32c implementation for Intel.

    - Misc fixes.

    * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (32 commits)
    crypto: caam - Updated SEC-4.0 device tree binding for ERA information.
    crypto: testmgr - remove superfluous initializers for xts(aes)
    crypto: testmgr - allow compression algs in fips mode
    crypto: testmgr - add larger crc32c test vector to test FPU path in crc32c_intel
    crypto: testmgr - clean alg_test_null entries in alg_test_descs[]
    crypto: testmgr - remove fips_allowed flag from camellia-aesni null-tests
    crypto: cast5/cast6 - move lookup tables to shared module
    padata: use __this_cpu_read per-cpu helper
    crypto: s5p-sss - Fix compilation error
    crypto: picoxcell - Add terminating entry for platform_device_id table
    crypto: omap-aes - select BLKCIPHER2
    crypto: camellia - add AES-NI/AVX/x86_64 assembler implementation of camellia cipher
    crypto: camellia-x86_64 - share common functions and move structures and function definitions to header file
    crypto: tcrypt - add async speed test for camellia cipher
    crypto: tegra-aes - fix error-valued pointer dereference
    crypto: tegra - fix missing unlock on error case
    crypto: cast5/avx - avoid using temporary stack buffers
    crypto: serpent/avx - avoid using temporary stack buffers
    crypto: twofish/avx - avoid using temporary stack buffers
    crypto: cast6/avx - avoid using temporary stack buffers
    ...

    Linus Torvalds
     

15 Dec, 2012

3 commits

  • Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • When unsharing a user namespace we reduce our credentials to just what
    can be done in that user namespace. This is a subset of the credentials
    we previously had. Teach commit_creds to recognize this is a subset
    of the credentials we have had before and don't clear the dumpability flag.

    This allows an unprivileged program to do:
        unshare(CLONE_NEWUSER);
        fd = open("/proc/self/uid_map", O_RDWR);

    Where previously opening the uid_map writable would fail because
    the task had been made non-dumpable.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Andy Lutomirski found a nasty little bug in
    the permissions of setns. With unprivileged user namespaces it
    became possible to create new namespaces without privilege.

    However the setns calls were relaxed to only require CAP_SYS_ADMIN in
    the user namespace of the target namespace.

    Which made the following nasty sequence possible.

        pid = clone(CLONE_NEWUSER | CLONE_NEWNS);
        if (pid == 0) { /* child */
            system("mount --bind /home/me/passwd /etc/passwd");
        }
        else if (pid != 0) { /* parent */
            char path[PATH_MAX];
            snprintf(path, sizeof(path), "/proc/%u/ns/mnt", pid);
            fd = open(path, O_RDONLY);
            setns(fd, 0);
            system("su -");
        }

    Prevent this possibility by requiring CAP_SYS_ADMIN
    in the current user namespace when joining all but the user namespace.

    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

14 Dec, 2012

2 commits

  • This reverts commit f269ae0469fc882332bdfb5db15d3c1315fe2a10.

    It turns out it causes a very noticeable interactivity regression with
    CONFIG_SCHED_AUTOGROUP (test-case: "make -j32" of the kernel in a
    terminal window, while scrolling in a browser - the autogrouping means
    that the two end up in separate cgroups, and the browser should be
    smooth as silk despite the high load).

    Says Paul Turner:
    "It seems that the update-throttling on the wake-side is reducing the
    interactive tasks' ability to preempt. While I suspect the right
    longer term answer here is force these updates only in the
    cross-cgroup case; this is less trivial. For this release I believe
    the right answer is either going to be a revert or restore the updates
    on the enqueue-side."

    Reported-by: Linus Torvalds
    Bisected-by: Mike Galbraith
    Acked-by: Paul Turner
    Acked-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Reported-by: Peter Foley
    Signed-off-by: Michal Marek
    Acked-by: Peter Foley
    Signed-off-by: Rusty Russell

    Michal Marek