24 May, 2016

1 commit

  • PR_SET_THP_DISABLE requires mmap_sem for write. If the waiting task
    gets killed by the oom killer it would block oom_reaper from
    asynchronous address space reclaim and reduce the chances of timely OOM
    resolving. Wait for the lock in the killable mode and return with EINTR
    if the task got killed while waiting.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

18 Mar, 2016

1 commit

  • This patchset introduces a /proc/<pid>/timerslack_ns interface which
    would allow controlling processes to set the timerslack value
    on other processes in order to save power by avoiding wakeups (something
    Android currently does via out-of-tree patches).

    The first patch tries to fix the internal timer_slack_ns usage which was
    defined as a long, which limits the slack range to ~4 seconds on 32bit
    systems. It converts it to a u64, which provides the same basically
    unlimited slack (500 years) on both 32bit and 64bit machines.

    The second patch introduces the /proc/<pid>/timerslack_ns interface
    which allows the full 64bit slack range for a task to be read or set on
    both 32bit and 64bit machines.

    With these two patches, on a 32bit machine, after setting the slack on
    bash to 10 seconds:

    $ time sleep 1

    real 0m10.747s
    user 0m0.001s
    sys 0m0.005s

    The first patch is a little ugly, since I had to chase the slack delta
    arguments through a number of functions converting them to u64s. Let me
    know if it makes sense to break that up more or not.

    Other than that things are fairly straightforward.

    This patch (of 2):

    The timer_slack_ns value in the task struct is currently an unsigned
    long. This means that on 32bit applications, the maximum slack is just
    over 4 seconds. However, on 64bit machines, it's much, much larger (~500
    years).

    This disparity could make application development a little confusing,
    so this patch converts timer_slack_ns (as well as the default_slack)
    to a u64. This means both 32bit and 64bit systems
    have the same effective internal slack range.

    Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specifies
    the interface as an unsigned long, so we preserve that limitation on
    32bit systems, where SET_TIMERSLACK can only set the slack to an unsigned
    long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
    actually larger than what can be stored by an unsigned long.
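
    The clamping described above can be sketched in a few lines. This is
    only an illustration: slack_reported() and its abi_max parameter are
    made-up names, with abi_max standing in for ULONG_MAX on the machine
    in question (UINT32_MAX on a 32bit system).

```c
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: report a 64-bit internal slack
 * through an interface whose widest representable value is abi_max.
 */
static uint64_t slack_reported(uint64_t slack_ns, uint64_t abi_max)
{
    /* Values that do not fit are reported as the largest one that does. */
    return slack_ns > abi_max ? abi_max : slack_ns;
}
```

    With a 10 second slack (10000000000 ns) and abi_max set to UINT32_MAX,
    the helper reports UINT32_MAX, which is the "just over 4 seconds" limit
    mentioned earlier.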

    This patch also modifies hrtimer functions which specified the slack
    delta as an unsigned long.

    Signed-off-by: John Stultz
    Cc: Arjan van de Ven
    Cc: Thomas Gleixner
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Kees Cook
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     

21 Jan, 2016

1 commit

  • An unprivileged user can trigger an oops on a kernel with
    CONFIG_CHECKPOINT_RESTORE.

    proc_pid_cmdline_read takes mmap_sem for reading and obtains args + env
    start/end values. These get sanity checked as follows:
    BUG_ON(arg_start > arg_end);
    BUG_ON(env_start > env_end);

    These can be changed by prctl_set_mm. It turns out that prctl_set_mm also
    takes the semaphore for reading, effectively rendering it useless. This
    results in:

    kernel BUG at fs/proc/base.c:240!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: virtio_net
    CPU: 0 PID: 925 Comm: a.out Not tainted 4.4.0-rc8-next-20160105dupa+ #71
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff880077a68000 ti: ffff8800784d0000 task.ti: ffff8800784d0000
    RIP: proc_pid_cmdline_read+0x520/0x530
    RSP: 0018:ffff8800784d3db8 EFLAGS: 00010206
    RAX: ffff880077c5b6b0 RBX: ffff8800784d3f18 RCX: 0000000000000000
    RDX: 0000000000000002 RSI: 00007f78e8857000 RDI: 0000000000000246
    RBP: ffff8800784d3e40 R08: 0000000000000008 R09: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000050
    R13: 00007f78e8857800 R14: ffff88006fcef000 R15: ffff880077c5b600
    FS: 00007f78e884a740(0000) GS:ffff88007b200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f78e8361770 CR3: 00000000790a5000 CR4: 00000000000006f0
    Call Trace:
    __vfs_read+0x37/0x100
    vfs_read+0x82/0x130
    SyS_read+0x58/0xd0
    entry_SYSCALL_64_fastpath+0x12/0x76
    Code: 4c 8b 7d a8 eb e9 48 8b 9d 78 ff ff ff 4c 8b 7d 90 48 8b 03 48 39 45 a8 0f 87 f0 fe ff ff e9 d1 fe ff ff 4c 8b 7d 90 eb c6 0f 0b 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
    RIP proc_pid_cmdline_read+0x520/0x530
    ---[ end trace 97882617ae9c6818 ]---

    Turns out there are instances where the code just reads the aforementioned
    values without any locking whatsoever - namely environ_read and get_cmdline.

    Interestingly these functions look quite resilient against bogus values,
    but I don't believe this should be relied upon.

    The first patch gets rid of the oops bug by grabbing mmap_sem for
    writing.

    The second patch is optional and puts locking around the aforementioned
    consumers for safety. Consumers of other fields don't seem to benefit
    from similar treatment and are left untouched.

    This patch (of 2):

    The code was taking the semaphore for reading, which protects neither
    against other readers nor against concurrent modifications.

    The problem could cause a sanity check to fail in procfs's cmdline
    reader, resulting in an OOPS.

    Note that some functions perform an unlocked read of various mm fields,
    but they seem to be fine despite possible modification.

    Signed-off-by: Mateusz Guzik
    Acked-by: Cyrill Gorcunov
    Cc: Alexey Dobriyan
    Cc: Jarod Wilson
    Cc: Jan Stancek
    Cc: Al Viro
    Cc: Anshuman Khandual
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mateusz Guzik
     

07 Nov, 2015

1 commit

  • setpriority(PRIO_USER, 0, x) will change the priority of tasks outside of
    the current pid namespace. This is in contrast to both the other modes of
    setpriority and the example of kill(-1). Fix this. getpriority and
    ioprio have the same failure mode, fix them too.

    Eric said:

    : After some more thinking about it this patch sounds justifiable.
    :
    : My goal with namespaces is not to build perfect isolation mechanisms
    : as that can get into ill defined territory, but to build well defined
    : mechanisms. And to handle the corner cases so you can use only
    : a single namespace with well defined results.
    :
    : In this case you have found the two interfaces I am aware of that
    : identify processes by uid instead of by pid. Which quite frankly is
    : weird. Unfortunately the weird unexpected cases are hard to handle
    : in the usual way.
    :
    : I was hoping for a little more information. Changes like this one we
    : have to be careful of because someone might be depending on the current
    : behavior. I don't think they are and I do think this make sense as part
    : of the pid namespace.

    Signed-off-by: Ben Segall
    Cc: Oleg Nesterov
    Cc: Al Viro
    Cc: Ambrose Feinstein
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Segall
     

10 Jul, 2015

1 commit

  • Today proc and sysfs do not contain any executable files. Several
    applications today mount proc or sysfs without noexec and nosuid and
    then depend on there being no executable files on proc or sysfs.
    Having any executable files show up on proc or sysfs would cause
    a user space visible regression, and most likely security problems.

    Therefore commit to never allowing executables on proc and sysfs by
    adding a new flag to mark them as filesystems without executables and
    enforce that flag.

    Test the flag where MNT_NOEXEC is tested today, so that the only user
    visible effect will be that executables will be treated as if the
    execute bit is cleared.
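
    A minimal sketch of that test, assuming one flag word for the mount
    and one for the filesystem. The DEMO_ names and bit values are invented
    for illustration and are not the kernel's definitions.

```c
#include <stdbool.h>

/* Illustrative bit values only; not taken from the kernel headers. */
#define DEMO_MNT_NOEXEC 0x1u /* mount was made with "noexec" */
#define DEMO_FS_NOEXEC  0x2u /* hypothetical "filesystem never has executables" flag */

/* Both flags are consulted at the same place, so they behave identically. */
static bool demo_path_noexec(unsigned mnt_flags, unsigned fs_flags)
{
    return (mnt_flags & DEMO_MNT_NOEXEC) || (fs_flags & DEMO_FS_NOEXEC);
}

/* A file is treated as executable only if it has an execute bit
 * and neither the mount nor the filesystem forbids execution. */
static bool demo_may_exec(unsigned mode, unsigned mnt_flags, unsigned fs_flags)
{
    bool has_x_bit = (mode & 0111) != 0;

    return has_x_bit && !demo_path_noexec(mnt_flags, fs_flags);
}
```

    Under these assumptions a mode 0755 file on a filesystem carrying the
    new flag is handled exactly like the same file on a noexec mount.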

    The filesystems proc and sysfs do not currently incorporate any
    executable files so this does not result in any user visible effects.

    This makes it unnecessary to vet changes to proc and sysfs tightly for
    adding executable files or changes to chattr that would modify
    existing files, as no matter what the individual files say they will
    not be treated as executable files by the vfs.

    Not having to vet changes too closely is important as without this we
    are only one proc_create call (or another goof up in the
    implementation of notify_change) from having problematic executables
    on proc. Those mistakes are all too easy to make and would create
    a situation where there are security issues or the assumptions of
    some program have to be broken (and cause userspace regressions).

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

26 Jun, 2015

1 commit

  • Individual prctl(PR_SET_MM_*) calls do some checking to maintain a
    consistent view of mm->arg_start et al fields, but not enough. In
    particular PR_SET_MM_ARG_START/PR_SET_MM_ARG_END/PR_SET_MM_ENV_START/
    PR_SET_MM_ENV_END only check that the address lies in an existing VMA,
    but don't check that the start address is lower than the end address _at
    all_.

    Consolidate all consistency checks, so there will be no difference in
    the future between PR_SET_MM_MAP and individual PR_SET_MM_* calls.

    The program below makes both ARGV and ENVP areas be reversed. It makes
    /proc/$PID/cmdline show garbage (it doesn't oops by luck).

    #include <sys/mman.h>   /* mmap() */
    #include <sys/prctl.h>  /* prctl() */
    #include <unistd.h>     /* pause() */

    enum {PAGE_SIZE=4096};

    int main(void)
    {
            void *p;

            p = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

    #define PR_SET_MM 35
    #define PR_SET_MM_ARG_START 8
    #define PR_SET_MM_ARG_END 9
    #define PR_SET_MM_ENV_START 10
    #define PR_SET_MM_ENV_END 11
            /* Deliberately place each start address above its end address. */
            prctl(PR_SET_MM, PR_SET_MM_ARG_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0);
            prctl(PR_SET_MM, PR_SET_MM_ARG_END, (unsigned long)p, 0, 0);
            prctl(PR_SET_MM, PR_SET_MM_ENV_START, (unsigned long)p + PAGE_SIZE - 1, 0, 0);
            prctl(PR_SET_MM, PR_SET_MM_ENV_END, (unsigned long)p, 0, 0);

            /* Keep the process alive so /proc/$PID/cmdline can be inspected. */
            pause();
            return 0;
    }

    [akpm@linux-foundation.org: tidy code, tweak comment]
    Signed-off-by: Alexey Dobriyan
    Acked-by: Cyrill Gorcunov
    Cc: Jarod Wilson
    Cc: Jan Stancek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

09 Jun, 2015

1 commit

  • The MPX code can only work on the current task. You can not,
    for instance, enable MPX management in another process or
    thread. You can also not handle a fault for another process or
    thread.

    Despite this, we pass a task_struct around prolifically. This
    patch removes all of the task struct passing for code paths
    where the code can not deal with another task (which turns out
    to be all of them).

    This has no functional changes. It's just a cleanup.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: bp@alien8.de
    Link: http://lkml.kernel.org/r/20150607183702.6A81DA2C@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

17 Apr, 2015

1 commit

  • Oleg cleverly suggested using xchg() to set the new mm->exe_file instead
    of calling set_mm_exe_file() which requires some form of serialization --
    mmap_sem in this case. For archs that do not have atomic rmw instructions
    we still fall back to a spinlock alternative, so this should always be
    safe. As such, we only need the mmap_sem for looking up the backing
    vm_file, which can be done sharing the lock. Naturally, this means we
    need to manually deal with both the new and old file reference counting,
    and we need not worry about the MMF_EXE_FILE_CHANGED bits, which can
    probably be deleted in the future anyway.
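
    The exchange-and-drop pattern described here can be sketched in
    userspace with C11 atomics. All demo_ names are hypothetical stand-ins
    (they are not the kernel's structures or helpers), and
    atomic_exchange() plays the role of xchg().

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for a refcounted file and the owning mm. */
struct demo_file { atomic_int refcount; };
struct demo_mm   { _Atomic(struct demo_file *) exe_file; };

static void demo_get_file(struct demo_file *f) { atomic_fetch_add(&f->refcount, 1); }
static void demo_put_file(struct demo_file *f) { atomic_fetch_sub(&f->refcount, 1); }

/* Install new_file without a lock: take a reference for the new pointer,
 * swap it in atomically, then drop the reference the old pointer held. */
static void demo_set_exe_file(struct demo_mm *mm, struct demo_file *new_file)
{
    struct demo_file *old;

    demo_get_file(new_file);
    old = atomic_exchange(&mm->exe_file, new_file);
    if (old)
        demo_put_file(old);
}
```

    Because the swap itself is atomic, two concurrent callers each see a
    distinct old pointer, so every reference is dropped exactly once.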

    Signed-off-by: Davidlohr Bueso
    Suggested-by: Oleg Nesterov
    Acked-by: Oleg Nesterov
    Reviewed-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

16 Apr, 2015

1 commit

  • There are a lot of embedded systems that run most or all of their
    functionality in init, running as root:root. For these systems,
    supporting multiple users is not necessary.

    This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
    non-root users, non-root groups, and capabilities optional. It is enabled
    under CONFIG_EXPERT menu.

    When this symbol is not defined, UID and GID are zero in any possible case
    and processes always have all capabilities.

    The following syscalls are compiled out: setuid, setregid, setgid,
    setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
    getgroups, setfsuid, setfsgid, capget, capset.

    Also, groups.c is compiled out completely.

    In kernel/capability.c, capable function was moved in order to avoid
    adding two ifdef blocks.

    This change saves about 25 KB on a defconfig build. The most minimal
    kernels have total text sizes in the high hundreds of kB rather than
    low MB. (The 25k goes down a bit with allnoconfig, but not that much.)

    The kernel was booted in Qemu. All the common functionalities work.
    Adding users/groups is not possible, failing with -ENOSYS.

    Bloat-o-meter output:
    add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Iulia Manda
    Reviewed-by: Josh Triplett
    Acked-by: Geert Uytterhoeven
    Tested-by: Paul E. McKenney
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Iulia Manda
     

01 Mar, 2015

1 commit

  • There's a uname workaround for broken userspace which can't handle kernel
    versions of 3.x. Update it for 4.x.
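
    For readers unfamiliar with the workaround, it reports a fake 2.6.x
    release string to programs that ask for it. The sketch below assumes
    that 4.x is mapped to 2.6.(60 + x); that offset is not stated in the
    message above, and the function and macro names are invented.

```c
#include <stdio.h>

/* Assumption: offset used for 4.x kernels; not given in the commit message. */
#define DEMO_UNAME26_OFFSET_4X 60u

/* Hypothetical helper: format the fake release string for kernel 4.<minor>.
 * Returns the number of characters snprintf() would have written. */
static int demo_uname26_release(char *buf, size_t len, unsigned minor)
{
    return snprintf(buf, len, "2.6.%u", minor + DEMO_UNAME26_OFFSET_4X);
}
```

    Under that assumption, kernel 4.0 would be reported as 2.6.60.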

    Signed-off-by: Jon DeVree
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jon DeVree
     

22 Feb, 2015

1 commit

  • Pull MIPS updates from Ralf Baechle:
    "This is the main pull request for MIPS:

    - a number of fixes that didn't make the 3.19 release.

    - a number of cleanups.

    - preliminary support for Cavium's Octeon 3 SOCs which feature up to
    48 MIPS64 R3 cores with FPU and hardware virtualization.

    - support for MIPS R6 processors.

    Revision 6 of the MIPS architecture is a major revision of the MIPS
    architecture which does away with many of the original sins of the
    architecture such as branch delay slots. This and other changes in
    R6 require major changes throughout the entire MIPS core
    architecture code and make up the lion's share of this pull
    request.

    - finally some preparatory work for eXtended Physical Address
    support, which allows support of up to 40 bits of physical address
    space on 32 bit processors"

    [ Ahh, MIPS can't leave the PAE brain damage alone. It's like
    every CPU architect has to make that mistake, but pee in the snow
    by changing the TLA. But whether it's called PAE, LPAE or XPA,
    it's horrid crud - Linus ]

    * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (114 commits)
    MIPS: sead3: Corrected get_c0_perfcount_int
    MIPS: mm: Remove dead macro definitions
    MIPS: OCTEON: irq: add CIB and other fixes
    MIPS: OCTEON: Don't do acknowledge operations for level triggered irqs.
    MIPS: OCTEON: More OCTEONIII support
    MIPS: OCTEON: Remove setting of processor specific CVMCTL icache bits.
    MIPS: OCTEON: Core-15169 Workaround and general CVMSEG cleanup.
    MIPS: OCTEON: Update octeon-model.h code for new SoCs.
    MIPS: OCTEON: Implement DCache errata workaround for all CN6XXX
    MIPS: OCTEON: Add little-endian support to asm/octeon/octeon.h
    MIPS: OCTEON: Implement the core-16057 workaround
    MIPS: OCTEON: Delete unused COP2 saving code
    MIPS: OCTEON: Use correct instruction to read 64-bit COP0 register
    MIPS: OCTEON: Save and restore CP2 SHA3 state
    MIPS: OCTEON: Fix FP context save.
    MIPS: OCTEON: Save/Restore wider multiply registers in OCTEON III CPUs
    MIPS: boot: Provide more uImage options
    MIPS: Remove unneeded #ifdef __KERNEL__ from asm/processor.h
    MIPS: ip22-gio: Remove legacy suspend/resume support
    mips: pci: Add ifdef around pci_proc_domain
    ...

    Linus Torvalds
     

12 Feb, 2015

1 commit

  • Userland code may be built using an ABI which permits linking to objects
    that have more restrictive floating point requirements. For example,
    userland code may be built to target the O32 FPXX ABI. Such code may be
    linked with other FPXX code, or code built for either one of the more
    restrictive FP32 or FP64. When linking with more restrictive code, the
    overall requirement of the process becomes that of the more restrictive
    code. The kernel has no way to know in advance which mode the process
    will need to be executed in, and indeed it may need to change during
    execution. The dynamic loader is the only code which will know the
    overall required mode, and so it needs to have a means to instruct the
    kernel to switch the FP mode of the process.

    This patch introduces 2 new options to the prctl syscall which provide
    such a capability. The FP mode of the process is represented as a
    simple bitmask combining a number of mode bits mirroring those present
    in the hardware. Userland can either retrieve the current FP mode of
    the process:

    mode = prctl(PR_GET_FP_MODE);

    or modify the current FP mode of the process:

    err = prctl(PR_SET_FP_MODE, new_mode);
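
    As an illustration of the bitmask, the sketch below decodes a mode
    value such as the one returned by PR_GET_FP_MODE. The two bit
    definitions follow <linux/prctl.h> as far as I know and are repeated
    here so the example builds on any host; the demo_ helpers are invented.

```c
#include <stdbool.h>

/* Mode bits, repeated here for a self-contained example. */
#ifndef PR_FP_MODE_FR
#define PR_FP_MODE_FR  (1 << 0) /* 64-bit FP registers */
#endif
#ifndef PR_FP_MODE_FRE
#define PR_FP_MODE_FRE (1 << 1) /* 32-bit compatibility */
#endif

/* Hypothetical helpers: test individual bits of an FP mode bitmask. */
static bool demo_fp_mode_has_fr(int mode)  { return (mode & PR_FP_MODE_FR) != 0; }
static bool demo_fp_mode_has_fre(int mode) { return (mode & PR_FP_MODE_FRE) != 0; }
```

    A dynamic loader would typically read the mode, compare it against
    what the loaded objects require, and only then call PR_SET_FP_MODE.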

    Signed-off-by: Paul Burton
    Cc: Matthew Fortune
    Cc: Markos Chandras
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/8899/
    Signed-off-by: Ralf Baechle

    Paul Burton
     

23 Jan, 2015

1 commit

  • Description from Michael Kerrisk. He suggested an identical patch
    to one I had already coded up and tested.

    commit fe3d197f8431 "x86, mpx: On-demand kernel allocation of bounds
    tables" added two new prctl() operations, PR_MPX_ENABLE_MANAGEMENT and
    PR_MPX_DISABLE_MANAGEMENT. However, no checks were included to ensure
    that unused arguments are zero, as is done in many existing prctl()s
    and as should be done for all new prctl()s. This patch adds the
    required checks.
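
    The kind of check being added can be sketched as follows. The helper
    name is made up; in the kernel the test would sit inline in the
    prctl() handler rather than in a separate function.

```c
#include <errno.h>

/* Hypothetical helper: an operation that takes no extra arguments
 * rejects any call where one of them is non-zero. */
static long demo_check_unused_args(unsigned long arg2, unsigned long arg3,
                                   unsigned long arg4, unsigned long arg5)
{
    if (arg2 || arg3 || arg4 || arg5)
        return -EINVAL;
    return 0;
}
```

    Rejecting non-zero values today keeps those argument slots free to
    take on a meaning later without breaking existing callers.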

    Suggested-by: Andy Lutomirski
    Suggested-by: Michael Kerrisk
    Signed-off-by: Dave Hansen
    Cc: Dave Hansen
    Link: http://lkml.kernel.org/r/20150108223022.7F56FD13@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     

18 Nov, 2014

1 commit

  • This is really the meat of the MPX patch set. If there is one patch to
    review in the entire series, this is the one. There is a new ABI here
    and this kernel code also interacts with userspace memory in a
    relatively unusual manner. (small FAQ below).

    Long Description:

    This patch adds two prctl() commands to enable or disable the
    management of bounds tables in kernel, including on-demand kernel
    allocation (See the patch "on-demand kernel allocation of bounds tables")
    and cleanup (See the patch "cleanup unused bound tables"). Applications
    do not strictly need the kernel to manage bounds tables and we expect
    some applications to use MPX without taking advantage of this kernel
    support. This means the kernel can not simply infer whether an application
    needs bounds table management from the MPX registers. The prctl() is an
    explicit signal from userspace.

    PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
    require kernel's help in managing bounds tables.

    PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace doesn't
    want the kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
    won't allocate and free bounds tables even if the CPU supports MPX.

    PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
    directory out of a userspace register (bndcfgu) and then cache it into
    a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT
    will set "bd_addr" to an invalid address. Using this scheme, we can
    use "bd_addr" to determine whether the management of bounds tables in
    kernel is enabled.
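
    The caching scheme amounts to a sentinel value in one field. A rough
    sketch, with an invented struct, invented helper names and an
    arbitrarily chosen sentinel:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "no bounds directory registered". */
#define DEMO_INVALID_BD_ADDR UINTPTR_MAX

/* Hypothetical stand-in for the field cached in the mm. */
struct demo_mpx_mm { uintptr_t bd_addr; };

/* Enable: remember the directory base that was read from the register. */
static void demo_mpx_enable(struct demo_mpx_mm *mm, uintptr_t bd_base)
{
    mm->bd_addr = bd_base;
}

/* Disable: overwrite the cached base with the sentinel. */
static void demo_mpx_disable(struct demo_mpx_mm *mm)
{
    mm->bd_addr = DEMO_INVALID_BD_ADDR;
}

/* Management is on exactly when a valid base is cached. */
static bool demo_mpx_managed(const struct demo_mpx_mm *mm)
{
    return mm->bd_addr != DEMO_INVALID_BD_ADDR;
}
```

    The benefit is that the "is management enabled?" question becomes a
    plain memory read instead of a register save.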

    Also, the only way to access that bndcfgu register is via an xsaves,
    which can be expensive. Caching "bd_addr" like this also helps reduce
    the cost of those xsaves when doing table cleanup at munmap() time.
    Unfortunately, we can not apply this optimization to #BR fault time
    because we need an xsave to get the value of BNDSTATUS.

    ==== Why does the hardware even have these Bounds Tables? ====

    MPX only has 4 hardware registers for storing bounds information.
    If MPX-enabled code needs more than these 4 registers, it needs to
    spill them somewhere. It has two special instructions for this
    which allow the bounds to be moved between the bounds registers
    and some new "bounds tables".

    #BR exceptions are similar conceptually to a page fault and will be
    raised by the MPX hardware on bounds violations or when the tables
    are not present. This patch handles those #BR exceptions for
    not-present tables by carving the space out of the normal process's
    address space (essentially calling the new mmap() interface introduced
    earlier in this patch set) and then pointing the bounds-directory
    over to it.

    The tables *need* to be accessed and controlled by userspace because
    the instructions for moving bounds in and out of them are extremely
    frequent. They potentially happen every time a register pointing to
    memory is dereferenced. Any direct kernel involvement (like a syscall)
    to access the tables would obviously destroy performance.

    ==== Why not do this in userspace? ====

    This patch is obviously doing this allocation in the kernel.
    However, MPX does not strictly *require* anything in the kernel.
    It can theoretically be done completely from userspace. Here are
    a few ways this *could* be done. I don't think any of them are
    practical in the real-world, but here they are.

    Q: Can virtual space simply be reserved for the bounds tables so
    that we never have to allocate them?
    A: As noted earlier, these tables are *HUGE*. An X-GB virtual
    area needs 4*X GB of virtual space, plus 2GB for the bounds
    directory. If we were to preallocate them for the 128TB of
    user virtual address space, we would need to reserve 512TB+2GB,
    which is larger than the entire virtual address space today.
    This means they can not be reserved ahead of time. Also, a
    single process's pre-populated bounds directory consumes 2GB
    of virtual *AND* physical memory. IOW, it's completely
    infeasible to prepopulate bounds directories.

    Q: Can we preallocate bounds table space at the same time memory
    is allocated which might contain pointers that might eventually
    need bounds tables?
    A: This would work if we could hook the site of each and every
    memory allocation syscall. This can be done for small,
    constrained applications. But, it isn't practical at a larger
    scale since a given app has no way of controlling how all the
    parts of the app might allocate memory (think libraries). The
    kernel is really the only place to intercept these calls.

    Q: Could a bounds fault be handed to userspace and the tables
    allocated there in a signal handler instead of in the kernel?
    A: (thanks to tglx) mmap() is not on the list of safe async
    handler functions and even if mmap() would work it still
    requires locking or nasty tricks to keep track of the
    allocation state there.

    Having ruled out all of the userspace-only approaches for managing
    bounds tables that we could think of, we create them on demand in
    the kernel.
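
    The sizing argument from the first answer (4*X GB of tables plus 2GB
    for the directory) is easy to check with a few lines of arithmetic.
    The macro and function names below are invented for the example.

```c
#include <stdint.h>

#define DEMO_GIB (1ULL << 30)
#define DEMO_TIB (1ULL << 40)

/* Virtual space needed to pre-reserve bounds tables covering
 * covered_bytes of address space: four times the covered area
 * for the tables, plus 2 GB for the bounds directory. */
static uint64_t demo_bounds_reservation(uint64_t covered_bytes)
{
    return 4 * covered_bytes + 2 * DEMO_GIB;
}
```

    For the 128TB user address space this gives 512TB + 2GB, matching the
    figure quoted above.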

    Based-on-patch-by: Qiaowei Ren
    Signed-off-by: Dave Hansen
    Cc: linux-mm@kvack.org
    Cc: linux-mips@linux-mips.org
    Cc: Dave Hansen
    Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     

13 Oct, 2014

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
    Hansen)

    - Various sched/idle refinements for better idle handling (Nicolas
    Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)

    - sched/numa updates and optimizations (Rik van Riel)

    - sysbench speedup (Vincent Guittot)

    - capacity calculation cleanups/refactoring (Vincent Guittot)

    - Various cleanups to thread group iteration (Oleg Nesterov)

    - Double-rq-lock removal optimization and various refactorings
    (Kirill Tkhai)

    - various sched/deadline fixes

    ... and lots of other changes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
    sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
    sched/fair: Delete resched_cpu() from idle_balance()
    sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
    sched: Improve sysbench performance by fixing spurious active migration
    sched/x86: Fix up typo in topology detection
    x86, sched: Add new topology for multi-NUMA-node CPUs
    sched/rt: Use resched_curr() in task_tick_rt()
    sched: Use rq->rd in sched_setaffinity() under RCU read lock
    sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
    sched: Use dl_bw_of() under RCU read lock
    sched/fair: Remove duplicate code from can_migrate_task()
    sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
    sched: print_rq(): Don't use tasklist_lock
    sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
    sched: Fix the task-group check in tg_has_rt_tasks()
    sched/fair: Leverage the idle state info when choosing the "idlest" cpu
    sched: Let the scheduler see CPU idle states
    sched/deadline: Fix inter- exclusive cpusets migrations
    sched/deadline: Clear dl_entity params when setscheduling to different class
    sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
    ...

    Linus Torvalds
     

10 Oct, 2014

6 commits

  • Fix undefined behavior and compiler warning by replacing right shift 32
    with upper_32_bits macro
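
    The reason a plain ">> 32" is a problem: shifting a 32-bit value by 32
    is undefined in C. A macro of the following shape avoids that by
    shifting twice by 16. This is a userspace sketch named
    demo_upper_32_bits to make clear it is not the kernel header itself.

```c
#include <stdint.h>

/* Two 16-bit shifts are well defined for both 32-bit and 64-bit operands,
 * whereas a single shift by 32 is undefined when the operand is 32 bits wide. */
#define demo_upper_32_bits(n) ((uint32_t)(((n) >> 16) >> 16))
```

    For a 64-bit operand the result is the high word; for a 32-bit operand
    it is simply 0.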

    Signed-off-by: Scotty Bauer
    Cc: Clemens Ladisch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Scotty Bauer
     
  • Fix minor errors and warning messages in kernel/sys.c. These errors were
    reported by checkpatch while working with some modifications in sys.c
    file. Fixing this first will help me to improve my further patches.

    ERROR: trailing whitespace - 9
    ERROR: do not use assignment in if condition - 4
    ERROR: spaces required around that '?' (ctx:VxO) - 10
    ERROR: switch and case should be at the same indent - 3

    total 26 errors & 3 warnings fixed.

    Signed-off-by: vishnu.ps
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    vishnu.ps
     
  • Dump the contents of the relevant mm_struct when we hit the bug condition.

    Signed-off-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • During development of c/r we noticed that supporting user namespaces
    runs into a problem with capabilities in the prctl(PR_SET_MM, ...)
    call: once a new user namespace is created,
    capable(CAP_SYS_RESOURCE) no longer passes.

    One approach is to eliminate the CAP_SYS_RESOURCE check but pass all new
    values in one bundle, which allows the kernel to test the values for
    sanity more thoroughly and at the same time allows us to support
    checkpoint/restore of user namespaces.

    Thus a new command, PR_SET_MM_MAP, is introduced. It takes a pointer to a
    prctl_mm_map structure which carries all the members to be updated.

    prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)

    struct prctl_mm_map {
            __u64 start_code;
            __u64 end_code;
            __u64 start_data;
            __u64 end_data;
            __u64 start_brk;
            __u64 brk;
            __u64 start_stack;
            __u64 arg_start;
            __u64 arg_end;
            __u64 env_start;
            __u64 env_end;
            __u64 *auxv;
            __u32 auxv_size;
            __u32 exe_fd;
    };

    All members except @exe_fd correspond to members of struct mm_struct. To
    show which values these members may take, here are the meanings of the
    members.

    - start_code, end_code: represent bounds of executable code area
    - start_data, end_data: represent bounds of data area
    - start_brk, brk: used to calculate bounds for brk() syscall
    - start_stack: used when accounting space needed for command
      line arguments, environment and shmat() syscall
    - arg_start, arg_end, env_start, env_end: represent memory area
      supplied for command line arguments and environment variables
    - auxv, auxv_size: carries auxiliary vector, Elf format specifics
    - exe_fd: file descriptor number for executable link (/proc/self/exe)

    Thus we apply the following requirements to the values

    1) Any member except @auxv, @auxv_size, @exe_fd is an address
    in user space, thus it must lie inside the [mmap_min_addr, mmap_max_addr)
    interval.

    2) While @[start|end]_code and @[start|end]_data may point to nonexistent
    VMAs (say a program maps its own new .text and .data segments during
    execution), the rest of the members should belong to a VMA which must exist.

    3) Addresses must be ordered, i.e. a @start_ member must not be greater
    than or equal to the corresponding @end_ member.

    4) As in the regular Elf loading procedure we require that @start_brk and
    @brk be greater than @end_data.

    5) If the RLIMIT_DATA rlimit is set to non-infinity, new values should not
    exceed the existing limit. The same applies to RLIMIT_STACK.

    6) Auxiliary vector size must not exceed the existing one (which is
    predefined as AT_VECTOR_SIZE and depends on architecture).

    7) The file descriptor passed in @exe_fd should point
    to an executable file (because we use the existing
    prctl_set_mm_exe_file_locked helper, it ensures that the file we are
    going to use as exe link has all required permissions granted).
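
    Requirements 1, 3 and 4 are pure comparisons and can be sketched in
    isolation. The struct below mirrors only the address members, and
    demo_validate_map() with its min_addr/max_addr parameters is a
    hypothetical userspace stand-in, not the kernel's validation routine.
    The VMA, rlimit, auxv and exe_fd checks are left out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Address members only; auxv, auxv_size and exe_fd are omitted here. */
struct demo_mm_map {
    uint64_t start_code, end_code, start_data, end_data;
    uint64_t start_brk, brk, start_stack;
    uint64_t arg_start, arg_end, env_start, env_end;
};

static int demo_validate_map(const struct demo_mm_map *m,
                             uint64_t min_addr, uint64_t max_addr)
{
    const uint64_t addrs[] = {
        m->start_code, m->end_code, m->start_data, m->end_data,
        m->start_brk, m->brk, m->start_stack,
        m->arg_start, m->arg_end, m->env_start, m->env_end,
    };
    size_t i;

    /* Requirement 1: every address lies in [min_addr, max_addr). */
    for (i = 0; i < sizeof(addrs) / sizeof(addrs[0]); i++)
        if (addrs[i] < min_addr || addrs[i] >= max_addr)
            return -EINVAL;

    /* Requirement 3: each start is strictly below its matching end. */
    if (m->start_code >= m->end_code || m->start_data >= m->end_data ||
        m->arg_start >= m->arg_end || m->env_start >= m->env_end)
        return -EINVAL;

    /* Requirement 4: the brk area starts after the end of the data area. */
    if (m->start_brk <= m->end_data || m->brk <= m->end_data)
        return -EINVAL;

    return 0;
}
```

    Checking the whole bundle at once is what makes the ordering rule
    enforceable, since both ends of every pair are known at the same time.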

    Now about where these members are involved inside kernel code:

    - @start_code and @end_code are used in /proc/$pid/[stat|statm] output;

    - @start_data and @end_data are used in /proc/$pid/[stat|statm] output,
    also they are considered when checking whether there is enough space
    for the brk() syscall result if RLIMIT_DATA is set;

    - @start_brk is shown in /proc/$pid/stat output and accounted in the brk()
    syscall if RLIMIT_DATA is set; also this member is tested to
    find a symbolic name of an mmap event for the perf system (we choose
    if the event is generated for the "heap" area); one more application is
    selinux -- we test if a process has PROCESS__EXECHEAP permission
    when it tries to make the heap area executable with the mprotect() syscall;

    - @brk is the current value for the brk() syscall, which lies inside the
    heap area; it's shown in /proc/$pid/stat. When the brk() syscall
    successfully provides a new memory area to user space, mm::brk is
    updated to carry the new value upon brk() completion;

    Both @start_brk and @brk are actively used in /proc/$pid/maps
    and /proc/$pid/smaps output to find a symbolic name "heap" for
    VMA being scanned;

    - @start_stack is printed out in /proc/$pid/stat and used to
    find a symbolic name "stack" for task and threads in
    /proc/$pid/maps and /proc/$pid/smaps output, and, the same
    as with @start_brk, the perf system uses it for event naming.
    The kernel also treats this member as the start address of where
    to map vDSO pages and uses it to check if there is enough space
    for the shmat() syscall;

    - @arg_start, @arg_end, @env_start and @env_end are printed out
    in /proc/$pid/stat. Another access to the data these members
    represent is to read /proc/$pid/environ or /proc/$pid/cmdline.
    Any attempt to read these areas kernel tests with access_process_vm
    helper so a user must have enough rights for this action;

    - @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly
    speaking kernel doesn't care much about which exactly data is
    sitting there because it is solely for userspace;

    - @exe_fd is referred from /proc/$pid/exe and when generating
    coredump. We uses prctl_set_mm_exe_file_locked helper to update
    this member, so exe-file link modification remains one-shot
    action.

    Note that updating the exe-file link no longer requires the sys-resource
    capability; after all, there is not much profit in preventing a process
    from setting up its own file link (there are a number of ways to execute
    one's own code -- ptrace, ld-preload -- so the only reliable way to find
    out exactly which code is executed is to inspect the running program's
    memory). Still, we require the caller to be at least the user-namespace
    root user.

    I believe the old interface should be deprecated and ripped out in a
    couple of kernel releases if no one objects.

    To test whether the new interface is implemented in the kernel, one can
    pass the PR_SET_MM_MAP_SIZE opcode, and the kernel returns the size of
    the currently supported struct prctl_mm_map.
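
    A minimal userspace sketch of that probe follows. The function name
    probe_mm_map_size() is made up for illustration, and the fallback
    #defines are only there for older userspace headers; the sketch assumes
    the kernel writes the size into the unsigned int passed as the third
    prctl() argument.

```c
#include <sys/prctl.h>

/* Fallbacks for older userspace headers; values taken from the kernel UAPI. */
#ifndef PR_SET_MM
#define PR_SET_MM 35
#endif
#ifndef PR_SET_MM_MAP_SIZE
#define PR_SET_MM_MAP_SIZE 15
#endif

/*
 * Hypothetical helper: returns the size of struct prctl_mm_map as reported
 * by the running kernel, or -1 if the interface is not available.
 */
static long probe_mm_map_size(void)
{
    unsigned int size = 0;

    if (prctl(PR_SET_MM, PR_SET_MM_MAP_SIZE,
              (unsigned long)&size, 0UL, 0UL) != 0)
        return -1;

    return (long)size;
}
```

    A return value of -1 would suggest falling back to the older
    per-field PR_SET_MM opcodes.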

    [akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions]
    Signed-off-by: Cyrill Gorcunov
    Cc: Kees Cook
    Cc: Tejun Heo
    Acked-by: Andrew Vagin
    Tested-by: Andrew Vagin
    Cc: Eric W. Biederman
    Cc: H. Peter Anvin
    Acked-by: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: Vasiliy Kulikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michael Kerrisk
    Cc: Julien Tinnes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Instead of taking mm->mmap_sem inside prctl_set_mm_exe_file(), move it
    out and rename the helper to prctl_set_mm_exe_file_locked(). This allows
    the function to be reused in the next patch.
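
    The refactoring pattern can be sketched in userspace C with a pthread
    mutex standing in for mm->mmap_sem. All names here (struct demo_mm and
    the demo_* functions) are hypothetical and only illustrate why a
    "_locked" helper is easier to reuse.

```c
#include <pthread.h>

/* Hypothetical stand-in for the per-process memory descriptor. */
struct demo_mm {
    pthread_mutex_t lock;   /* plays the role of mm->mmap_sem */
    int exe_fd;
    unsigned long brk;
};

/* Caller must already hold m->lock; the helper itself never locks. */
static void demo_set_exe_fd_locked(struct demo_mm *m, int fd)
{
    m->exe_fd = fd;
}

/* Single-field update: take the lock around the helper. */
static void demo_set_exe_fd(struct demo_mm *m, int fd)
{
    pthread_mutex_lock(&m->lock);
    demo_set_exe_fd_locked(m, fd);
    pthread_mutex_unlock(&m->lock);
}

/* Batch update: several fields change under one lock acquisition. */
static void demo_set_map(struct demo_mm *m, int fd, unsigned long brk)
{
    pthread_mutex_lock(&m->lock);
    m->brk = brk;
    demo_set_exe_fd_locked(m, fd);
    pthread_mutex_unlock(&m->lock);
}
```

    Had the helper taken the lock itself, the batch path would have needed
    to drop and retake the lock or duplicate the helper's body.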

    Signed-off-by: Cyrill Gorcunov
    Cc: Kees Cook
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Eric W. Biederman
    Cc: H. Peter Anvin
    Acked-by: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: Vasiliy Kulikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michael Kerrisk
    Cc: Julien Tinnes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Signed-off-by: Cyrill Gorcunov
    Cc: Kees Cook
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Eric W. Biederman
    Cc: H. Peter Anvin
    Acked-by: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: Vasiliy Kulikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michael Kerrisk
    Cc: Julien Tinnes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

08 Sep, 2014

1 commit

  • Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
    issues on large systems, due to both functions being serialized with a
    lock.

    The lock protects against reporting a wrong value, due to a thread in the
    task group exiting, its statistics reporting up to the signal struct, and
    that exited task's statistics being counted twice (or not at all).

    Protecting that with a lock results in times() and clock_gettime() being
    completely serialized on large systems.

    This can be fixed by using a seqlock around the events that gather and
    propagate statistics. As an additional benefit, the protection code can
    be moved into thread_group_cputime(), slightly simplifying the calling
    functions.

    In the case of posix_cpu_clock_get_task() things can be simplified a
    lot, because the calling function already ensures that the task sticks
    around, and the rest is now taken care of in thread_group_cputime().

    This way the statistics reporting code can run lockless.
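
    The read/retry idea behind a seqlock can be sketched in userspace C11
    as below. This is not the kernel's seqlock implementation: struct
    group_stats, stats_add() and stats_read() are hypothetical names,
    writers are assumed to be serialized by some external lock, and
    sequentially consistent atomics are used for simplicity rather than
    speed.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the group-wide CPU time totals. */
struct group_stats {
    atomic_uint seq;                    /* even: stable, odd: update in progress */
    _Atomic unsigned long long utime;
    _Atomic unsigned long long stime;
};

/* Writer side: callers are assumed to be serialized against each other. */
static void stats_add(struct group_stats *s,
                      unsigned long long ut, unsigned long long st)
{
    atomic_fetch_add(&s->seq, 1);       /* sequence becomes odd */
    atomic_fetch_add(&s->utime, ut);
    atomic_fetch_add(&s->stime, st);
    atomic_fetch_add(&s->seq, 1);       /* sequence becomes even again */
}

/* Reader side: takes no lock, retries if a writer was active meanwhile. */
static void stats_read(struct group_stats *s,
                       unsigned long long *ut, unsigned long long *st)
{
    unsigned int begin, end;

    do {
        begin = atomic_load(&s->seq);
        *ut = atomic_load(&s->utime);
        *st = atomic_load(&s->stime);
        end = atomic_load(&s->seq);
    } while ((begin & 1u) || begin != end);
}
```

    Readers never block each other, which is the property that removes the
    serialization described above; the cost is an occasional retry when an
    update overlaps a read.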

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alex Thorlton
    Cc: Andrew Morton
    Cc: Daeseok Youn
    Cc: David Rientjes
    Cc: Dongsheng Yang
    Cc: Geert Uytterhoeven
    Cc: Guillaume Morin
    Cc: Ionut Alexa
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Li Zefan
    Cc: Michal Hocko
    Cc: Michal Schmidt
    Cc: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: umgwanakikbuti@gmail.com
    Cc: fweisbec@gmail.com
    Cc: srao@redhat.com
    Cc: lwoodman@redhat.com
    Cc: atheurer@redhat.com
    Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

19 Jul, 2014

1 commit

  • Since seccomp transitions between threads require updates to the
    no_new_privs flag to be atomic, the flag must be part of an atomic flag
    set. This moves the nnp flag into a separate task field, and introduces
    accessors.
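
    As a rough userspace analogue, an atomic flag word with accessors might
    look like the sketch below. The names struct demo_task, DEMO_FLAG_* and
    the demo_* accessors are hypothetical and do not reproduce the kernel's
    definitions; C11 atomics stand in for the kernel's atomic bit
    operations.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits for illustration only. */
#define DEMO_FLAG_NO_NEW_PRIVS (1UL << 0)
#define DEMO_FLAG_OTHER        (1UL << 1)

/* Hypothetical task with a dedicated word for atomically updated flags. */
struct demo_task {
    atomic_ulong atomic_flags;
};

/* Accessor: test the flag without taking any lock. */
static bool demo_no_new_privs(struct demo_task *t)
{
    return (atomic_load(&t->atomic_flags) & DEMO_FLAG_NO_NEW_PRIVS) != 0;
}

/* Accessor: set the flag atomically, leaving the other bits untouched. */
static void demo_set_no_new_privs(struct demo_task *t)
{
    atomic_fetch_or(&t->atomic_flags, DEMO_FLAG_NO_NEW_PRIVS);
}
```

    Keeping the flag in its own atomic word means a concurrent update of a
    neighbouring bit cannot be lost, which a plain bitfield would not
    guarantee.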

    Signed-off-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Andy Lutomirski

    Kees Cook
     

22 May, 2014

1 commit


08 Apr, 2014

1 commit

  • Add VM_INIT_DEF_MASK, to allow us to set the default flags for VMs. It
    also adds a prctl control which allows us to set the THP disable bit in
    mm->def_flags so that VMs will pick up the setting as they are created.
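
    From userspace, the new control might be exercised roughly as in the
    sketch below. The helper name thp_disable_self() is made up for
    illustration, and the fallback #defines are only there for older
    userspace headers.

```c
#include <sys/prctl.h>

/* Fallbacks for older userspace headers; values taken from the kernel UAPI. */
#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif
#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE 42
#endif

/*
 * Hypothetical helper: sets the THP disable flag for the calling process
 * and returns the value read back (expected to be 1), or -1 on failure.
 */
static int thp_disable_self(void)
{
    if (prctl(PR_SET_THP_DISABLE, 1UL, 0UL, 0UL, 0UL) != 0)
        return -1;

    return prctl(PR_GET_THP_DISABLE, 0UL, 0UL, 0UL, 0UL);
}
```

    Because the flag lives in mm->def_flags, it is meant to be set early,
    before the mappings that should avoid transparent huge pages are
    created.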

    Signed-off-by: Alex Thorlton
    Suggested-by: Oleg Nesterov
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Christian Borntraeger
    Cc: Paolo Bonzini
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Alexander Viro
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Thorlton
     

23 Feb, 2014

1 commit

  • Signed-off-by: Dongsheng Yang
    Signed-off-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Oleg Nesterov
    Cc: Robin Holt
    Cc: Al Viro
    Cc: Kees Cook
    Cc: "Eric W. Biederman"
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/0261f094b836f1acbcdf52e7166487c0c77323c8.1392103744.git.yangds.fnst@cn.fujitsu.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Dongsheng Yang
     

24 Jan, 2014

2 commits


13 Nov, 2013

1 commit


31 Aug, 2013

1 commit


10 Jul, 2013

2 commits


04 Jul, 2013

3 commits


13 Jun, 2013

1 commit

  • We recently noticed that reboot of a 1024 cpu machine takes approx 16
    minutes of just stopping the cpus. The slowdown was tracked to commit
    f96972f2dc63 ("kernel/sys.c: call disable_nonboot_cpus() in
    kernel_restart()").

    The current implementation does all the work of hot removing the cpus
    before halting the system. We are switching to just migrating to the
    boot cpu and then continuing with shutdown/reboot.

    This also has the effect of not breaking x86's command line parameter
    for specifying the reboot cpu. Note, this code was shamelessly copied
    from arch/x86/kernel/reboot.c with bits removed pertaining to the
    reboot_cpu command line parameter.

    Signed-off-by: Robin Holt
    Tested-by: Shawn Guo
    Cc: "Srivatsa S. Bhat"
    Cc: H. Peter Anvin
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Russ Anderson
    Cc: Robin Holt
    Cc: Russell King
    Cc: Guan Xuetao
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

01 May, 2013

3 commits

  • Pull compat cleanup from Al Viro:
    "Mostly about syscall wrappers this time; there will be another pile
    with patches in the same general area from various people, but I'd
    rather push those after both that and vfs.git pile are in."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    syscalls.h: slightly reduce the jungles of macros
    get rid of union semop in sys_semctl(2) arguments
    make do_mremap() static
    sparc: no need to sign-extend in sync_file_range() wrapper
    ppc compat wrappers for add_key(2) and request_key(2) are pointless
    x86: trim sys_ia32.h
    x86: sys32_kill and sys32_mprotect are pointless
    get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
    merge compat sys_ipc instances
    consolidate compat lookup_dcookie()
    convert vmsplice to COMPAT_SYSCALL_DEFINE
    switch getrusage() to COMPAT_SYSCALL_DEFINE
    switch epoll_pwait to COMPAT_SYSCALL_DEFINE
    convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
    switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
    make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
    make HAVE_SYSCALL_WRAPPERS unconditional
    consolidate cond_syscall and SYSCALL_ALIAS declarations
    teach SYSCALL_DEFINE how to deal with long long/unsigned long long
    get rid of duplicate logics in __SC_....[1-6] definitions

    Linus Torvalds
     
  • The purpose of this patch is to allow privileged processes to set
    their own per-memory memory-region fields:

    start_code, end_code, start_data, end_data, start_brk, brk,
    start_stack, arg_start, arg_end, env_start, env_end.

    This functionality is needed by any application or package that needs to
    reconstruct Linux processes, that is, to start them in any way other than
    by means of an "execve()" from an executable file. This includes:

    1. Restoring processes from a checkpoint-file (by all potential
    user-level checkpointing packages, not only CRIU's).
    2. Restarting processes on another node after process migration.
    3. Starting duplicated copies of a running process (for reliability
    and high-availability).
    4. Starting a process from an executable format that is not supported
    by Linux, thus requiring a "manual execve" by a user-level utility.
    5. Similarly, starting a process from a networked and/or encrypted
    executable that, for confidentiality, licensing or other reasons,
    may not be written to the local file-systems.

    The code that does that was already included in the Linux kernel by the
    CRIU group, in the form of "prctl(PR_SET_MM)", but prior to this was
    enclosed within their private "#ifdef CONFIG_CHECKPOINT_RESTORE", which is
    normally disabled. The patch removes those ifdefs.

    Signed-off-by: Amnon Shiloh
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amnon Shiloh
     
  • Andrew Morton noted:

    akpm3:/usr/src/25> grep SYSCALL kernel/timer.c
    SYSCALL_DEFINE1(alarm, unsigned int, seconds)
    SYSCALL_DEFINE0(getpid)
    SYSCALL_DEFINE0(getppid)
    SYSCALL_DEFINE0(getuid)
    SYSCALL_DEFINE0(geteuid)
    SYSCALL_DEFINE0(getgid)
    SYSCALL_DEFINE0(getegid)
    SYSCALL_DEFINE0(gettid)
    SYSCALL_DEFINE1(sysinfo, struct sysinfo __user *, info)
    COMPAT_SYSCALL_DEFINE1(sysinfo, struct compat_sysinfo __user *, info)

    Only one of those should be in kernel/timer.c. Who wrote this thing?

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Stephen Rothwell
    Acked-by: Thomas Gleixner
    Cc: Guenter Roeck
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     

09 Apr, 2013

1 commit

  • As commit 40dc166c (PM / Core: Introduce struct syscore_ops for core
    subsystems PM) says, syscore_ops operations should be carried out with
    one CPU on-line and interrupts disabled. However, after commit f96972f2d
    (kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()),
    syscore_shutdown() is called before disable_nonboot_cpus(), which breaks
    that rule. We have a MIPS machine with an 8259A PIC, and there is an
    external timer (HPET) linked to the 8259A. Since the 8259A has been shut
    down too early (by syscore_shutdown()), disable_nonboot_cpus() runs
    without a timer interrupt, so it hangs and the reboot fails. This patch
    calls syscore_shutdown() a little later (after disable_nonboot_cpus()) to
    avoid the reboot failure; this is the same way poweroff does it.

    For consistency, add disable_nonboot_cpus() to kernel_halt().

    Signed-off-by: Huacai Chen
    Cc:
    Signed-off-by: Rafael J. Wysocki

    Huacai Chen