20 Nov, 2015

5 commits

  • Userspace processes often have multiple allocators that each do
    anonymous mmaps to get memory. When examining memory usage of
    individual processes or systems as a whole, it is useful to be
    able to break down the various heaps that were allocated by
    each layer and examine their size, RSS, and physical memory
    usage.

    This patch adds a user pointer to the shared union in
    vm_area_struct that points to a null terminated string inside
    the user process containing a name for the vma. vmas that
    point to the same address will be merged, but vmas that
    point to equivalent strings at different addresses will
    not be merged.

    Userspace can set the name for a region of memory by calling
    prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
    Setting the name to NULL clears it.

    The names of named anonymous vmas are shown in /proc/pid/maps
    as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
    that is only present for named vmas. If the userspace pointer
    is no longer valid, all or part of the name will be replaced
    with "<fault>".

    The idea to store a userspace pointer to reduce the complexity
    within mm (at the expense of the complexity of reading
    /proc/pid/mem) came from Dave Hansen. This results in no
    runtime overhead in the mm subsystem other than comparing
    the anon_name pointers when considering vma merging. The pointer
    is stored in a union with fields that are only used on file-backed
    mappings, so it does not increase memory usage.

    Includes a fix from Jed Davis for a typo in prctl_set_vma_anon_name
    which could attempt to set the name across two vmas at the same
    time and might corrupt the vma list; it now uses tmp instead of
    end to limit the name setting to a single vma at a time.

    Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f
    Signed-off-by: Dmitry Shmidt

    Colin Cross
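
    A minimal userspace sketch of the naming prctl described in the entry
    above. The PR_SET_VMA / PR_SET_VMA_ANON_NAME values below are the ones
    used by this patch series, but they are defined locally as an assumption
    in case the installed headers do not carry them yet.

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_VMA
    #define PR_SET_VMA            0x53564d41
    #define PR_SET_VMA_ANON_NAME  0
    #endif

    int main(void)
    {
            size_t len = 4096;
            static const char name[] = "my heap";
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;

            /* Name the region; it should show up as [anon:my heap] in maps. */
            if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                      (unsigned long)p, len, (unsigned long)name))
                    perror("prctl(PR_SET_VMA)");

            /* Passing NULL (0) as the name clears it again. */
            prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, (unsigned long)p, len, 0);

            munmap(p, len);
            return 0;
    }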
     
    Make PR_SET_TIMERSLACK_PID consider the pid namespace and resolve the
    target pid in the caller's namespace. Otherwise, calls from a pid
    namespace other than init would fail or affect the wrong task.

    Change-Id: I1da15196abc4096536713ce03714e99d2e63820a
    Signed-off-by: Micha Kalfon
    Acked-by: Oren Laadan

    Micha Kalfon
     
    The case clause for the PR_SET_TIMERSLACK_PID option was placed inside
    an internal switch statement for PR_MCE_KILL (see commits 37a591d4
    and 8ae872f1). This commit moves it to the right place.

    Change-Id: I63251669d7e2f2aa843d1b0900e7df61518c3dea
    Signed-off-by: Micha Kalfon
    Acked-by: Oren Laadan

    Micha Kalfon
     
    Adds a capable() check to make sure that arbitrary apps do not change
    the timer slack for other apps.

    Bug: 15000427
    Change-Id: I558a2551a0e3579c7f7e7aae54b28aa9d982b209
    Signed-off-by: Ruchi Kandoi

    Ruchi Kandoi
     
    The second argument is similar to PR_SET_TIMERSLACK: if non-zero, the
    slack is set to that value; otherwise it is set to the default for the
    thread.

    Takes PID of the thread as the third argument.

    This allows power/performance management software to set timer slack for
    other threads according to its policy for the thread (such as when the
    thread is designated foreground vs. background activity).

    Change-Id: I744d451ff4e60dae69f38f53948ff36c51c14a3f
    Signed-off-by: Ruchi Kandoi

    Ruchi Kandoi
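
    An illustrative sketch only: PR_SET_TIMERSLACK_PID is an Android-specific
    prctl whose numeric value has differed between kernel trees, so it is
    assumed here to come from matching kernel headers rather than being
    hard-coded.

    #include <sys/prctl.h>
    #include <sys/types.h>

    #ifdef PR_SET_TIMERSLACK_PID
    /* Per the entry above: arg2 is the slack in nanoseconds (0 restores the
     * thread's default), arg3 is the PID of the target thread. */
    static int set_timerslack_for(pid_t pid, unsigned long slack_ns)
    {
            return prctl(PR_SET_TIMERSLACK_PID, slack_ns, pid, 0, 0);
    }
    #endif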
     

17 Apr, 2015

1 commit

  • Oleg cleverly suggested using xchg() to set the new mm->exe_file instead
    of calling set_mm_exe_file() which requires some form of serialization --
    mmap_sem in this case. For archs that do not have atomic rmw instructions
    we still fallback to a spinlock alternative, so this should always be
    safe. As such, we only need the mmap_sem for looking up the backing
    vm_file, which can be done with the lock held shared. Naturally, this means we
    need to manually deal with both the new and old file reference counting,
    and we need not worry about the MMF_EXE_FILE_CHANGED bits, which can
    probably be deleted in the future anyway.

    Signed-off-by: Davidlohr Bueso
    Suggested-by: Oleg Nesterov
    Acked-by: Oleg Nesterov
    Reviewed-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
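
    A simplified sketch of the xchg()-based swap described above; the field
    and helper names follow the commit text but this is not the exact
    upstream code.

    #include <linux/atomic.h>
    #include <linux/file.h>
    #include <linux/fs.h>
    #include <linux/mm_types.h>

    static void replace_exe_file_sketch(struct mm_struct *mm,
                                        struct file *new_exe_file)
    {
            struct file *old_exe_file;

            if (new_exe_file)
                    get_file(new_exe_file);
            /* Atomically publish the new exe_file; no mmap_sem needed here. */
            old_exe_file = xchg(&mm->exe_file, new_exe_file);
            if (old_exe_file)
                    fput(old_exe_file);
    }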
     

16 Apr, 2015

1 commit

  • There are a lot of embedded systems that run most or all of their
    functionality in init, running as root:root. For these systems,
    supporting multiple users is not necessary.

    This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
    non-root users, non-root groups, and capabilities optional. It is enabled
    under the CONFIG_EXPERT menu.

    When this symbol is not defined, UID and GID are zero in any possible case
    and processes always have all capabilities.

    The following syscalls are compiled out: setuid, setregid, setgid,
    setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
    getgroups, setfsuid, setfsgid, capget, capset.

    Also, groups.c is compiled out completely.

    In kernel/capability.c, the capable() function was moved in order to
    avoid adding two ifdef blocks.

    This change saves about 25 KB on a defconfig build. The most minimal
    kernels have total text sizes in the high hundreds of kB rather than
    low MB. (The 25k goes down a bit with allnoconfig, but not that much.)

    The kernel was booted in Qemu. All the common functionalities work.
    Adding users/groups is not possible, failing with -ENOSYS.

    Bloat-o-meter output:
    add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Iulia Manda
    Reviewed-by: Josh Triplett
    Acked-by: Geert Uytterhoeven
    Tested-by: Paul E. McKenney
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Iulia Manda
     

01 Mar, 2015

1 commit

  • There's a uname workaround for broken userspace which can't handle kernel
    versions of 3.x. Update it for 4.x.

    Signed-off-by: Jon DeVree
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jon DeVree
     

22 Feb, 2015

1 commit

  • Pull MIPS updates from Ralf Baechle:
    "This is the main pull request for MIPS:

    - a number of fixes that didn't make the 3.19 release.

    - a number of cleanups.

    - preliminary support for Cavium's Octeon 3 SOCs which feature up to
    48 MIPS64 R3 cores with FPU and hardware virtualization.

    - support for MIPS R6 processors.

    Revision 6 of the MIPS architecture is a major revision which does
    away with many of the original sins of the architecture, such as
    branch delay slots. This and other changes in R6 require major
    changes throughout the entire MIPS core architecture code and make
    up the lion's share of this pull request.

    - finally some preparatory work for eXtended Physical Address
    support, which allows support of up to 40 bits of physical address
    space on 32-bit processors"

    [ Ahh, MIPS can't leave the PAE brain damage alone. It's like
    every CPU architect has to make that mistake, but pee in the snow
    by changing the TLA. But whether it's called PAE, LPAE or XPA,
    it's horrid crud - Linus ]

    * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (114 commits)
    MIPS: sead3: Corrected get_c0_perfcount_int
    MIPS: mm: Remove dead macro definitions
    MIPS: OCTEON: irq: add CIB and other fixes
    MIPS: OCTEON: Don't do acknowledge operations for level triggered irqs.
    MIPS: OCTEON: More OCTEONIII support
    MIPS: OCTEON: Remove setting of processor specific CVMCTL icache bits.
    MIPS: OCTEON: Core-15169 Workaround and general CVMSEG cleanup.
    MIPS: OCTEON: Update octeon-model.h code for new SoCs.
    MIPS: OCTEON: Implement DCache errata workaround for all CN6XXX
    MIPS: OCTEON: Add little-endian support to asm/octeon/octeon.h
    MIPS: OCTEON: Implement the core-16057 workaround
    MIPS: OCTEON: Delete unused COP2 saving code
    MIPS: OCTEON: Use correct instruction to read 64-bit COP0 register
    MIPS: OCTEON: Save and restore CP2 SHA3 state
    MIPS: OCTEON: Fix FP context save.
    MIPS: OCTEON: Save/Restore wider multiply registers in OCTEON III CPUs
    MIPS: boot: Provide more uImage options
    MIPS: Remove unneeded #ifdef __KERNEL__ from asm/processor.h
    MIPS: ip22-gio: Remove legacy suspend/resume support
    mips: pci: Add ifdef around pci_proc_domain
    ...

    Linus Torvalds
     

12 Feb, 2015

1 commit

  • Userland code may be built using an ABI which permits linking to objects
    that have more restrictive floating point requirements. For example,
    userland code may be built to target the O32 FPXX ABI. Such code may be
    linked with other FPXX code, or code built for either one of the more
    restrictive FP32 or FP64. When linking with more restrictive code, the
    overall requirement of the process becomes that of the more restrictive
    code. The kernel has no way to know in advance which mode the process
    will need to be executed in, and indeed it may need to change during
    execution. The dynamic loader is the only code which will know the
    overall required mode, and so it needs to have a means to instruct the
    kernel to switch the FP mode of the process.

    This patch introduces 2 new options to the prctl syscall which provide
    such a capability. The FP mode of the process is represented as a
    simple bitmask combining a number of mode bits mirroring those present
    in the hardware. Userland can either retrieve the current FP mode of
    the process:

    mode = prctl(PR_GET_FP_MODE);

    or modify the current FP mode of the process:

    err = prctl(PR_SET_FP_MODE, new_mode);

    Signed-off-by: Paul Burton
    Cc: Matthew Fortune
    Cc: Markos Chandras
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/8899/
    Signed-off-by: Ralf Baechle

    Paul Burton
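
    A userspace sketch of the prctl pair described above; the PR_* values
    and mode bits below match the definitions added to <linux/prctl.h>, but
    are provided locally in case older headers are installed.

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_FP_MODE
    #define PR_SET_FP_MODE  45
    #define PR_GET_FP_MODE  46
    #define PR_FP_MODE_FR   (1 << 0)   /* 64-bit FP registers */
    #define PR_FP_MODE_FRE  (1 << 1)   /* 32-bit compatibility mode */
    #endif

    int main(void)
    {
            int mode = prctl(PR_GET_FP_MODE, 0, 0, 0, 0);

            if (mode < 0) {
                    perror("PR_GET_FP_MODE");
                    return 1;
            }
            printf("current FP mode: %#x\n", mode);

            /* e.g. a dynamic loader switching the process to FR=1 */
            if (prctl(PR_SET_FP_MODE, mode | PR_FP_MODE_FR, 0, 0, 0))
                    perror("PR_SET_FP_MODE");
            return 0;
    }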
     

23 Jan, 2015

1 commit

  • Description from Michael Kerrisk. He suggested an identical patch
    to one I had already coded up and tested.

    commit fe3d197f8431 "x86, mpx: On-demand kernel allocation of bounds
    tables" added two new prctl() operations, PR_MPX_ENABLE_MANAGEMENT and
    PR_MPX_DISABLE_MANAGEMENT. However, no checks were included to ensure
    that unused arguments are zero, as is done in many existing prctl()s
    and as should be done for all new prctl()s. This patch adds the
    required checks.

    Suggested-by: Andy Lutomirski
    Suggested-by: Michael Kerrisk
    Signed-off-by: Dave Hansen
    Cc: Dave Hansen
    Link: http://lkml.kernel.org/r/20150108223022.7F56FD13@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
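
    The call shape implied by the entry above, with the unused arguments
    explicitly zeroed; the PR_MPX_* opcodes match <linux/prctl.h> but are
    defined locally in case older headers are installed.

    #include <sys/prctl.h>

    #ifndef PR_MPX_ENABLE_MANAGEMENT
    #define PR_MPX_ENABLE_MANAGEMENT  43
    #define PR_MPX_DISABLE_MANAGEMENT 44
    #endif

    static int mpx_enable_management(void)
    {
            /* With this patch, non-zero arg2..arg5 are rejected with -EINVAL. */
            return prctl(PR_MPX_ENABLE_MANAGEMENT, 0, 0, 0, 0);
    }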
     

18 Nov, 2014

1 commit

  • This is really the meat of the MPX patch set. If there is one patch to
    review in the entire series, this is the one. There is a new ABI here
    and this kernel code also interacts with userspace memory in a
    relatively unusual manner. (small FAQ below).

    Long Description:

    This patch adds two prctl() commands to enable or disable the
    management of bounds tables in the kernel, including on-demand kernel
    allocation (See the patch "on-demand kernel allocation of bounds tables")
    and cleanup (See the patch "cleanup unused bound tables"). Applications
    do not strictly need the kernel to manage bounds tables and we expect
    some applications to use MPX without taking advantage of this kernel
    support. This means the kernel can not simply infer whether an application
    needs bounds table management from the MPX registers. The prctl() is an
    explicit signal from userspace.

    PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
    request the kernel's help in managing bounds tables.

    PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace
    doesn't want the kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT,
    the kernel won't allocate or free bounds tables even if the CPU
    supports MPX.

    PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
    directory out of a userspace register (bndcfgu) and then cache it into
    a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT
    will set "bd_addr" to an invalid address. Using this scheme, we can
    use "bd_addr" to determine whether the management of bounds tables in
    kernel is enabled.

    Also, the only way to access that bndcfgu register is via an xsaves,
    which can be expensive. Caching "bd_addr" like this also helps reduce
    the cost of those xsaves when doing table cleanup at munmap() time.
    Unfortunately, we can not apply this optimization to #BR fault time
    because we need an xsave to get the value of BNDSTATUS.

    ==== Why does the hardware even have these Bounds Tables? ====

    MPX only has 4 hardware registers for storing bounds information.
    If MPX-enabled code needs more than these 4 registers, it needs to
    spill them somewhere. It has two special instructions for this
    which allow the bounds to be moved between the bounds registers
    and some new "bounds tables".

    The resulting #BR exceptions are similar conceptually to a page fault
    and will be raised by the MPX hardware both during bounds violations
    and when the tables are not present. This patch handles those #BR
    exceptions for not-present tables by carving the space out of the
    normal process's address space (essentially calling the new mmap()
    interface introduced earlier in this patch set) and then pointing the
    bounds directory over to it.

    The tables *need* to be accessed and controlled by userspace because
    the instructions for moving bounds in and out of them are extremely
    frequent. They potentially happen every time a register pointing to
    memory is dereferenced. Any direct kernel involvement (like a syscall)
    to access the tables would obviously destroy performance.

    ==== Why not do this in userspace? ====

    This patch is obviously doing this allocation in the kernel.
    However, MPX does not strictly *require* anything in the kernel.
    It can theoretically be done completely from userspace. Here are
    a few ways this *could* be done. I don't think any of them are
    practical in the real-world, but here they are.

    Q: Can virtual space simply be reserved for the bounds tables so
    that we never have to allocate them?
    A: As noted earlier, these tables are *HUGE*. An X-GB virtual
    area needs 4*X GB of virtual space, plus 2GB for the bounds
    directory. If we were to preallocate them for the 128TB of
    user virtual address space, we would need to reserve 512TB+2GB,
    which is larger than the entire virtual address space today.
    This means they can not be reserved ahead of time. Also, a
    single process's pre-populated bounds directory consumes 2GB
    of virtual *AND* physical memory. IOW, it's completely
    infeasible to prepopulate bounds directories.

    Q: Can we preallocate bounds table space at the same time memory
    is allocated which might contain pointers that might eventually
    need bounds tables?
    A: This would work if we could hook the site of each and every
    memory allocation syscall. This can be done for small,
    constrained applications. But, it isn't practical at a larger
    scale since a given app has no way of controlling how all the
    parts of the app might allocate memory (think libraries). The
    kernel is really the only place to intercept these calls.

    Q: Could a bounds fault be handed to userspace and the tables
    allocated there in a signal handler instead of in the kernel?
    A: (thanks to tglx) mmap() is not on the list of safe async
    handler functions and even if mmap() would work it still
    requires locking or nasty tricks to keep track of the
    allocation state there.

    Having ruled out all of the userspace-only approaches for managing
    bounds tables that we could think of, we create them on demand in
    the kernel.

    Based-on-patch-by: Qiaowei Ren
    Signed-off-by: Dave Hansen
    Cc: linux-mm@kvack.org
    Cc: linux-mips@linux-mips.org
    Cc: Dave Hansen
    Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     

13 Oct, 2014

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
    Hansen)

    - Various sched/idle refinements for better idle handling (Nicolas
    Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)

    - sched/numa updates and optimizations (Rik van Riel)

    - sysbench speedup (Vincent Guittot)

    - capacity calculation cleanups/refactoring (Vincent Guittot)

    - Various cleanups to thread group iteration (Oleg Nesterov)

    - Double-rq-lock removal optimization and various refactorings
    (Kirill Tkhai)

    - various sched/deadline fixes

    ... and lots of other changes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
    sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
    sched/fair: Delete resched_cpu() from idle_balance()
    sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
    sched: Improve sysbench performance by fixing spurious active migration
    sched/x86: Fix up typo in topology detection
    x86, sched: Add new topology for multi-NUMA-node CPUs
    sched/rt: Use resched_curr() in task_tick_rt()
    sched: Use rq->rd in sched_setaffinity() under RCU read lock
    sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
    sched: Use dl_bw_of() under RCU read lock
    sched/fair: Remove duplicate code from can_migrate_task()
    sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
    sched: print_rq(): Don't use tasklist_lock
    sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
    sched: Fix the task-group check in tg_has_rt_tasks()
    sched/fair: Leverage the idle state info when choosing the "idlest" cpu
    sched: Let the scheduler see CPU idle states
    sched/deadline: Fix inter- exclusive cpusets migrations
    sched/deadline: Clear dl_entity params when setscheduling to different class
    sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
    ...

    Linus Torvalds
     

10 Oct, 2014

6 commits

    Fix undefined behavior and a compiler warning by replacing a right
    shift by 32 with the upper_32_bits() macro.

    Signed-off-by: Scotty Bauer
    Cc: Clemens Ladisch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Scotty Bauer
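
    A sketch of the fix pattern described above: shifting a 32-bit quantity
    right by 32 is undefined behaviour, so the kernel's upper_32_bits()
    macro from <linux/kernel.h> is used to extract the high word instead.

    #include <linux/kernel.h>
    #include <linux/types.h>

    static inline u32 high_word(u64 val)
    {
            /* upper_32_bits() expands to ((u32)(((n) >> 16) >> 16)),
             * which avoids any shift-by-32 on a 32-bit type. */
            return upper_32_bits(val);
    }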
     
    Fix minor errors and warning messages in kernel/sys.c. These errors were
    reported by checkpatch while working on some modifications in the sys.c
    file. Fixing these first will help me improve my further patches.

    ERROR: trailing whitespace - 9
    ERROR: do not use assignment in if condition - 4
    ERROR: spaces required around that '?' (ctx:VxO) - 10
    ERROR: switch and case should be at the same indent - 3

    total 26 errors & 3 warnings fixed.

    Signed-off-by: vishnu.ps
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    vishnu.ps
     
    Dump the contents of the relevant mm_struct when we hit the bug condition.

    Signed-off-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
    During development of c/r we noticed that if we need to support
    user namespaces we face a problem with capabilities in the
    prctl(PR_SET_MM, ...) call; in particular, once a new user namespace
    is created, capable(CAP_SYS_RESOURCE) no longer passes.

    An approach is to eliminate the CAP_SYS_RESOURCE check but pass all new
    values in one bundle, which would allow the kernel to make a more
    intensive sanity test of the values and at the same time allow us to
    support checkpoint/restore of user namespaces.

    Thus a new command, PR_SET_MM_MAP, is introduced. It takes a pointer to
    a prctl_mm_map structure which carries all the members to be updated.

    prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)

    struct prctl_mm_map {
            __u64 start_code;
            __u64 end_code;
            __u64 start_data;
            __u64 end_data;
            __u64 start_brk;
            __u64 brk;
            __u64 start_stack;
            __u64 arg_start;
            __u64 arg_end;
            __u64 env_start;
            __u64 env_end;
            __u64 *auxv;
            __u32 auxv_size;
            __u32 exe_fd;
    };

    All members except @exe_fd correspond to ones of struct mm_struct. To
    figure out which values these members may take, here are their
    meanings.

    - start_code, end_code: represent bounds of executable code area
    - start_data, end_data: represent bounds of data area
    - start_brk, brk: used to calculate bounds for brk() syscall
    - start_stack: used when accounting space needed for command
    line arguments, environment and shmat() syscall
    - arg_start, arg_end, env_start, env_end: represent memory area
    supplied for command line arguments and environment variables
    - auxv, auxv_size: carries auxiliary vector, Elf format specifics
    - exe_fd: file descriptor number for executable link (/proc/self/exe)

    Thus we apply the following requirements to the values

    1) Any member except @auxv, @auxv_size and @exe_fd is an address in
    user space, thus it must lie inside the [mmap_min_addr, mmap_max_addr)
    interval.

    2) While @[start|end]_code and @[start|end]_data may point to nonexistent
    VMAs (say, a program maps its own new .text and .data segments during
    execution), the rest of the members should belong to a VMA which must exist.

    3) Addresses must be ordered, i.e. a @start_ member must not be greater
    than or equal to the corresponding @end_ member.

    4) As in the regular Elf loading procedure, we require that @start_brk
    and @brk be greater than @end_data.

    5) If the RLIMIT_DATA rlimit is not set to infinity, new values should
    not exceed the existing limit. The same applies to RLIMIT_STACK.

    6) The auxiliary vector size must not exceed the existing one (which is
    predefined as AT_VECTOR_SIZE and depends on the architecture).

    7) The file descriptor passed in @exe_fd should point to an executable
    file (because we use the existing prctl_set_mm_exe_file_locked helper,
    it is ensured that the file we are going to use as the exe link has all
    required permissions granted).

    Now about where these members are involved inside kernel code:

    - @start_code and @end_code are used in /proc/$pid/[stat|statm] output;

    - @start_data and @end_data are used in /proc/$pid/[stat|statm] output;
    they are also considered when checking whether there is enough space
    for the brk() syscall result if RLIMIT_DATA is set;

    - @start_brk is shown in /proc/$pid/stat output and accounted in the
    brk() syscall if RLIMIT_DATA is set; this member is also tested to
    find a symbolic name of an mmap event for the perf system (we choose
    whether the event is generated for the "heap" area); one more
    application is SELinux -- we test whether a process has the
    PROCESS__EXECHEAP permission when trying to make the heap area
    executable with the mprotect() syscall;

    - @brk is the current value for the brk() syscall, which lies inside
    the heap area and is shown in /proc/$pid/stat. When the brk() syscall
    successfully provides a new memory area to user space, mm::brk is
    updated upon completion to carry the new value;

    Both @start_brk and @brk are actively used in /proc/$pid/maps
    and /proc/$pid/smaps output to find the symbolic name "heap" for
    the VMA being scanned;

    - @start_stack is printed out in /proc/$pid/stat and used to
    find the symbolic name "stack" for the task and threads in
    /proc/$pid/maps and /proc/$pid/smaps output and, the same
    as with @start_brk, the perf system uses it for event naming.
    The kernel also treats this member as the start address of where
    to map vDSO pages and uses it to check whether there is enough
    space for the shmat() syscall;

    - @arg_start, @arg_end, @env_start and @env_end are printed out
    in /proc/$pid/stat. Another way to access the data these members
    represent is to read /proc/$pid/environ or /proc/$pid/cmdline.
    Any attempt to read these areas is tested by the kernel with the
    access_process_vm helper, so a user must have enough rights for
    this action;

    - @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly
    speaking, the kernel doesn't care much about exactly which data is
    sitting there because it is solely for userspace;

    - @exe_fd is referred to from /proc/$pid/exe and when generating a
    coredump. We use the prctl_set_mm_exe_file_locked helper to update
    this member, so exe-file link modification remains a one-shot
    action.

    Still, note that updating the exe-file link now doesn't require the
    sys-resource capability anymore; after all, there is not much profit in
    preventing setting up one's own file link (there are a number of ways to
    execute one's own code -- ptrace, ld-preload -- so the only reliable way
    to find out exactly which code is executed is to inspect the running
    program's memory). We still require the caller to be at least the
    user-namespace root user.

    I believe the old interface should be deprecated and ripped out in a
    couple of kernel releases if no one is against it.

    To test whether the new interface is implemented in the kernel, one can
    pass the PR_SET_MM_MAP_SIZE opcode and the kernel returns the size of
    the currently supported struct prctl_mm_map.

    [akpm@linux-foundation.org: fix 80-col wordwrap in macro definitions]
    Signed-off-by: Cyrill Gorcunov
    Cc: Kees Cook
    Cc: Tejun Heo
    Acked-by: Andrew Vagin
    Tested-by: Andrew Vagin
    Cc: Eric W. Biederman
    Cc: H. Peter Anvin
    Acked-by: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: Vasiliy Kulikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michael Kerrisk
    Cc: Julien Tinnes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
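
    A userspace sketch of the new interface described above; struct
    prctl_mm_map and the PR_SET_MM_MAP* opcodes come from <linux/prctl.h>
    once this patch is applied, and the error handling here is minimal.

    #include <stdio.h>
    #include <sys/prctl.h>
    #include <linux/prctl.h>

    static int apply_mm_map(struct prctl_mm_map *map)
    {
            unsigned int size;

            /* Probe whether the kernel implements the new opcode and agrees
             * on the structure size (the size is written to arg3). */
            if (prctl(PR_SET_MM, PR_SET_MM_MAP_SIZE, (unsigned long)&size, 0, 0)) {
                    perror("PR_SET_MM_MAP_SIZE");
                    return -1;
            }
            if (size != sizeof(*map))
                    fprintf(stderr, "kernel expects %u bytes\n", size);

            /* Apply all fields in one shot. */
            return prctl(PR_SET_MM, PR_SET_MM_MAP,
                         (unsigned long)map, sizeof(*map), 0);
    }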
     
    Instead of taking mm->mmap_sem inside prctl_set_mm_exe_file(), move it
    out and rename the helper to prctl_set_mm_exe_file_locked(). This will
    allow reusing this function in the next patch.

    Signed-off-by: Cyrill Gorcunov
    Cc: Kees Cook
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Eric W. Biederman
    Cc: H. Peter Anvin
    Acked-by: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: Vasiliy Kulikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michael Kerrisk
    Cc: Julien Tinnes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Signed-off-by: Cyrill Gorcunov
    Cc: Kees Cook
    Cc: Tejun Heo
    Cc: Andrew Vagin
    Cc: Eric W. Biederman
    Cc: H. Peter Anvin
    Acked-by: Serge Hallyn
    Cc: Pavel Emelyanov
    Cc: Vasiliy Kulikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michael Kerrisk
    Cc: Julien Tinnes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

08 Sep, 2014

1 commit

  • Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
    issues on large systems, due to both functions being serialized with a
    lock.

    The lock protects against reporting a wrong value, due to a thread in the
    task group exiting, its statistics reporting up to the signal struct, and
    that exited task's statistics being counted twice (or not at all).

    Protecting that with a lock results in times() and clock_gettime() being
    completely serialized on large systems.

    This can be fixed by using a seqlock around the events that gather and
    propagate statistics. As an additional benefit, the protection code can
    be moved into thread_group_cputime(), slightly simplifying the calling
    functions.

    In the case of posix_cpu_clock_get_task() things can be simplified a
    lot, because the calling function already ensures that the task sticks
    around, and the rest is now taken care of in thread_group_cputime().

    This way the statistics reporting code can run lockless.

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alex Thorlton
    Cc: Andrew Morton
    Cc: Daeseok Youn
    Cc: David Rientjes
    Cc: Dongsheng Yang
    Cc: Geert Uytterhoeven
    Cc: Guillaume Morin
    Cc: Ionut Alexa
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Li Zefan
    Cc: Michal Hocko
    Cc: Michal Schmidt
    Cc: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: umgwanakikbuti@gmail.com
    Cc: fweisbec@gmail.com
    Cc: srao@redhat.com
    Cc: lwoodman@redhat.com
    Cc: atheurer@redhat.com
    Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
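
    A generic sketch of the seqcount read pattern the entry above relies on
    (see include/linux/seqlock.h); the structure and field names here are
    illustrative, not the actual signal_struct layout.

    #include <linux/seqlock.h>
    #include <linux/types.h>

    struct cputime_stats {
            seqlock_t lock;          /* plays the role of sig->stats_lock */
            u64 utime, stime;
    };

    static void read_stats(struct cputime_stats *s, u64 *ut, u64 *st)
    {
            unsigned int seq;

            do {
                    seq = read_seqbegin(&s->lock);  /* lockless read side */
                    *ut = s->utime;
                    *st = s->stime;
            } while (read_seqretry(&s->lock, seq)); /* retry if a writer raced */
    }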
     

19 Jul, 2014

1 commit

    Since seccomp transitions between threads require updates to the
    no_new_privs flag to be atomic, the flag must be part of an atomic flag
    set. This moves the nnp flag into a separate task field, and introduces
    accessors.

    Signed-off-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Reviewed-by: Andy Lutomirski

    Kees Cook
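
    A sketch of the accessor pattern described above, assuming a PFA_* bit
    in a per-task atomic flags word; the names mirror the commit but are
    not guaranteed to match the final code exactly.

    #include <linux/bitops.h>
    #include <linux/sched.h>

    static inline bool task_no_new_privs_sketch(struct task_struct *p)
    {
            return test_bit(PFA_NO_NEW_PRIVS, &p->atomic_flags);
    }

    static inline void task_set_no_new_privs_sketch(struct task_struct *p)
    {
            set_bit(PFA_NO_NEW_PRIVS, &p->atomic_flags);
    }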
     

22 May, 2014

1 commit


08 Apr, 2014

1 commit

    Add VM_INIT_DEF_MASK, to allow us to set the default flags for VMAs.
    Also add a prctl control which allows us to set the THP disable bit in
    mm->def_flags so that VMAs will pick up the setting as they are created.

    Signed-off-by: Alex Thorlton
    Suggested-by: Oleg Nesterov
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Christian Borntraeger
    Cc: Paolo Bonzini
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Alexander Viro
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Thorlton
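
    A userspace view of the new control described above; the PR_*_THP_DISABLE
    opcodes match <linux/prctl.h> but are defined locally in case older
    headers are installed.

    #include <sys/prctl.h>

    #ifndef PR_SET_THP_DISABLE
    #define PR_SET_THP_DISABLE 41
    #define PR_GET_THP_DISABLE 42
    #endif

    static int disable_thp(void)
    {
            /* Sets the THP disable bit in mm->def_flags, so VMAs created
             * from now on avoid transparent huge pages. */
            return prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
    }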
     

23 Feb, 2014

1 commit

  • Signed-off-by: Dongsheng Yang
    Signed-off-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Oleg Nesterov
    Cc: Robin Holt
    Cc: Al Viro
    Cc: Kees Cook
    Cc: "Eric W. Biederman"
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/0261f094b836f1acbcdf52e7166487c0c77323c8.1392103744.git.yangds.fnst@cn.fujitsu.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Dongsheng Yang
     

24 Jan, 2014

2 commits


13 Nov, 2013

1 commit


31 Aug, 2013

1 commit


10 Jul, 2013

2 commits


04 Jul, 2013

3 commits


13 Jun, 2013

1 commit

  • We recently noticed that reboot of a 1024 cpu machine takes approx 16
    minutes of just stopping the cpus. The slowdown was tracked to commit
    f96972f2dc63 ("kernel/sys.c: call disable_nonboot_cpus() in
    kernel_restart()").

    The current implementation does all the work of hot removing the cpus
    before halting the system. We are switching to just migrating to the
    boot cpu and then continuing with shutdown/reboot.

    This also has the effect of not breaking x86's command line parameter
    for specifying the reboot cpu. Note, this code was shamelessly copied
    from arch/x86/kernel/reboot.c with bits removed pertaining to the
    reboot_cpu command line parameter.

    Signed-off-by: Robin Holt
    Tested-by: Shawn Guo
    Cc: "Srivatsa S. Bhat"
    Cc: H. Peter Anvin
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Russ Anderson
    Cc: Robin Holt
    Cc: Russell King
    Cc: Guan Xuetao
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

01 May, 2013

3 commits

  • Pull compat cleanup from Al Viro:
    "Mostly about syscall wrappers this time; there will be another pile
    with patches in the same general area from various people, but I'd
    rather push those after both that and vfs.git pile are in."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    syscalls.h: slightly reduce the jungles of macros
    get rid of union semop in sys_semctl(2) arguments
    make do_mremap() static
    sparc: no need to sign-extend in sync_file_range() wrapper
    ppc compat wrappers for add_key(2) and request_key(2) are pointless
    x86: trim sys_ia32.h
    x86: sys32_kill and sys32_mprotect are pointless
    get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
    merge compat sys_ipc instances
    consolidate compat lookup_dcookie()
    convert vmsplice to COMPAT_SYSCALL_DEFINE
    switch getrusage() to COMPAT_SYSCALL_DEFINE
    switch epoll_pwait to COMPAT_SYSCALL_DEFINE
    convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
    switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
    make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
    make HAVE_SYSCALL_WRAPPERS unconditional
    consolidate cond_syscall and SYSCALL_ALIAS declarations
    teach SYSCALL_DEFINE how to deal with long long/unsigned long long
    get rid of duplicate logics in __SC_....[1-6] definitions

    Linus Torvalds
     
    The purpose of this patch is to allow privileged processes to set
    their own per-mm memory-region fields:

    start_code, end_code, start_data, end_data, start_brk, brk,
    start_stack, arg_start, arg_end, env_start, env_end.

    This functionality is needed by any application or package that needs to
    reconstruct Linux processes, that is, to start them in any way other than
    by means of an "execve()" from an executable file. This includes:

    1. Restoring processes from a checkpoint-file (by all potential
    user-level checkpointing packages, not only CRIU's).
    2. Restarting processes on another node after process migration.
    3. Starting duplicated copies of a running process (for reliability
    and high-availability).
    4. Starting a process from an executable format that is not supported
    by Linux, thus requiring a "manual execve" by a user-level utility.
    5. Similarly, starting a process from a networked and/or encrypted
    executable that, for confidentiality, licensing or other reasons,
    may not be written to the local file-systems.

    The code that does that was already included in the Linux kernel by the
    CRIU group, in the form of "prctl(PR_SET_MM)", but prior to this it was
    enclosed within their private "#ifdef CONFIG_CHECKPOINT_RESTORE", which
    is normally disabled. This patch removes those ifdefs.

    Signed-off-by: Amnon Shiloh
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amnon Shiloh
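
    The shape of the existing interface this entry makes generally
    available; the opcodes are from <linux/prctl.h> and the addresses a
    restore tool would pass are illustrative.

    #include <sys/prctl.h>

    static int restore_brk_bounds(unsigned long start_brk, unsigned long brk)
    {
            if (prctl(PR_SET_MM, PR_SET_MM_START_BRK, start_brk, 0, 0))
                    return -1;
            return prctl(PR_SET_MM, PR_SET_MM_BRK, brk, 0, 0);
    }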
     
  • Andrew Morton noted:

    akpm3:/usr/src/25> grep SYSCALL kernel/timer.c
    SYSCALL_DEFINE1(alarm, unsigned int, seconds)
    SYSCALL_DEFINE0(getpid)
    SYSCALL_DEFINE0(getppid)
    SYSCALL_DEFINE0(getuid)
    SYSCALL_DEFINE0(geteuid)
    SYSCALL_DEFINE0(getgid)
    SYSCALL_DEFINE0(getegid)
    SYSCALL_DEFINE0(gettid)
    SYSCALL_DEFINE1(sysinfo, struct sysinfo __user *, info)
    COMPAT_SYSCALL_DEFINE1(sysinfo, struct compat_sysinfo __user *, info)

    Only one of those should be in kernel/timer.c. Who wrote this thing?

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Stephen Rothwell
    Acked-by: Thomas Gleixner
    Cc: Guenter Roeck
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     

09 Apr, 2013

1 commit

    As commit 40dc166c ("PM / Core: Introduce struct syscore_ops for core
    subsystems PM") says, syscore_ops operations should be carried out with
    one CPU on-line and interrupts disabled. However, after commit f96972f2d
    ("kernel/sys.c: call disable_nonboot_cpus() in kernel_restart()"),
    syscore_shutdown() is called before disable_nonboot_cpus(), which breaks
    the rules. We have a MIPS machine with an 8259A PIC, and there is an
    external timer (HPET) linked to the 8259A. Since the 8259A has been shut
    down too early (by syscore_shutdown()), disable_nonboot_cpus() runs
    without a timer interrupt, so it hangs and the reboot fails. This patch
    calls syscore_shutdown() a little later (after disable_nonboot_cpus())
    to avoid the reboot failure; this is the same way poweroff does it.

    For consistency, add disable_nonboot_cpus() to kernel_halt().

    Signed-off-by: Huacai Chen
    Cc:
    Signed-off-by: Rafael J. Wysocki

    Huacai Chen
     

23 Mar, 2013

1 commit

  • David said:

    Commit 6c0c0d4d1080 ("poweroff: fix bug in orderly_poweroff()")
    apparently fixes one bug in orderly_poweroff(), but introduces
    another. The comments on orderly_poweroff() claim it can be called
    from any context - and indeed we call it from interrupt context in
    arch/powerpc/platforms/pseries/ras.c for example. But since that
    commit this is no longer safe, since call_usermodehelper_fns() is not
    safe in interrupt context without the UMH_NO_WAIT option.

    orderly_poweroff() can be used from any context but UMH_WAIT_EXEC is
    sleepable. Move the "force" logic into __orderly_poweroff() and change
    orderly_poweroff() to use the global poweroff_work which simply calls
    __orderly_poweroff().

    While at it, remove the unneeded "int argc" and change argv_split() to
    use GFP_KERNEL.

    We use the global "bool poweroff_force" to pass the argument; this can
    obviously affect a previous request if it is pending/running. So we
    only allow the "false => true" transition, assuming that the pending
    "true" should succeed anyway. If schedule_work() fails after that, we
    know that work->func() was not called yet, so it must see the new value.

    This means that orderly_poweroff() becomes async even if we do not run
    the command and always succeeds, schedule_work() can only fail if the
    work is already pending. We can export __orderly_poweroff() and change
    the non-atomic callers which want the old semantics.

    Signed-off-by: Oleg Nesterov
    Reported-by: Benjamin Herrenschmidt
    Reported-by: David Gibson
    Cc: Lucas De Marchi
    Cc: Feng Hong
    Cc: Kees Cook
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
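
    A simplified, self-contained sketch of the pattern described above: the
    possibly-atomic caller only schedules a global work item, and the
    sleepable worker does the real job. Names follow the commit text but
    this is not the exact upstream code.

    #include <linux/printk.h>
    #include <linux/workqueue.h>

    static bool poweroff_force;

    static void poweroff_work_func(struct work_struct *work)
    {
            /* In kernel/sys.c this calls __orderly_poweroff(poweroff_force),
             * which may sleep while running the usermode poweroff helper. */
            pr_info("orderly poweroff, force=%d\n", poweroff_force);
    }

    static DECLARE_WORK(poweroff_work, poweroff_work_func);

    /* Callable from any context, including interrupts. */
    static void orderly_poweroff_sketch(bool force)
    {
            if (force)                      /* only the false => true transition */
                    poweroff_force = true;
            schedule_work(&poweroff_work);  /* no-op if already pending */
    }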
     

04 Mar, 2013

1 commit