30 Apr, 2013

2 commits

  • Add an admin_reserve_kbytes knob to allow admins to change the hardcoded
    memory reserve to something other than 3%, which may be multiple
    gigabytes on large memory systems. Only about 8MB is necessary to
    enable recovery in the default mode, and only a few hundred MB are
    required even when overcommit is disabled.

    This affects OVERCOMMIT_GUESS and OVERCOMMIT_NEVER.

    admin_reserve_kbytes is initialized to min(3% free pages, 8MB)

    I arrived at 8MB by summing the RSS of sshd or login, bash, and top.

    Please see first patch in this series for full background, motivation,
    testing, and full changelog.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_admin_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     
  • Add user_reserve_kbytes knob.

    Limit the growth of the memory reserved for other user processes to
    min(3% current process size, user_reserve_pages). Only about 8MB is
    necessary to enable recovery in the default mode, and only a few hundred
    MB are required even when overcommit is disabled.

    user_reserve_pages defaults to min(3% free pages, 128MB)

    I arrived at 128MB by taking the max VSZ of sshd, login, bash, and top ...
    then adding the RSS of each.

    This only affects OVERCOMMIT_NEVER mode.

    Background

    1. user reserve

    __vm_enough_memory reserves a hardcoded 3% of the current process size for
    other applications when overcommit is disabled. This was done so that a
    user could recover if they launched a memory hogging process. Without the
    reserve, a user would easily run into a message such as:

    bash: fork: Cannot allocate memory

    2. admin reserve

    Additionally, a hardcoded 3% of free memory is reserved for root in both
    overcommit 'guess' and 'never' modes. This was intended to prevent a
    scenario where root-cant-log-in and perform recovery operations.

    Note that this reserve shrinks, and doesn't guarantee a useful reserve.

    Motivation

    The two hardcoded memory reserves should be updated to account for current
    memory sizes.

    Also, the admin reserve would be more useful if it didn't shrink too much.

    When the current code was originally written, 1GB was considered
    "enterprise". Now the 3% reserve can grow to multiple GB on large memory
    systems, and it only needs to be a few hundred MB at most to enable a user
    or admin to recover a system with an unwanted memory hogging process.

    I've found that reducing these reserves is especially beneficial for a
    specific type of application load:

    * single application system
    * one or few processes (e.g. one per core)
    * allocating all available memory
    * not initializing every page immediately
    * long running

    I've run scientific clusters with this sort of load. A long running job
    sometimes failed many hours (weeks of CPU time) into a calculation. They
    weren't initializing all of their memory immediately, and they weren't
    using calloc, so I put systems into overcommit 'never' mode. These
    clusters run diskless and have no swap.

    However, with the current reserves, a user wishing to allocate as much
    memory as possible to one process may be prevented from using, for
    example, almost 2GB out of 32GB.

    The effect is less, but still significant when a user starts a job with
    one process per core. I have repeatedly seen a set of processes
    requesting the same amount of memory fail because one of them could not
    allocate the amount of memory a user would expect to be able to allocate.
    For example, Message Passing Interfce (MPI) processes, one per core. And
    it is similar for other parallel programming frameworks.

    Changing this reserve code will make the overcommit never mode more useful
    by allowing applications to allocate nearly all of the available memory.

    Also, the new admin_reserve_kbytes will be safer than the current behavior
    since the hardcoded 3% of available memory reserve can shrink to something
    useless in the case where applications have grabbed all available memory.

    Risks

    * "bash: fork: Cannot allocate memory"

    The downside of the first patch-- which creates a tunable user reserve
    that is only used in overcommit 'never' mode--is that an admin can set
    it so low that a user may not be able to kill their process, even if
    they already have a shell prompt.

    Of course, a user can get in the same predicament with the current 3%
    reserve--they just have to launch processes until 3% becomes negligible.

    * root-cant-log-in problem

    The second patch, adding the tunable rootuser_reserve_pages, allows
    the admin to shoot themselves in the foot by setting it too small. They
    can easily get the system into a state where root-can't-log-in.

    However, the new admin_reserve_kbytes will be safer than the current
    behavior since the hardcoded 3% of available memory reserve can shrink
    to something useless in the case where applications have grabbed all
    available memory.

    Alternatives

    * Memory cgroups provide a more flexible way to limit application memory.

    Not everyone wants to set up cgroups or deal with their overhead.

    * We could create a fourth overcommit mode which provides smaller reserves.

    The size of useful reserves may be drastically different depending
    on the whether the system is embedded or enterprise.

    * Force users to initialize all of their memory or use calloc.

    Some users don't want/expect the system to overcommit when they malloc.
    Overcommit 'never' mode is for this scenario, and it should work well.

    The new user and admin reserve tunables are simple to use, with low
    overhead compared to cgroups. The patches preserve current behavior where
    3% of memory is less than 128MB, except that the admin reserve doesn't
    shrink to an unusable size under pressure. The code allows admins to tune
    for embedded and enterprise usage.

    FAQ

    * How is the root-cant-login problem addressed?
    What happens if admin_reserve_pages is set to 0?

    Root is free to shoot themselves in the foot by setting
    admin_reserve_kbytes too low.

    On x86_64, the minimum useful reserve is:
    8MB for overcommit 'guess'
    128MB for overcommit 'never'

    admin_reserve_pages defaults to min(3% free memory, 8MB)

    So, anyone switching to 'never' mode needs to adjust
    admin_reserve_pages.

    * How do you calculate a minimum useful reserve?

    A user or the admin needs enough memory to login and perform
    recovery operations, which includes, at a minimum:

    sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

    For overcommit 'guess', we can sum resident set sizes (RSS)
    because we only need enough memory to handle what the recovery
    programs will typically use. On x86_64 this is about 8MB.

    For overcommit 'never', we can take the max of their virtual sizes (VSZ)
    and add the sum of their RSS. We use VSZ instead of RSS because mode
    forces us to ensure we can fulfill all of the requested memory allocations--
    even if the programs only use a fraction of what they ask for.
    On x86_64 this is about 128MB.

    When swap is enabled, reserves are useful even when they are as
    small as 10MB, regardless of overcommit mode.

    When both swap and overcommit are disabled, then the admin should
    tune the reserves higher to be absolutley safe. Over 230MB each
    was safest in my testing.

    * What happens if user_reserve_pages is set to 0?

    Note, this only affects overcomitt 'never' mode.

    Then a user will be able to allocate all available memory minus
    admin_reserve_kbytes.

    However, they will easily see a message such as:

    "bash: fork: Cannot allocate memory"

    And they won't be able to recover/kill their application.
    The admin should be able to recover the system if
    admin_reserve_kbytes is set appropriately.

    * What's the difference between overcommit 'guess' and 'never'?

    "Guess" allows an allocation if there are enough free + reclaimable
    pages. It has a hardcoded 3% of free pages reserved for root.

    "Never" allows an allocation if there is enough swap + a configurable
    percentage (default is 50) of physical RAM. It has a hardcoded 3% of
    free pages reserved for root, like "Guess" mode. It also has a
    hardcoded 3% of the current process size reserved for additional
    applications.

    * Why is overcommit 'guess' not suitable even when an app eventually
    writes to every page? It takes free pages, file pages, available
    swap pages, reclaimable slab pages into consideration. In other words,
    these are all pages available, then why isn't overcommit suitable?

    Because it only looks at the present state of the system. It
    does not take into account the memory that other applications have
    malloced, but haven't initialized yet. It overcommits the system.

    Test Summary

    There was little change in behavior in the default overcommit 'guess'
    mode with swap enabled before and after the patch. This was expected.

    Systems run most predictably (i.e. no oom kills) in overcommit 'never'
    mode with swap enabled. This also allowed the most memory to be allocated
    to a user application.

    Overcommit 'guess' mode without swap is a bad idea. It is easy to
    crash the system. None of the other tested combinations crashed.
    This matches my experience on the Roadrunner supercomputer.

    Without the tunable user reserve, a system in overcommit 'never' mode
    and without swap does not allow the admin to recover, although the
    admin can.

    With the new tunable reserves, a system in overcommit 'never' mode
    and without swap can be configured to:

    1. maximize user-allocatable memory, running close to the edge of
    recoverability

    2. maximize recoverability, sacrificing allocatable memory to
    ensure that a user cannot take down a system

    Test Description

    Fedora 18 VM - 4 x86_64 cores, 5725MB RAM, 4GB Swap

    System is booted into multiuser console mode, with unnecessary services
    turned off. Caches were dropped before each test.

    Hogs are user memtester processes that attempt to allocate all free memory
    as reported by /proc/meminfo

    In overcommit 'never' mode, memory_ratio=100

    Test Results

    3.9.0-rc1-mm1

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
    ---------- ---- ---- ------------- ---- ------------- --------------
    guess yes 1 5432/5432 no yes yes
    guess yes 4 5444/5444 1 yes yes
    guess no 1 5302/5449 no yes yes
    guess no 4 - crash no no

    never yes 1 5460/5460 1 yes yes
    never yes 4 5460/5460 1 yes yes
    never no 1 5218/5432 no no yes
    never no 4 5203/5448 no no yes

    3.9.0-rc1-mm1-tunablereserves

    User and Admin Recovery show their respective reserves, if applicable.

    Overcommit | Swap | Hogs | MB Got/Wanted | OOMs | User Recovery | Admin Recovery
    ---------- ---- ---- ------------- ---- ------------- --------------
    guess yes 1 5419/5419 no - yes 8MB yes
    guess yes 4 5436/5436 1 - yes 8MB yes
    guess no 1 5440/5440 * - yes 8MB yes
    guess no 4 - crash - no 8MB no

    * process would successfully mlock, then the oom killer would pick it

    never yes 1 5446/5446 no 10MB yes 20MB yes
    never yes 4 5456/5456 no 10MB yes 20MB yes
    never no 1 5387/5429 no 128MB no 8MB barely
    never no 1 5323/5428 no 226MB barely 8MB barely
    never no 1 5323/5428 no 226MB barely 8MB barely

    never no 1 5359/5448 no 10MB no 10MB barely

    never no 1 5323/5428 no 0MB no 10MB barely
    never no 1 5332/5428 no 0MB no 50MB yes
    never no 1 5293/5429 no 0MB no 90MB yes

    never no 1 5001/5427 no 230MB yes 338MB yes
    never no 4* 4998/5424 no 230MB yes 338MB yes

    * more memtesters were launched, able to allocate approximately another 100MB

    Future Work

    - Test larger memory systems.

    - Test an embedded image.

    - Test other architectures.

    - Time malloc microbenchmarks.

    - Would it be useful to be able to set overcommit policy for
    each memory cgroup?

    - Some lines are slightly above 80 chars.
    Perhaps define a macro to convert between pages and kb?
    Other places in the kernel do this.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: make init_user_reserve() static]
    Signed-off-by: Andrew Shewmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Shewmaker
     

05 Jan, 2013

2 commits

  • Signed-off-by: Carlos Alberto Lopez Perez
    Cc: Rob Landley
    Cc: Larry Finger
    Cc: Neil Horman
    Cc: Mitsuo Hayasaka
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Carlos Alberto Lopez Perez
     
  • Add 3 new variables and sysctls to tune them (by one "next_id" variable
    for messages, semaphores and shared memory respectively). This variable
    can be used to set desired id for next allocated IPC object. By default
    it's equal to -1 and old behaviour is preserved. If this variable is
    non-negative, then desired idr will be extracted from it and used as a
    start value to search for free IDR slot.

    Notes:

    1) this patch doesn't guarantee that the new object will have desired
    id. So it's up to user space how to handle new object with wrong id.

    2) After a sucessful id allocation attempt, "next_id" will be set back
    to -1 (if it was non-negative).

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Stanislav Kinsbursky
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Al Viro
    Cc: KOSAKI Motohiro
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsbursky
     

06 Oct, 2012

1 commit

  • Some coredump handlers want to create a core file in a way compatible with
    standard behavior. Standard behavior with fs.suid_dumpable = 2 is to
    create core file with uid=gid=0. However, there was no way for coredump
    handler to know that the process being dumped was suid'ed.

    This patch adds the new %d specifier for format_corename() which simply
    reports __get_dumpable(mm->flags), this is compatible with
    /proc/sys/fs/suid_dumpable we already have.

    Addresses https://bugzilla.redhat.com/show_bug.cgi?id=787135

    Developed during a discussion with Denys Vlasenko.

    Signed-off-by: Oleg Nesterov
    Cc: Denys Vlasenko
    Cc: Alex Kelly
    Cc: Andi Kleen
    Cc: Cong Wang
    Cc: Jiri Moskovcak
    Acked-by: Neil Horman
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

04 Aug, 2012

1 commit


02 Aug, 2012

1 commit

  • Pull second vfs pile from Al Viro:
    "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
    deadlock reproduced by xfstests 068), symlink and hardlink restriction
    patches, plus assorted cleanups and fixes.

    Note that another fsfreeze deadlock (emergency thaw one) is *not*
    dealt with - the series by Fernando conflicts a lot with Jan's, breaks
    userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
    for massive vfsmount leak; this is going to be handled next cycle.
    There probably will be another pull request, but that stuff won't be
    in it."

    Fix up trivial conflicts due to unrelated changes next to each other in
    drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
    delousing target_core_file a bit
    Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
    fs: Remove old freezing mechanism
    ext2: Implement freezing
    btrfs: Convert to new freezing mechanism
    nilfs2: Convert to new freezing mechanism
    ntfs: Convert to new freezing mechanism
    fuse: Convert to new freezing mechanism
    gfs2: Convert to new freezing mechanism
    ocfs2: Convert to new freezing mechanism
    xfs: Convert to new freezing code
    ext4: Convert to new freezing mechanism
    fs: Protect write paths by sb_start_write - sb_end_write
    fs: Skip atime update on frozen filesystem
    fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
    fs: Improve filesystem freezing handling
    switch the protection of percpu_counter list to spinlock
    nfsd: Push mnt_want_write() outside of i_mutex
    btrfs: Push mnt_want_write() outside of i_mutex
    fat: Push mnt_want_write() outside of i_mutex
    ...

    Linus Torvalds
     

01 Aug, 2012

3 commits

  • The number of ptes and swap entries are used in the oom killer's badness
    heuristic, so they should be shown in the tasklist dump.

    This patch adds those fields and replaces cpu and oom_adj values that are
    currently emitted. Cpu isn't interesting and oom_adj is deprecated and
    will be removed later this year, the same information is already displayed
    as oom_score_adj which is used internally.

    At the same time, make the documentation a little more clear to state this
    information is helpful to determine why the oom killer chose the task it
    did to kill.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Since per-BDI flusher threads were introduced in 2.6, the pdflush
    mechanism is not used any more. But the old interface exported through
    /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.

    For back-compatibility, printk warning information and return 2 to notify
    the users that the interface is removed.

    Signed-off-by: Wanpeng Li
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Fix of the documentation of /proc/sys/vm/page-cluster to match the
    behavior of the code and add some comments about what the tunable will
    change in that behavior.

    Signed-off-by: Christian Ehrhardt
    Acked-by: Jens Axboe
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Ehrhardt
     

31 Jul, 2012

1 commit

  • When the suid_dumpable sysctl is set to "2", and there is no core dump
    pipe defined in the core_pattern sysctl, a local user can cause core files
    to be written to root-writable directories, potentially with
    user-controlled content.

    This means an admin can unknowningly reintroduce a variation of
    CVE-2006-2451, allowing local users to gain root privileges.

    $ cat /proc/sys/fs/suid_dumpable
    2
    $ cat /proc/sys/kernel/core_pattern
    core
    $ ulimit -c unlimited
    $ cd /
    $ ls -l core
    ls: cannot access core: No such file or directory
    $ touch core
    touch: cannot touch `core': Permission denied
    $ OHAI="evil-string-here" ping localhost >/dev/null 2>&1 &
    $ pid=$!
    $ sleep 1
    $ kill -SEGV $pid
    $ ls -l core
    -rw------- 1 root kees 458752 Jun 21 11:35 core
    $ sudo strings core | grep evil
    OHAI=evil-string-here

    While cron has been fixed to abort reading a file when there is any
    parse error, there are still other sensitive directories that will read
    any file present and skip unparsable lines.

    Instead of introducing a suid_dumpable=3 mode and breaking all users of
    mode 2, this only disables the unsafe portion of mode 2 (writing to disk
    via relative path). Most users of mode 2 (e.g. Chrome OS) already use
    a core dump pipe handler, so this change will not break them. For the
    situations where a pipe handler is not defined but mode 2 is still
    active, crash dumps will only be written to fully qualified paths. If a
    relative path is defined (e.g. the default "core" pattern), dump
    attempts will trigger a printk yelling about the lack of a fully
    qualified path.

    Signed-off-by: Kees Cook
    Cc: Alexander Viro
    Cc: Alan Cox
    Cc: "Eric W. Biederman"
    Cc: Doug Ledford
    Cc: Serge Hallyn
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

30 Jul, 2012

1 commit

  • This adds symlink and hardlink restrictions to the Linux VFS.

    Symlinks:

    A long-standing class of security issues is the symlink-based
    time-of-check-time-of-use race, most commonly seen in world-writable
    directories like /tmp. The common method of exploitation of this flaw
    is to cross privilege boundaries when following a given symlink (i.e. a
    root process follows a symlink belonging to another user). For a likely
    incomplete list of hundreds of examples across the years, please see:
    http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp

    The solution is to permit symlinks to only be followed when outside
    a sticky world-writable directory, or when the uid of the symlink and
    follower match, or when the directory owner matches the symlink's owner.

    Some pointers to the history of earlier discussion that I could find:

    1996 Aug, Zygo Blaxell
    http://marc.info/?l=bugtraq&m=87602167419830&w=2
    1996 Oct, Andrew Tridgell
    http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
    1997 Dec, Albert D Cahalan
    http://lkml.org/lkml/1997/12/16/4
    2005 Feb, Lorenzo Hernández García-Hierro
    http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
    2010 May, Kees Cook
    https://lkml.org/lkml/2010/5/30/144

    Past objections and rebuttals could be summarized as:

    - Violates POSIX.
    - POSIX didn't consider this situation and it's not useful to follow
    a broken specification at the cost of security.
    - Might break unknown applications that use this feature.
    - Applications that break because of the change are easy to spot and
    fix. Applications that are vulnerable to symlink ToCToU by not having
    the change aren't. Additionally, no applications have yet been found
    that rely on this behavior.
    - Applications should just use mkstemp() or O_CREATE|O_EXCL.
    - True, but applications are not perfect, and new software is written
    all the time that makes these mistakes; blocking this flaw at the
    kernel is a single solution to the entire class of vulnerability.
    - This should live in the core VFS.
    - This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
    - This should live in an LSM.
    - This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)

    Hardlinks:

    On systems that have user-writable directories on the same partition
    as system files, a long-standing class of security issues is the
    hardlink-based time-of-check-time-of-use race, most commonly seen in
    world-writable directories like /tmp. The common method of exploitation
    of this flaw is to cross privilege boundaries when following a given
    hardlink (i.e. a root process follows a hardlink created by another
    user). Additionally, an issue exists where users can "pin" a potentially
    vulnerable setuid/setgid file so that an administrator will not actually
    upgrade a system fully.

    The solution is to permit hardlinks to only be created when the user is
    already the existing file's owner, or if they already have read/write
    access to the existing file.

    Many Linux users are surprised when they learn they can link to files
    they have no access to, so this change appears to follow the doctrine
    of "least surprise". Additionally, this change does not violate POSIX,
    which states "the implementation may require that the calling process
    has permission to access the existing file"[1].

    This change is known to break some implementations of the "at" daemon,
    though the version used by Fedora and Ubuntu has been fixed[2] for
    a while. Otherwise, the change has been undisruptive while in use in
    Ubuntu for the last 1.5 years.

    [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
    [2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279

    This patch is based on the patches in Openwall and grsecurity, along with
    suggestions from Al Viro. I have added a sysctl to enable the protected
    behavior, and documentation.

    Signed-off-by: Kees Cook
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Kees Cook
     

01 Jun, 2012

1 commit

  • Commit b231cca4381e ("message queues: increase range limits") changed
    mqueue default value when attr parameter is specified NULL from hard
    coded value to fs.mqueue.{msg,msgsize}_max sysctl value.

    This made large side effect. When user need to use two mqueue
    applications 1) using !NULL attr parameter and it require big message
    size and 2) using NULL attr parameter and only need small size message,
    app (1) require to raise fs.mqueue.msgsize_max and app (2) consume large
    memory size even though it doesn't need.

    Doug Ledford propsed to switch back it to static hard coded value.
    However it also has a compatibility problem. Some applications might
    started depend on the default value is tunable.

    The solution is to separate default value from maximum value.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Doug Ledford
    Acked-by: Doug Ledford
    Acked-by: Joe Korty
    Cc: Amerigo Wang
    Acked-by: Serge E. Hallyn
    Cc: Jiri Slaby
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

27 Apr, 2012

1 commit


07 Feb, 2012

1 commit


13 Jan, 2012

1 commit

  • The sysctl works on the current task's pid namespace, getting and setting
    its last_pid field.

    Writing is allowed for CAP_SYS_ADMIN-capable tasks thus making it possible
    to create a task with desired pid value. This ability is required badly
    for the checkpoint/restore in userspace.

    This approach suits all the parties for now.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

05 Dec, 2011

1 commit

  • Currently, messages are just output on the detection of stack
    overflow, which is not sufficient for systems that need a
    high reliability. This is because in general the overflow may
    corrupt data, and the additional corruption may occur due to
    reading them unless systems stop.

    This patch adds the sysctl parameter
    kernel.panic_on_stackoverflow and causes a panic when detecting
    the overflows of kernel, IRQ and exception stacks except user
    stack according to the parameter. It is disabled by default.

    Signed-off-by: Mitsuo Hayasaka
    Cc: yrl.pp-manager.tt@hitachi.com
    Cc: Randy Dunlap
    Cc: "H. Peter Anvin"
    Link: http://lkml.kernel.org/r/20111129060836.11076.12323.stgit@ltc219.sdl.hitachi.co.jp
    Signed-off-by: Ingo Molnar

    Mitsuo Hayasaka
     

01 Nov, 2011

1 commit

  • Userspace needs to know the highest valid capability of the running
    kernel, which right now cannot reliably be retrieved from the header files
    only. The fact that this value cannot be determined properly right now
    creates various problems for libraries compiled on newer header files
    which are run on older kernels. They assume capabilities are available
    which actually aren't. libcap-ng is one example. And we ran into the
    same problem with systemd too.

    Now the capability is exported in /proc/sys/kernel/cap_last_cap.

    [akpm@linux-foundation.org: make cap_last_cap const, per Ulrich]
    Signed-off-by: Dan Ballard
    Cc: Randy Dunlap
    Cc: Ingo Molnar
    Cc: Lennart Poettering
    Cc: Kay Sievers
    Cc: Ulrich Drepper
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Ballard
     

27 Jul, 2011

1 commit

  • Add support for the shm_rmid_forced sysctl. If set to 1, all shared
    memory objects in current ipc namespace will be automatically forced to
    use IPC_RMID.

    The POSIX way of handling shmem allows one to create shm objects and
    call shmdt(), leaving shm object associated with no process, thus
    consuming memory not counted via rlimits.

    With shm_rmid_forced=1 the shared memory object is counted at least for
    one process, so OOM killer may effectively kill the fat process holding
    the shared memory.

    It obviously breaks POSIX - some programs relying on the feature would
    stop working. So set shm_rmid_forced=1 only if you're sure nobody uses
    "orphaned" memory. Use shm_rmid_forced=0 by default for compatability
    reasons.

    The feature was previously impemented in -ow as a configure option.

    [akpm@linux-foundation.org: fix documentation, per Randy]
    [akpm@linux-foundation.org: fix warning]
    [akpm@linux-foundation.org: readability/conventionality tweaks]
    [akpm@linux-foundation.org: fix shm_rmid_forced/shm_forced_rmid confusion, use standard comment layout]
    Signed-off-by: Vasiliy Kulikov
    Cc: Randy Dunlap
    Cc: "Eric W. Biederman"
    Cc: "Serge E. Hallyn"
    Cc: Daniel Lezcano
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Cc: Alan Cox
    Cc: Solar Designer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     

24 Jul, 2011

1 commit

  • Refresh sysctl/kernel.txt. More specifically,

    - drop stale index entries
    - sync and sort index and entries
    - reflow sticking out paragraphs to colwidth 72
    - correct typos
    - cleanup whitespace

    Signed-off-by: Borislav Petkov
    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     

27 May, 2011

1 commit

  • Now, exe_file is not proc FS dependent, so we can use it to name core
    file. So we add %E pattern for core file name cration which extract path
    from mm_struct->exe_file. Then it converts slashes to exclamation marks
    and pastes the result to the core file name itself.

    This is useful for environments where binary names are longer than 16
    character (the current->comm limitation). Also where there are binaries
    with same name but in a different path. Further in case the binery itself
    changes its current->comm after exec.

    So by doing (s/$/#/ -- # is treated as git comment):

    $ sysctl kernel.core_pattern='core.%p.%e.%E'
    $ ln /bin/cat cat45678901234567890
    $ ./cat45678901234567890
    ^Z
    $ rm cat45678901234567890
    $ fg
    ^\Quit (core dumped)
    $ ls core*

    we now get:

    core.2434.cat456789012345.!root!cat45678901234567890 (deleted)

    Signed-off-by: Jiri Slaby
    Cc: Al Viro
    Cc: Alan Cox
    Reviewed-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

24 May, 2011

2 commits

  • max_user_instances was removed in this commit:

    commit 9df04e1f25effde823a600e755b51475d438f56b
    Author: Davide Libenzi
    Date: Thu Jan 29 14:25:26 2009 -0800

    epoll: drop max_user_instances and rely only on max_user_watches

    but the documentation entry was not removed.

    Cc: Davide Libenzi
    Signed-off-by: Lucian Adrian Grijincu
    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Lucian Adrian Grijincu
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    b43: fix comment typo reqest -> request
    Haavard Skinnemoen has left Atmel
    cris: typo in mach-fs Makefile
    Kconfig: fix copy/paste-ism for dell-wmi-aio driver
    doc: timers-howto: fix a typo ("unsgined")
    perf: Only include annotate.h once in tools/perf/util/ui/browsers/annotate.c
    md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course').
    treewide: fix a few typos in comments
    regulator: change debug statement be consistent with the style of the rest
    Revert "arm: mach-u300/gpio: Fix mem_region resource size miscalculations"
    audit: acquire creds selectively to reduce atomic op overhead
    rtlwifi: don't touch with treewide double semicolon removal
    treewide: cleanup continuations and remove logging message whitespace
    ath9k_hw: don't touch with treewide double semicolon removal
    include/linux/leds-regulator.h: fix syntax in example code
    tty: fix typo in descripton of tty_termios_encode_baud_rate
    xtensa: remove obsolete BKL kernel option from defconfig
    m68k: fix comment typo 'occcured'
    arch:Kconfig.locks Remove unused config option.
    treewide: remove extra semicolons
    ...

    Linus Torvalds
     

28 Apr, 2011

1 commit

  • In order to speedup packet filtering, here is an implementation of a
    JIT compiler for x86_64

    It is disabled by default, and must be enabled by the admin.

    echo 1 >/proc/sys/net/core/bpf_jit_enable

    It uses module_alloc() and module_free() to get memory in the 2GB text
    kernel range since we call helpers functions from the generated code.

    EAX : BPF A accumulator
    EBX : BPF X accumulator
    RDI : pointer to skb (first argument given to JIT function)
    RBP : frame pointer (even if CONFIG_FRAME_POINTER=n)
    r9d : skb->len - skb->data_len (headlen)
    r8 : skb->data

    To get a trace of generated code, use :

    echo 2 >/proc/sys/net/core/bpf_jit_enable

    Example of generated code :

    # tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24

    flen=18 proglen=147 pass=3 image=ffffffffa00b5000
    JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60
    JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00
    JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00
    JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be
    JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0
    JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00
    JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00
    JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24
    JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31
    JIT code: ffffffffa00b5090: c0 c9 c3

    BPF program is 144 bytes long, so native program is almost same size ;)

    (000) ldh [12]
    (001) jeq #0x800 jt 2 jf 8
    (002) ld [26]
    (003) and #0xffffff00
    (004) jeq #0xc0a81400 jt 16 jf 5
    (005) ld [30]
    (006) and #0xffffff00
    (007) jeq #0xc0a81400 jt 16 jf 17
    (008) jeq #0x806 jt 10 jf 9
    (009) jeq #0x8035 jt 10 jf 17
    (010) ld [28]
    (011) and #0xffffff00
    (012) jeq #0xc0a81400 jt 16 jf 13
    (013) ld [38]
    (014) and #0xffffff00
    (015) jeq #0xc0a81400 jt 16 jf 17
    (016) ret #65535
    (017) ret #0

    Signed-off-by: Eric Dumazet
    Cc: Arnaldo Carvalho de Melo
    Cc: Ben Hutchings
    Cc: Hagen Paul Pfeifer
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Apr, 2011

1 commit


19 Mar, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
    doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
    Update cpuset info & webiste for cgroups
    dcdbas: force SMI to happen when expected
    arch/arm/Kconfig: remove one to many l's in the word.
    asm-generic/user.h: Fix spelling in comment
    drm: fix printk typo 'sracth'
    Remove one to many n's in a word
    Documentation/filesystems/romfs.txt: fixing link to genromfs
    drivers:scsi Change printk typo initate -> initiate
    serial, pch uart: Remove duplicate inclusion of linux/pci.h header
    fs/eventpoll.c: fix spelling
    mm: Fix out-of-date comments which refers non-existent functions
    drm: Fix printk typo 'failled'
    coh901318.c: Change initate to initiate.
    mbox-db5500.c Change initate to initiate.
    edac: correct i82975x error-info reported
    edac: correct i82975x mci initialisation
    edac: correct commented info
    fs: update comments to point correct document
    target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
    ...

    Trivial conflict in fs/eventpoll.c (spelling vs addition)

    Linus Torvalds
     

17 Mar, 2011

1 commit


11 Feb, 2011

1 commit


14 Jan, 2011

2 commits

  • ctl_unnumbered.txt have been removed in Documentation directory so just
    also remove this invalid comments

    [akpm@linux-foundation.org: fix Documentation/sysctl/00-INDEX, per Dave]
    Signed-off-by: Jovi Zhang
    Cc: Dave Young
    Acked-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jovi Zhang
     
  • Add the %pK printk format specifier and the /proc/sys/kernel/kptr_restrict
    sysctl.

    The %pK format specifier is designed to hide exposed kernel pointers,
    specifically via /proc interfaces. Exposing these pointers provides an
    easy target for kernel write vulnerabilities, since they reveal the
    locations of writable structures containing easily triggerable function
    pointers. The behavior of %pK depends on the kptr_restrict sysctl.

    If kptr_restrict is set to 0, no deviation from the standard %p behavior
    occurs. If kptr_restrict is set to 1, the default, if the current user
    (intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
    (currently in the LSM tree), kernel pointers using %pK are printed as 0's.
    If kptr_restrict is set to 2, kernel pointers using %pK are printed as
    0's regardless of privileges. Replacing with 0's was chosen over the
    default "(null)", which cannot be parsed by userland %p, which expects
    "(nil)".

    [akpm@linux-foundation.org: check for IRQ context when !kptr_restrict, save an indent level, s/WARN/WARN_ONCE/]
    [akpm@linux-foundation.org: coding-style fixup]
    [randy.dunlap@oracle.com: fix kernel/sysctl.c warning]
    Signed-off-by: Dan Rosenberg
    Signed-off-by: Randy Dunlap
    Cc: James Morris
    Cc: Eric Dumazet
    Cc: Thomas Graf
    Cc: Eugene Teo
    Cc: Kees Cook
    Cc: Ingo Molnar
    Cc: David S. Miller
    Cc: Peter Zijlstra
    Cc: Eric Paris

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Rosenberg
     

09 Dec, 2010

1 commit

  • Eric Paris pointed out that it doesn't make sense to require
    both CAP_SYS_ADMIN and CAP_SYSLOG for certain syslog actions.
    So require CAP_SYSLOG, not CAP_SYS_ADMIN, when dmesg_restrict
    is set.

    (I'm also consolidating the now common error path)

    Signed-off-by: Serge E. Hallyn
    Acked-by: Eric Paris
    Acked-by: Kees Cook
    Signed-off-by: James Morris

    Serge E. Hallyn
     

12 Nov, 2010

1 commit

  • The kernel syslog contains debugging information that is often useful
    during exploitation of other vulnerabilities, such as kernel heap
    addresses. Rather than futilely attempt to sanitize hundreds (or
    thousands) of printk statements and simultaneously cripple useful
    debugging functionality, it is far simpler to create an option that
    prevents unprivileged users from reading the syslog.

    This patch, loosely based on grsecurity's GRKERNSEC_DMESG, creates the
    dmesg_restrict sysctl. When set to "0", the default, no restrictions are
    enforced. When set to "1", only users with CAP_SYS_ADMIN can read the
    kernel syslog via dmesg(8) or other mechanisms.

    [akpm@linux-foundation.org: explain the config option in kernel.txt]
    Signed-off-by: Dan Rosenberg
    Acked-by: Ingo Molnar
    Acked-by: Eugene Teo
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Rosenberg
     

28 Oct, 2010

1 commit

  • When dirty_ratio or dirty_bytes is written the other parameter is disabled
    and set to 0 (in dirty_bytes_handler() / dirty_ratio_handler()).

    We do the same for dirty_background_ratio and dirty_background_bytes.

    However, in the sysctl documentation, we say that the counterpart becomes
    a function of the old value, that is not correct.

    Clarify the documentation reporting the actual behaviour.

    Reviewed-by: Greg Thelen
    Acked-by: David Rientjes
    Signed-off-by: Andrea Righi
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Righi
     

10 Aug, 2010

1 commit

  • The oom killer tasklist dump, enabled with the oom_dump_tasks sysctl, is
    very helpful information in diagnosing why a user's task has been killed.
    It emits useful information such as each eligible thread's memory usage
    that can determine why the system is oom, so it should be enabled by
    default.

    Signed-off-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

28 Jun, 2010

1 commit


25 May, 2010

2 commits

  • …hen it should be reclaimed

    The kernel applies some heuristics when deciding if memory should be
    compacted or reclaimed to satisfy a high-order allocation. One of these
    is based on the fragmentation. If the index is below 500, memory will not
    be compacted. This choice is arbitrary and not based on data. To help
    optimise the system and set a sensible default for this value, this patch
    adds a sysctl extfrag_threshold. The kernel will only compact memory if
    the fragmentation index is above the extfrag_threshold.

    [randy.dunlap@oracle.com: Fix build errors when proc fs is not configured]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Add a proc file /proc/sys/vm/compact_memory. When an arbitrary value is
    written to the file, all zones are compacted. The expected user of such a
    trigger is a job scheduler that prepares the system before the target
    application runs.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

16 May, 2010

1 commit

  • With RPS inclusion, skb timestamping is not consistent in RX path.

    If netif_receive_skb() is used, its deferred after RPS dispatch.

    If netif_rx() is used, its done before RPS dispatch.

    This can give strange tcpdump timestamps results.

    I think timestamping should be done as soon as possible in the receive
    path, to get meaningful values (ie timestamps taken at the time packet
    was delivered by NIC driver to our stack), even if NAPI already can
    defer timestamping a bit (RPS can help to reduce the gap)

    Tom Herbert prefer to sample timestamps after RPS dispatch. In case
    sampling is expensive (HPET/acpi_pm on x86), this makes sense.

    Let admins switch from one mode to another, using a new
    sysctl, /proc/sys/net/core/netdev_tstamp_prequeue

    Its default value (1), means timestamps are taken as soon as possible,
    before backlog queueing, giving accurate timestamps.

    Setting a 0 value permits to sample timestamps when processing backlog,
    after RPS dispatch, to lower the load of the pre-RPS cpu.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Mar, 2010

1 commit

  • Presently, if panic_on_oom=2, the whole system panics even if the oom
    happend in some special situation (as cpuset, mempolicy....). Then,
    panic_on_oom=2 means painc_on_oom_always.

    Now, memcg doesn't check panic_on_oom flag. This patch adds a check.

    BTW, how it's useful ?

    kdump+panic_on_oom=2 is the last tool to investigate what happens in
    oom-ed system. When a task is killed, the sysytem recovers and there will
    be few hint to know what happnes. In mission critical system, oom should
    never happen. Then, panic_on_oom=2+kdump is useful to avoid next OOM by
    knowing precise information via snapshot.

    TODO:
    - For memcg, it's for isolate system's memory usage, oom-notiifer and
    freeze_at_oom (or rest_at_oom) should be implemented. Then, management
    daemon can do similar jobs (as kdump) or taking snapshot per cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Nick Piggin
    Reviewed-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

12 Dec, 2009

1 commit