11 Sep, 2015

40 commits

  • Khelper is affine to all CPUs. Now since it creates the
    call_usermodehelper_exec_[a]sync() kernel threads, those inherit the wide
    affinity.

    As such, explicitly forcing a wide affinity from those kernel threads
    is effectively a no-op.

    Just remove it. It's needless and it breaks CPU isolation users who
    rely on workqueue affinity tuning.

    Signed-off-by: Frederic Weisbecker
    Cc: Rik van Riel
    Reviewed-by: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • This patchset does a bunch of cleanups and converts khelper to use system
    unbound workqueues. The first 3 patches should be uncontroversial. The
    last 2 patches are debatable.

    Kmod creates kernel threads that perform userspace jobs and we want those
    to have a wide affinity so that they do not contend with busy CPUs. This is
    (partly) why we use khelper, which has a wide affinity that the kernel
    threads it creates can inherit. Now khelper is a dedicated workqueue
    that has singlethread properties which we aren't interested in.

    Hence those two debatable changes:

    _ We would like to use generic workqueues. System unbound workqueues are
    a very good candidate but they are not wide affine, only node affine.
    Now probably a node is enough to perform many parallel kmod jobs.

    _ We would like to remove the wait_for_helper kernel thread (UMH_WAIT_PROC
    handler) to use the workqueue. It means that if the workqueue blocks,
    and no other worker can take pending kmod request, we can be screwed.
    Now if we have 512 threads, this should be enough.

    This patch (of 5):

    Underscore prefixes on function names do little to explain the purpose
    of a function, and kmod has some interesting examples of them.

    Let's rename the following functions:

    * __call_usermodehelper -> call_usermodehelper_exec_work
    * ____call_usermodehelper -> call_usermodehelper_exec_async
    * wait_for_helper -> call_usermodehelper_exec_sync

    Signed-off-by: Frederic Weisbecker
    Cc: Rik van Riel
    Reviewed-by: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • If request_module() successfully runs modprobe, but modprobe exits with a
    non-zero status, then the return value from request_module() will be that
    (positive) error status. So the return from request_module can be:

    negative errno
    zero for success
    positive exit code.
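
    A caller that wants to tell these three cases apart could classify the
    return value roughly as below. This is a user-space sketch only: the enum
    and classify_request_module_ret() are hypothetical names, not kernel API.

```c
/* Hypothetical classification of a request_module()-style return value. */
enum modprobe_result {
	MODPROBE_OK,         /* 0: modprobe ran and reported success */
	MODPROBE_KERNEL_ERR, /* < 0: negative errno from the kernel side */
	MODPROBE_EXIT_ERR    /* > 0: modprobe ran but exited non-zero */
};

static enum modprobe_result classify_request_module_ret(int ret)
{
	if (ret < 0)
		return MODPROBE_KERNEL_ERR;
	if (ret > 0)
		return MODPROBE_EXIT_ERR;
	return MODPROBE_OK;
}
```

    The point of the sketch is that a plain "if (ret < 0)" check misses the
    positive case, so callers that care about modprobe's own failure need to
    test for any non-zero value.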

    Signed-off-by: NeilBrown
    Cc: Goldwyn Rodrigues
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Fix B-tree corruption when a new record is inserted at position 0 in the
    node in hfs_brec_insert().

    This applies to the hfs B-tree code the same change as Sergei
    Antonov's "hfsplus: fix B-tree corruption after insertion at position 0",
    to keep similar code paths in the hfs and hfsplus drivers in sync, where
    appropriate.

    Signed-off-by: Hin-Tak Leung
    Cc: Sergei Antonov
    Cc: Joe Perches
    Reviewed-by: Vyacheslav Dubeyko
    Cc: Anton Altaparmakov
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hin-Tak Leung
     
  • Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
    hfs_bnode_find() for finding or creating pages corresponding to an inode)
    are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
    and should not be page_cache_release()'ed until hfs_bnode_free().

    This patch fixes a problem I first saw in July 2012: merely running "du"
    on a large hfsplus-mounted directory a few times on a reasonably loaded
    system would confuse the hfsplus driver, which would complain about
    B-tree inconsistencies and generate a "BUG: Bad page state". Most
    recently, I could reproduce this problem on up-to-date Fedora 22 with the
    shipped kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other
    smaller mounts) and "du /mnt" simultaneously in two windows, where /mnt is
    a lightly-used QEMU VM image of the full Mac OS X 10.9:

    $ df -i / /home /mnt
    Filesystem                  Inodes   IUsed      IFree IUse% Mounted on
    /dev/mapper/fedora-root    3276800  551665    2725135   17% /
    /dev/mapper/fedora-home   52879360  716221   52163139    2% /home
    /dev/nbd0p2             4294967295 1387818 4293579477    1% /mnt

    After applying the patch, I was able to run "du /" (60+ times) and "du
    /mnt" (150+ times) continuously and simultaneously for 6+ hours.

    There are many reports of the hfsplus driver getting confused under load
    and generating "BUG: Bad page state" or other similar issues over the
    years. [1]

    The unpatched code [2] has always been wrong since it entered the kernel
    tree. The only reason why it gets away with it is that the
    kmap/memcpy/kunmap follow very quickly after the page_cache_release() so
    the kernel has not had a chance to reuse the memory for something else,
    most of the time.

    The current RW driver appears to have followed the design and development
    of the earlier read-only hfsplus driver [3], whereby version 0.1 (Dec
    2001) had a B-tree node-centric approach to
    read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
    migrating towards version 0.2 (June 2002) of caching and releasing pages
    per inode extents. When the current RW code first entered the kernel [2]
    in 2005, there was a REF_PAGES conditional (and "//" commented out code)
    to switch between B-node centric paging and inode-centric paging. There
    was a mistake with the direction of one of the REF_PAGES conditionals in
    __hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the
    read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
    removed, but a page_cache_release() was mistakenly left in (propagating
    the "REF_PAGES !REF_PAGE" mistake), and the commented-out
    page_cache_release() in bnode_release() (which should be spanned by
    !REF_PAGES) was never enabled.

    References:
    [1]:
    Michael Fox, Apr 2013
    http://www.spinics.net/lists/linux-fsdevel/msg63807.html
    ("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")

    Sasha Levin, Feb 2015
    http://lkml.org/lkml/2015/2/20/85 ("use after free")

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
    https://bugzilla.kernel.org/show_bug.cgi?id=42342
    https://bugzilla.kernel.org/show_bug.cgi?id=63841
    https://bugzilla.kernel.org/show_bug.cgi?id=78761

    [2]:
    http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
    fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db
    commit d1081202f1d0ee35ab0beb490da4b65d4bc763db
    Author: Andrew Morton
    Date: Wed Feb 25 16:17:36 2004 -0800

    [PATCH] HFS rewrite

    http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
    fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd

    commit 91556682e0bf004d98a529bf829d339abb98bbbd
    Author: Andrew Morton
    Date: Wed Feb 25 16:17:48 2004 -0800

    [PATCH] HFS+ support

    [3]:
    http://sourceforge.net/projects/linux-hfsplus/

    http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
    http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/

    http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
    fs/hfsplus/bnode.c?r1=1.4&r2=1.5

    Date: Thu Jun 6 09:45:14 2002 +0000
    Use buffer cache instead of page cache in bnode.c. Cache inode extents.

    [4]:
    http://git.kernel.org/cgit/linux/kernel/git/\
    stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d

    commit a5e3985fa014029eb6795664c704953720cc7f7d
    Author: Roman Zippel
    Date: Tue Sep 6 15:18:47 2005 -0700

    [PATCH] hfs: remove debug code

    Signed-off-by: Hin-Tak Leung
    Signed-off-by: Sergei Antonov
    Reviewed-by: Anton Altaparmakov
    Reported-by: Sasha Levin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Vyacheslav Dubeyko
    Cc: Sougata Santra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hin-Tak Leung
     
  • Dan Carpenter discovered a buffer overflow in the Coda file system
    readlink code. A userspace file system daemon can return a 4096 byte
    result which then triggers a one byte write past the allocated readlink
    result buffer.
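
    The off-by-one can be illustrated with a small user-space sketch.
    store_link() is a hypothetical helper, not the Coda code itself; it only
    shows why a result exactly as long as the buffer must be clamped before
    the terminating NUL is written.

```c
#include <string.h>

/*
 * Copy a daemon-supplied link target into buf and NUL-terminate it.
 * Returns the number of bytes kept (excluding the terminator).
 */
static size_t store_link(char *buf, size_t buflen,
			 const char *result, size_t retlen)
{
	if (buflen == 0)
		return 0;
	/* With '>' instead of '>=', retlen == buflen would slip through
	 * and the terminator below would land at buf[buflen]. */
	if (retlen >= buflen)
		retlen = buflen - 1;
	memcpy(buf, result, retlen);
	buf[retlen] = '\0';
	return retlen;
}
```

    With an 8-byte buffer and an 8-byte result, the sketch keeps 7 bytes and
    writes the terminator at index 7, the last valid position.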

    This does not trigger with an unmodified Coda implementation because Coda
    has a 1024 byte limit for symbolic links, however other userspace file
    systems using the Coda kernel module could be affected.

    Although this is an obvious overflow, I don't think this has to be handled
    as too sensitive from a security perspective because the overflow is on
    the Coda userspace daemon side which already needs root to open Coda's
    kernel device and to mount the file system before we get to the point that
    links can be read.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Jan Harkes
    Reported-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Harkes
     
  • "CONST variable" checks like:

    if (NULL != foo)
    and
    while (0 < bar(...))

    where a constant (or what appears to be a constant like an upper case
    identifier) is on the left of a comparison are generally preferred to be
    written using the constant on the right side like:

    if (foo != NULL)
    and
    while (bar(...) > 0)

    Add a test for this.

    Add a --fix option too, but only do it when the code is immediately
    surrounded by parentheses to avoid misfixing things like "(0 < bar() +
    constant)"

    Signed-off-by: Joe Perches
    Cc: Nicolas Morey Chaisemartin
    Cc: Viresh Kumar
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • commit 61031952f4c8 ("arch, x86: pmem api for ensuring durability of
    persistent memory updates") added a new __pmem annotation for sparse
    verification. Add __pmem to the $Sparse variable so checkpatch can
    appropriately ignore uses of this attribute too.

    Signed-off-by: Joe Perches
    Reviewed-by: Ross Zwisler
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Using checkpatch.pl with Perl 5.22.0 generates the following warning:

    Unescaped left brace in regex is deprecated, passed through in regex;

    This patch fixes the warnings by escaping occurrences of the left brace
    inside the regular expression.

    Signed-off-by: Eddie Kovsky
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eddie Kovsky
     
  • Fixes: and Link: lines may exceed 75 chars in the commit log.

    So too can stack dump and dmesg lines and lines that seem
    like filenames.

    And Fixes: lines don't need to have a "commit" prefix before the
    commit id.

    Add exceptions for these types of lines.

    Signed-off-by: Joe Perches
    Reported-by: Paul Bolle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Using 0x%d is wrong. Emit a message when it happens.
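
    A short user-space illustration of why the format is misleading; the two
    helper names are made up for this example.

```c
#include <stdio.h>

/* The mistake: a "0x" prefix followed by a decimal conversion. */
static void format_misleading(char *buf, size_t len, int value)
{
	snprintf(buf, len, "0x%d", value);
}

/* What was most likely intended: hexadecimal digits after "0x". */
static void format_hex(char *buf, size_t len, unsigned int value)
{
	snprintf(buf, len, "0x%x", value);
}
```

    For the value 255 the first helper produces "0x255", which reads as hex
    but is decimal, while the second produces "0xff".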

    Miscellanea:

    Improve the %Lu warning to match formats like %16Lu.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Making --strict the default for staging may help some people submit
    patches without obvious defects.

    Signed-off-by: Joe Perches
    Cc: Dan Carpenter
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Some of the block comment tests that are used only for networking are
    appropriate for all patches.

    For example, these styles are not encouraged:

        /*
         block comment without introductory *
         */
    and
        /*
         * block comment with line terminating */
    Remove the networking specific test and add comments.

    There are some infrequent false positives where code is lazily
    commented out using /* and */ rather than using #if 0/#endif blocks
    like:
    /* case foo:
    case bar: */
    case baz:

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • commit 34d8815f9512 ("checkpatch: add --showfile to allow input via pipe
    to show filenames") broke the --emacs with --file option.

    Fix it.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Sergey Senozhatsky has modified several destroy functions that can
    now be called with NULL values.

    - kmem_cache_destroy()
    - mempool_destroy()
    - dma_pool_destroy()

    Update checkpatch to warn when those functions are preceded by an if.

    Also make --fix rewrite these calls, but only when the code uses
    leading tabs for indentation.

    from:
        if (foo)
            <func>(foo);
    to:
        <func>(foo);
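
    The pattern that makes the guard redundant can be sketched in user space
    as follows. struct pool, pool_create() and pool_destroy() are hypothetical
    stand-ins for the kernel functions listed above.

```c
#include <stdlib.h>

struct pool {
	void *items;
};

static struct pool *pool_create(size_t n)
{
	struct pool *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->items = calloc(n, sizeof(long));
	if (!p->items) {
		free(p);
		return NULL;
	}
	return p;
}

/* NULL-tolerant destroy: callers no longer need "if (p)" around it. */
static void pool_destroy(struct pool *p)
{
	if (!p)
		return;
	free(p->items);
	free(p);
}
```

    Once the destroy function checks for NULL itself, an error path can call
    pool_destroy(p) unconditionally, which is the form the --fix rewrite
    produces.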

    Signed-off-by: Joe Perches
    Tested-by: Sergey Senozhatsky
    Cc: David Rientjes
    Cc: Julia Lawall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Some really long declaration macros exist.

    For instance:
    DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
    and
    DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(name, description)

    Increase the limit from 2 words to 6 after DECLARE/DEFINE uses.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Many lines exist like

    if (foo)
    bar;

    where the tabbed indentation of the branch is not one more than the "if"
    line above it.

    checkpatch should emit a warning on those lines.

    Miscellanea:

    o Remove comments from branch blocks
    o Skip blank lines in block

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Using BUG/BUG_ON crashes the kernel and is just unfriendly.

    Enable code that emits a warning on BUG/BUG_ON use.

    Make the code emit the message at WARNING level when scanning a patch and
    at CHECK level when scanning files so that script users don't feel an
    obligation to fix code that might be above their pay grade.

    Signed-off-by: Joe Perches
    Reported-by: Geert Uytterhoeven
    Tested-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Commit IDs should have commit descriptions too. Warn when a 12 to 40 byte
    SHA-1 is used in commit logs.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • In kmalloc_oob_krealloc_less, I think it is better to test
    the size2 boundary.

    Before krealloc is called, an access at position size1 is already
    out-of-bounds, while an access at position size2 is not. After krealloc
    is called, the access at position size2 becomes out-of-bounds. So using
    size2 is more correct.

    Signed-off-by: Wang Long
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Long
     
  • Signed-off-by: Wang Long
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Long
     
  • To further clarify the purpose of the "esc" argument, rename it to "only"
    to reflect that it is a limit, not a list of additional characters to
    escape.

    Signed-off-by: Kees Cook
    Suggested-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The esc argument is used to reduce which characters will be escaped. For
    example, using " " with ESCAPE_SPACE will not produce any escaped spaces.
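
    The filtering behaviour can be sketched in user space. This is a
    simplified model, not the kernel's string_escape_mem(): it handles only
    the whitespace escape class, and escape_space_only() and
    space_escape_letter() are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Escape letter for a character of the "space" class, or 0 if none. */
static char space_escape_letter(char c)
{
	switch (c) {
	case '\n': return 'n';
	case '\r': return 'r';
	case '\t': return 't';
	case '\v': return 'v';
	case '\f': return 'f';
	default:   return 0;
	}
}

/*
 * Escape src into dst (dstsz bytes, always NUL-terminated when dstsz > 0).
 * When `only` is non-NULL, a character is escaped only if it belongs to the
 * escape class AND also appears in `only`. Returns the length written.
 */
static size_t escape_space_only(const char *src, char *dst, size_t dstsz,
				const char *only)
{
	size_t n = 0;

	if (dstsz == 0)
		return 0;
	for (; *src; src++) {
		char e = space_escape_letter(*src);
		int wanted = e && (!only || strchr(only, *src));

		if (wanted) {
			if (n + 2 >= dstsz)
				break;
			dst[n++] = '\\';
			dst[n++] = e;
		} else {
			if (n + 1 >= dstsz)
				break;
			dst[n++] = *src;
		}
	}
	dst[n] = '\0';
	return n;
}
```

    Passing " " as the filter leaves the input untouched, because a plain
    space is not in the class being escaped and every other character is
    excluded by the filter.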

    Signed-off-by: Kees Cook
    Cc: Andy Shevchenko
    Cc: Rasmus Villemoes
    Cc: Mathias Krause
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • print_hex_dump_debug() is likely supposed to be analogous to pr_debug() or
    dev_dbg() & friends. Currently it will adhere to dynamic debug, but will
    not stub out prints if CONFIG_DEBUG is not set. Let's make it do the
    right thing, because I am tired of having my dmesg buffer full of hex
    dumps on production systems.

    Signed-off-by: Linus Walleij
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Walleij
     
  • __bitmap_parselist can accept leading or trailing whitespace in each
    chunk it parses. If the input contains valid ranges, there is no reason
    to reject it.

    For example, bitmap_parselist(" 1-3, 5, ", &mask, nmaskbits). After
    separating the string, we get " 1-3", " 5", and " ". It's possible and
    reasonable to accept such a string as long as the parsing result is correct.
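
    A much simplified user-space sketch of such a tolerant list parser is
    shown below. It is not the kernel implementation: parse_list() is a
    hypothetical helper limited to 64 bits, and it returns -EINVAL for every
    kind of malformed input, including a range with a dangling '-'.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>

static const char *skip_ws(const char *p)
{
	while (isspace((unsigned char)*p))
		p++;
	return p;
}

/* Parse a decimal bit number below nbits; -1 if missing or too large. */
static int parse_num(const char **pp, unsigned int nbits)
{
	const char *p = *pp;
	unsigned long v = 0;

	if (!isdigit((unsigned char)*p))
		return -1;
	while (isdigit((unsigned char)*p)) {
		v = v * 10 + (unsigned long)(*p - '0');
		if (v >= nbits)
			return -1;
		p++;
	}
	*pp = p;
	return (int)v;
}

/* Parse "a-b, c, ..." into *mask; whitespace around chunks is ignored. */
static int parse_list(const char *s, uint64_t *mask, unsigned int nbits)
{
	uint64_t m = 0;

	if (nbits > 64)
		return -EINVAL;
	for (;;) {
		int a, b;

		s = skip_ws(s);
		if (*s == '\0')
			break;
		if (*s == ',') {	/* empty chunk, e.g. a trailing comma */
			s++;
			continue;
		}
		a = parse_num(&s, nbits);
		if (a < 0)
			return -EINVAL;
		b = a;
		if (*s == '-') {
			s++;
			b = parse_num(&s, nbits);
			if (b < a)	/* also catches a missing range end */
				return -EINVAL;
		}
		s = skip_ws(s);
		if (*s != ',' && *s != '\0')
			return -EINVAL;
		for (; a <= b; a++)
			m |= (uint64_t)1 << a;
	}
	*mask = m;
	return 0;
}
```

    With this sketch, " 1-3, 5, " yields bits 1, 2, 3 and 5, while "1,0-" is
    rejected and leaves the caller's mask unchanged.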

    Signed-off-by: Pan Xinhui
    Cc: Yury Norov
    Cc: Chris Metcalf
    Cc: Rasmus Villemoes
    Cc: Sudeep Holla
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pan Xinhui
     
  • If the string ends with '-', for example bitmap_parselist("1,0-", &mask,
    nmaskbits), it is not a valid pattern, so add a check after the loop and
    return -EINVAL in that case.

    Signed-off-by: Pan Xinhui
    Cc: Yury Norov
    Cc: Chris Metcalf
    Cc: Rasmus Villemoes
    Cc: Sudeep Holla
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pan Xinhui
     
  • We can avoid in-loop incrementation of ndigits. Save current totaldigits
    to ndigits before loop, and check ndigits against totaldigits after the
    loop.

    Signed-off-by: Pan Xinhui
    Cc: Yury Norov
    Cc: Chris Metcalf
    Cc: Rasmus Villemoes
    Cc: Sudeep Holla
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pan Xinhui
     
  • Convert from manual allocation/copy_from_user/... to kstrto*() family
    which were designed for exactly that.

    One case cannot be converted to kstrto*_from_user(), which would make the
    code even simpler, because of whitespace stripping. Oh well...

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • strtol(3) et al accept "-0", so should we.
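
    The user-space behaviour being referred to can be checked with a small
    wrapper around strtol(3). parse_long() is a hypothetical helper written
    for this illustration, not the kernel's kstrto*() code.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse the whole string as a base-10 long; 0 on success, -EINVAL if not. */
static int parse_long(const char *s, long *out)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(s, &end, 10);
	if (end == s || *end != '\0' || errno == ERANGE)
		return -EINVAL;
	*out = v;
	return 0;
}
```

    Called with "-0", the helper succeeds and stores 0, which is the
    behaviour this change mirrors.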

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: Jan Kara
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Theodore Ts'o
    Cc: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Anil's email address bounces and he hasn't had a signoff
    in over 5 years.

    Signed-off-by: Joe Perches
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • The other two implementations of pr_debug_ratelimited include pr_fmt,
    along with every other pr_* function. But pr_debug_ratelimited forgot to
    add it with the CONFIG_DYNAMIC_DEBUG implementation.

    This patch unifies the behavior.

    Signed-off-by: Jason A. Donenfeld
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason A. Donenfeld
     
  • Commit e0e817392b9a ("CRED: Add some configurable debugging [try #6]")
    added the kdebug mechanism to this file back in 2009.

    The kdebug macro calls no_printk which always evaluates arguments.

    Most of the kdebug uses have an unnecessary call of
    atomic_read(&cred->usage)

    Make the kdebug macro do nothing by defining it with
    do { if (0) no_printk(...); } while (0)
    when not enabled.
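
    The effect of the "if (0)" guard can be demonstrated in user space.
    fake_no_printk() below is a stand-in for the kernel's no_printk(), and
    read_usage() plays the role of the atomic_read() argument; all names in
    the sketch are hypothetical.

```c
static int reads; /* how often the "expensive" argument was evaluated */

static int read_usage(void)
{
	return ++reads;
}

/* A real function call: its arguments are evaluated, nothing is printed. */
static int fake_no_printk(const char *fmt, ...)
{
	(void)fmt;
	return 0;
}

/* Old style: the call is always made, so arguments are always evaluated. */
#define kdebug_old(...) fake_no_printk(__VA_ARGS__)

/* New style: still type-checked by the compiler, but never executed. */
#define kdebug_new(...) \
	do { if (0) fake_no_printk(__VA_ARGS__); } while (0)

static int count_reads_old(void)
{
	reads = 0;
	kdebug_old("usage=%d\n", read_usage());
	return reads;
}

static int count_reads_new(void)
{
	reads = 0;
	kdebug_new("usage=%d\n", read_usage());
	return reads;
}
```

    In this sketch the old macro evaluates read_usage() once per use and the
    guarded macro never does, which is where the text size saving shown below
    would come from.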

    $ size kernel/cred.o* (defconfig x86-64)
    text data bss dec hex filename
    2748 336 8 3092 c14 kernel/cred.o.new
    2788 336 8 3132 c3c kernel/cred.o.old

    Miscellanea:
    o Neaten the #define kdebug macros while there

    Signed-off-by: Joe Perches
    Cc: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Signed-off-by: Wei Yongjun
    Acked-by: Steven Rostedt
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yongjun
     
  • Signed-off-by: Vasily Kulikov
    Cc: Solar Designer
    Cc: Thomas Gleixner
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Kulikov
     
  • Poison pointer values should be small enough to fit in
    non-mmap'able/hardly-mmap'able space. E.g. on x86 "poison pointer space"
    is located starting from 0x0. Given unprivileged users cannot mmap
    anything below mmap_min_addr, it should be safe to use poison pointers
    lower than mmap_min_addr.

    The current poison pointer values of LIST_POISON{1,2} might be too big for
    mmap_min_addr values less than or equal to 1 MB (common case, e.g. Ubuntu
    uses only 0x10000). There is little point in using such a big value given
    the "poison pointer space" below 1 MB is not yet exhausted. Changing it
    to a smaller value solves the problem for small mmap_min_addr setups.
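
    The comparison can be made concrete with a small sketch. The poison
    values used here are assumptions for illustration (0x00100100 for the old
    LIST_POISON1 and 0x100 for the new one, ignoring any per-architecture
    offset); only the 0x10000 mmap_min_addr figure comes from the text above.
    user_mappable() is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define OLD_LIST_POISON1 ((uintptr_t)0x00100100)
#define NEW_LIST_POISON1 ((uintptr_t)0x100)
#define EXAMPLE_MMAP_MIN_ADDR ((uintptr_t)0x10000)

/* True if an unprivileged mmap() could place a mapping at addr. */
static bool user_mappable(uintptr_t addr, uintptr_t mmap_min_addr)
{
	return addr >= mmap_min_addr;
}
```

    Under these assumptions the old value lies above mmap_min_addr, so an
    unprivileged user could map it, while the new value lies below it.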

    The values are suggested by Solar Designer:
    http://www.openwall.com/lists/oss-security/2015/05/02/6

    Signed-off-by: Vasily Kulikov
    Cc: Solar Designer
    Cc: Thomas Gleixner
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Kulikov
     
  • The proc_subdir_lock spinlock is used to allow only one task at a time to
    change the proc directory structure or to look up information
    in it. However, the information lookup part can actually be entered by
    more than one task as the pde_get() and pde_put() reference count update
    calls in the critical sections are atomic increment and decrement
    respectively and so are safe with concurrent updates.

    The x86 architecture already uses qrwlock, which is fair, and other
    architectures like ARM are in the process of switching to qrwlock. So
    unfairness shouldn't be a concern in that conversion.

    This patch changed the proc_subdir_lock to a rwlock in order to enable
    concurrent lookup. The following functions were modified to take a
    write lock:
    - proc_register()
    - remove_proc_entry()
    - remove_proc_subtree()

    The following functions were modified to take a read lock:
    - xlate_proc_name()
    - proc_lookup_de()
    - proc_readdir_de()

    A parallel /proc filesystem search with the "find" command (1000 threads)
    was run on a 4-socket Haswell-EX box (144 threads). Before the patch, the
    parallel search took about 39s. After the patch, the parallel find took
    only 25s, a saving of about 14s.

    The micro-benchmark that I used was artificial, but it was used to
    reproduce an exit hanging problem that I saw in a real application. In
    fact, allowing only one task to do a lookup seems too limiting to me.

    Signed-off-by: Waiman Long
    Acked-by: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Cc: Nicolas Dichtel
    Cc: Al Viro
    Cc: Scott J Norton
    Cc: Douglas Hatch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Currently, /proc/PID/map_files/ is restricted to CAP_SYS_ADMIN, and is
    only exposed if CONFIG_CHECKPOINT_RESTORE is set.

    Each mapped file region gets a symlink in /proc/PID/map_files/
    corresponding to the virtual address range at which it is mapped. The
    symlinks work like the symlinks in /proc/PID/fd/, so you can follow them
    to the backing file even if that backing file has been unlinked.
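
    As an illustration of how a tool might use these entries, the sketch
    below turns a symlink name back into an address range. It assumes the
    names have the form "<start>-<end>" in hexadecimal, which is not spelled
    out in this message; parse_map_files_name() is a hypothetical helper.

```c
#include <stdio.h>

/* Parse a name like "400000-452000"; returns 0 on success, -1 otherwise. */
static int parse_map_files_name(const char *name,
				unsigned long *start, unsigned long *end)
{
	char trailing;

	/* Exactly two conversions must succeed; a third means extra text. */
	if (sscanf(name, "%lx-%lx%c", start, end, &trailing) != 2)
		return -1;
	return *start < *end ? 0 : -1;
}
```

    A caller would typically readdir() the map_files/ directory, parse each
    name this way, and then stat() the entry to account for the backing file.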

    Currently, files which are mapped, unlinked, and closed are impossible to
    stat() from userspace. Exposing /proc/PID/map_files/ closes this
    functionality "hole".

    Not being able to stat() such files makes noticing and explicitly
    accounting for the space they use on the filesystem impossible. You can
    work around this by summing up the space used by every file in the
    filesystem and subtracting that total from what statfs() tells you, but
    that obviously isn't great, and it becomes unworkable once your filesystem
    becomes large enough.

    This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and
    adjusts the permissions enforced on it as follows:

    * proc_map_files_lookup()
    * proc_map_files_readdir()
    * map_files_d_revalidate()

    Remove the CAP_SYS_ADMIN restriction, leaving only the current
    restriction requiring PTRACE_MODE_READ. The information made
    available to userspace by these three functions is already
    available in /proc/PID/maps with MODE_READ, so I don't see any
    reason to limit them any further (see below for more detail).

    * proc_map_files_follow_link()

    This stub has been added, and requires that the user have
    CAP_SYS_ADMIN in order to follow the links in map_files/,
    since there was concern on LKML both about the potential for
    bypassing permissions on ancestor directories in the path to
    files pointed to, and about what happens with more exotic
    memory mappings created by some drivers (e.g. dma-buf).

    In older versions of this patch, I changed every permission check in
    the four functions above to enforce MODE_ATTACH instead of MODE_READ.
    This was an oversight on my part, and after revisiting the discussion
    it seems that nobody was concerned about anything outside of what is
    made possible by ->follow_link(). So in this version, I've left the
    checks for PTRACE_MODE_READ as-is.

    [akpm@linux-foundation.org: catch up with concurrent proc_pid_follow_link() changes]
    Signed-off-by: Calvin Owens
    Reviewed-by: Kees Cook
    Cc: Andy Lutomirski
    Cc: Cyrill Gorcunov
    Cc: Joe Perches
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Calvin Owens
     
  • Reading/writing a /proc/kpage* file may take a long time on machines with
    a lot of RAM installed.

    Signed-off-by: Vladimir Davydov
    Suggested-by: Andres Lagar-Cavilla
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • As noted by Minchan, a benefit of reading idle flag from /proc/kpageflags
    is that one can easily filter dirty and/or unevictable pages while
    estimating the size of unused memory.

    Note that idle flag read from /proc/kpageflags may be stale in case the
    page was accessed via a PTE, because it would be too costly to iterate
    over all page mappings on each /proc/kpageflags read to provide an
    up-to-date value. To make sure the flag is up-to-date one has to read
    /sys/kernel/mm/page_idle/bitmap first.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Knowing the portion of memory that is not used by a certain application or
    memory cgroup (idle memory) can be useful for partitioning the system
    efficiently, e.g. by setting memory cgroup limits appropriately.
    Currently, the only means to estimate the amount of idle memory provided
    by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
    access bit for all pages mapped to a particular process by writing 1 to
    clear_refs, wait for some time, and then count smaps:Referenced. However,
    this method has two serious shortcomings:

    - it does not count unmapped file pages
    - it affects the reclaimer logic

    To overcome these drawbacks, this patch introduces two new page flags,
    Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
    A page's Idle flag can only be set from userspace by setting the bit in
    /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
    and it is cleared whenever the page is accessed either through page tables
    (it is cleared in page_referenced() in this case) or using the read(2)
    system call (mark_page_accessed()). Thus by setting the Idle flag for
    pages of a particular workload, which can be found e.g. by reading
    /proc/PID/pagemap, waiting for some time to let the workload access its
    working set, and then reading the bitmap file, one can estimate the amount
    of pages that are not used by the workload.
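
    To make "the offset corresponding to the page" concrete, the sketch below
    maps a page frame number to a position in the bitmap file. It assumes the
    layout described in the kernel's idle page tracking documentation, an
    array of 8-byte words with one bit per PFN, which this message itself
    does not spell out; struct idle_bit_pos and idle_bit_for_pfn() are
    hypothetical names.

```c
#include <stdint.h>

struct idle_bit_pos {
	uint64_t byte_offset; /* file offset of the 8-byte word */
	unsigned int bit;     /* bit index inside that word */
};

/* Assumed layout: PFN i lives in bit i % 64 of 8-byte word i / 64. */
static struct idle_bit_pos idle_bit_for_pfn(uint64_t pfn)
{
	struct idle_bit_pos pos = {
		.byte_offset = (pfn / 64) * 8,
		.bit = (unsigned int)(pfn % 64),
	};
	return pos;
}
```

    Under that assumption, PFN 130 maps to bit 2 of the word at byte offset
    16, so a tool would read or write 8 bytes at that offset.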

    The Young page flag is used to avoid interference with the memory
    reclaimer. A page's Young flag is set whenever the Access bit of a page
    table entry pointing to the page is cleared by writing to the bitmap file.
    If page_referenced() is called on a Young page, it will add 1 to its
    return value, therefore concealing the fact that the Access bit was
    cleared.

    Note, since there is no room for extra page flags on 32 bit, this feature
    uses extended page flags when compiled on 32 bit.

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: kpageidle requires an MMU]
    [akpm@linux-foundation.org: decouple from page-flags rework]
    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov