13 Sep, 2015

2 commits

  • Fix up the writeback plugging introduced in commit d353d7587d02
    ("writeback: plug writeback at a high level") that then caused problems
    due to the unplug happening with a spinlock held.

    * writeback-plugging:
    writeback: plug writeback in wb_writeback() and writeback_inodes_wb()
    Revert "writeback: plug writeback at a high level"

    Linus Torvalds
     
  • We had to revert the plugging in writeback_sb_inodes() because the
    wb->list_lock is held, but we could easily plug at a higher level before
    taking that lock, and unplug after releasing it. This does that.
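
    The ordering described above can be sketched in userspace C. This is
    only a mock under stated assumptions: mock_lock, mock_plug and every
    function below are hypothetical stand-ins, not the kernel's
    blk_start_plug()/blk_finish_plug() or wb->list_lock. The point is that
    the flush, which may block, runs only after the lock is released.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for this sketch; none of these are kernel APIs. */
struct mock_lock { bool held; };
struct mock_plug {
    int queued;     /* requests batched while the plug is active */
    int submitted;  /* requests handed on when the plug is finished */
};

static void mock_lock_acquire(struct mock_lock *l)
{
    assert(!l->held);
    l->held = true;
}

static void mock_lock_release(struct mock_lock *l)
{
    assert(l->held);
    l->held = false;
}

static void mock_start_plug(struct mock_plug *plug)
{
    plug->queued = 0;
    plug->submitted = 0;
}

/* Flushing may block in a real system, so it must not run under the lock. */
static void mock_finish_plug(struct mock_plug *plug, const struct mock_lock *l)
{
    assert(!l->held);
    plug->submitted += plug->queued;
    plug->queued = 0;
}

/* Plays the role of the inner loop that runs with the lock held. */
static void mock_write_inodes_locked(struct mock_plug *plug,
                                     const struct mock_lock *l, int nr)
{
    assert(l->held);
    plug->queued += nr;
}

/* Plug above the lock, unplug after dropping it. Returns requests submitted. */
static int mock_writeback(struct mock_lock *l, int nr)
{
    struct mock_plug plug;

    mock_start_plug(&plug);
    mock_lock_acquire(l);
    mock_write_inodes_locked(&plug, l, nr);
    mock_lock_release(l);
    mock_finish_plug(&plug, l);

    return plug.submitted;
}
```

    Calling mock_finish_plug() before mock_lock_release() would trip the
    assertion, which is the userspace analogue of the "sleeping function
    called from invalid context" report.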

    Chris will run performance numbers, just to verify that this approach is
    comparable to the alternative (we could just drop and re-take the lock
    around the blk_finish_plug() rather than these two commits).

    I'd have preferred waiting for actual performance numbers before picking
    one approach over the other, but I don't want to release rc1 with the
    known "sleeping function called from invalid context" issue, so I'll
    pick this cleanup version for now. But if the numbers show that we
    really want to plug just at the writeback_sb_inodes() level, and we
    should just play ugly games with the spinlock, we'll switch to that.

    Cc: Chris Mason
    Cc: Josef Bacik
    Cc: Dave Chinner
    Cc: Neil Brown
    Cc: Jan Kara
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

12 Sep, 2015

7 commits

  • Merge fourth patch-bomb from Andrew Morton:

    - sys_membarrier syscall

    - seq_file interface changes

    - a few misc fixups

    * emailed patches from Andrew Morton:
    revert "ocfs2/dlm: use list_for_each_entry instead of list_for_each"
    mm/early_ioremap: add explicit #include of asm/early_ioremap.h
    fs/seq_file: convert int seq_vprint/seq_printf/etc... returns to void
    selftests: enhance membarrier syscall test
    selftests: add membarrier syscall test
    sys_membarrier(): system-wide memory barrier (generic, x86)
    MODSIGN: fix a compilation warning in extract-cert

    Linus Torvalds
     
  • Revert commit f83c7b5e9fd6 ("ocfs2/dlm: use list_for_each_entry instead
    of list_for_each").

    list_for_each_entry() will dereference its `pos' argument, which can be
    NULL in dlm_process_recovery_data().
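
    A minimal userspace sketch of the search pattern that plain
    list_for_each() allows: the cursor is a raw list node and the result is
    a separate pointer that stays NULL when nothing matches. The list
    macros below are simplified copies written for this example, and
    demo_lock and find_lock() are hypothetical names, not the ocfs2 code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace version of a circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_for_each(pos, head) \
    for ((pos) = (head)->next; (pos) != (head); (pos) = (pos)->next)

/* Hypothetical payload type for the sketch. */
struct demo_lock {
    int cookie;
    struct list_head list;
};

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/*
 * Search with a raw list_head cursor and a separate result pointer.
 * The result is only assigned on a match, so callers can test it for NULL.
 */
static struct demo_lock *find_lock(struct list_head *head, int cookie)
{
    struct list_head *iter;
    struct demo_lock *lock = NULL;

    list_for_each(iter, head) {
        struct demo_lock *cur = list_entry(iter, struct demo_lock, list);

        if (cur->cookie == cookie) {
            lock = cur;
            break;
        }
    }
    return lock;
}
```

    With an entry-typed cursor the same variable doubles as the loop
    position, so code that relies on it being NULL before or after the
    loop no longer behaves the same way.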

    Reported-by: Julia Lawall
    Reported-by: Fengguang Wu
    Cc: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The seq_ function return values were frequently misused.

    See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
    seq_has_overflowed() and make public")

    All uses of these return values have been removed, so convert the
    return types to void.
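
    As an illustration of the calling convention this implies, here is a
    userspace mock. struct mock_seq and both functions are hypothetical
    and only imitate the idea of a fixed buffer whose overflow is queried
    separately, in the spirit of seq_has_overflowed(); they are not the
    kernel implementation.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace mock of a seq_file-like buffer. */
struct mock_seq {
    char *buf;
    size_t size;   /* capacity of buf */
    size_t count;  /* bytes used; count == size marks overflow */
};

/* Returns nothing: callers cannot misread a return value as success. */
static void mock_seq_printf(struct mock_seq *m, const char *fmt, ...)
{
    va_list args;
    int len;

    if (m->count >= m->size)
        return;  /* already overflowed, nothing more to do */

    va_start(args, fmt);
    len = vsnprintf(m->buf + m->count, m->size - m->count, fmt, args);
    va_end(args);

    if (len >= 0 && (size_t)len < m->size - m->count)
        m->count += (size_t)len;
    else
        m->count = m->size;  /* output did not fit: flag overflow */
}

/* The only way to learn whether output was truncated. */
static bool mock_seq_has_overflowed(const struct mock_seq *m)
{
    return m->count == m->size;
}
```

    A caller prints unconditionally and checks mock_seq_has_overflowed()
    once at the end, instead of testing each print call.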

    Miscellanea:

    o Move seq_put_decimal_ and seq_escape prototypes closer to the
    other seq_vprintf prototypes
    o Reorder seq_putc and seq_puts to return early on overflow
    o Add argument names to seq_vprintf and seq_printf
    o Update the seq_escape kernel-doc
    o Convert a couple of leading spaces to tabs in seq_escape

    Signed-off-by: Joe Perches
    Cc: Al Viro
    Cc: Steven Rostedt
    Cc: Mark Brown
    Cc: Stephen Rothwell
    Cc: Joerg Roedel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • This reverts commit d353d7587d02116b9732d5c06615aed75a4d3a47.

    Doing the block layer plug/unplug inside writeback_sb_inodes() is
    broken, because that function is actually called with a spinlock held:
    wb->list_lock, as pointed out by Chris Mason.

    Chris suggested just dropping and re-taking the spinlock around the
    blk_finish_plug() call (the plugging itself can happen under the
    spinlock), and that would technically work, but is just disgusting.

    We do something fairly similar - but not quite as disgusting because we
    at least have a better reason for it - in writeback_single_inode(), so
    it's not like the caller can depend on the lock being held over the
    call, but in this case there just isn't any good reason for that
    "release and re-take the lock" pattern.

    [ In general, we should really strive to avoid the "release and retake"
    pattern for locks, because in the general case it can easily cause
    subtle bugs when the caller caches any state around the call that
    might be invalidated by dropping the lock even just temporarily. ]

    But in this case, the plugging should be easy to just move up to the
    callers before the spinlock is taken, which should even improve the
    effectiveness of the plug. So there is really no good reason to play
    games with locking here.

    I'll send off a test-patch so that Dave Chinner can verify that that
    plug movement works. In the meantime this just reverts the problematic
    commit and adds a comment to the function so that we hopefully don't
    make this mistake again.

    Reported-by: Chris Mason
    Cc: Josef Bacik
    Cc: Dave Chinner
    Cc: Neil Brown
    Cc: Jan Kara
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull btrfs cleanups and fixes from Chris Mason:
    "These are small cleanups, and also some fixes for our async worker
    thread initialization.

    I was having some trouble testing these, but it ended up being a
    combination of changing around my test servers and a shiny new
    schedule while atomic from the new start/finish_plug in
    writeback_sb_inodes().

    That one only hits on btrfs raid5/6 or MD raid10, and if I wasn't
    changing a bunch of things in my test setup at once it would have been
    really clear. Fix for writeback_sb_inodes() on the way as well"

    * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: cleanup: remove unnecessary check before btrfs_free_path is called
    btrfs: async_thread: Fix workqueue 'max_active' value when initializing
    btrfs: Add raid56 support for updating num_tolerated_disk_barrier_failures in btrfs_balance
    btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failures
    btrfs: Remove noused chunk_tree and chunk_objectid from scrub_enumerate_chunks and scrub_chunk
    btrfs: Update out-of-date "skip parity stripe" comment

    Linus Torvalds
     
  • Pull Ceph update from Sage Weil:
    "There are a few fixes for snapshot behavior with CephFS and support
    for the new keepalive protocol from Zheng, a libceph fix that affects
    both RBD and CephFS, a few bug fixes and cleanups for RBD from Ilya,
    and several small fixes and cleanups from Jianpeng and others"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: improve readahead for file holes
    ceph: get inode size for each append write
    libceph: check data_len in ->alloc_msg()
    libceph: use keepalive2 to verify the mon session is alive
    rbd: plug rbd_dev->header.object_prefix memory leak
    rbd: fix double free on rbd_dev->header_name
    libceph: set 'exists' flag for newly up osd
    ceph: cleanup use of ceph_msg_get
    ceph: no need to get parent inode in ceph_open
    ceph: remove the useless judgement
    ceph: remove redundant test of head->safe and silence static analysis warnings
    ceph: fix queuing inode to mdsdir's snaprealm
    libceph: rename con_work() to ceph_con_workfn()
    libceph: Avoid holding the zero page on ceph_msgr_slab_init errors
    libceph: remove the unused macro AES_KEY_SIZE
    ceph: invalidate dirty pages after forced umount
    ceph: EIO all operations after forced umount

    Linus Torvalds
     
  • Pull GFS2 updates from Bob Peterson:
    "Here is a list of patches we've accumulated for GFS2 for the current
    upstream merge window. This time we've only got six patches, many of
    which are very small:

    - three cleanups from Andreas Gruenbacher, including a nice cleanup
    of the sequence file code for the sbstats debugfs file.

    - a patch from Ben Hutchings that changes statistics variables from
    signed to unsigned.

    - two patches from me that increase GFS2's glock scalability by
    switching from a conventional hash table to rhashtable"

    * tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
    gfs2: A minor "sbstats" cleanup
    gfs2: Fix a typo in a comment
    gfs2: Make statistics unsigned, suitable for use with do_div()
    GFS2: Use resizable hash table for glocks
    GFS2: Move glock superblock pointer to field gl_name
    gfs2: Simplify the seq file code for "sbstats"

    Linus Torvalds
     

11 Sep, 2015

18 commits

  • Pull blk-cg updates from Jens Axboe:
    "A bit later in the cycle, but this has been in the block tree for a
    while. This is basically four patchsets from Tejun, that improve our
    buffered cgroup writeback. It was dependent on the other cgroup
    changes, but they went in earlier in this cycle.

    Series 1 is set of 5 patches that has cgroup writeback updates:

    - bdi_writeback iteration fix which could lead to some wb's being
    skipped or repeated during e.g. sync under memory pressure.

    - Simplification of wb work wait mechanism.

    - Writeback tracepoints updated to report cgroup.

    Series 2 is a set of updates for the CFQ cgroup writeback handling:

    cfq has always charged all async IOs to the root cgroup. It didn't
    have much choice as writeback didn't know about cgroups and there
    was no way to tell who to blame for a given writeback IO.
    writeback finally grew support for cgroups and now tags each
    writeback IO with the appropriate cgroup to charge it against.

    This patchset updates cfq so that it follows the blkcg each bio is
    tagged with. Async cfq_queues are now shared across cfq_group,
    which is per-cgroup, instead of per-request_queue cfq_data. This
    makes all IOs follow the weight based IO resource distribution
    implemented by cfq.

    - Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff.

    - Other misc review points addressed, acks added and rebased.

    Series 3 is the blkcg policy cleanup patches:

    This patchset contains assorted cleanups for blkcg_policy methods
    and blk[c]g_policy_data handling.

    - alloc/free added for blkg_policy_data. exit dropped.

    - alloc/free added for blkcg_policy_data.

    - blk-throttle's async percpu allocation is replaced with direct
    allocation.

    - all methods now take blk[c]g_policy_data instead of blkcg_gq or
    blkcg.

    And finally, series 4 is a set of patches cleaning up the blkcg stats
    handling:

    blkcg's stats have always been somewhat of a mess. This patchset
    tries to improve the situation a bit.

    - Patches were added to consolidate the blkcg entry point and
    blkg creation. This in itself is an improvement and helps with
    collecting common stats on bio issue.

    - per-blkg stats now accounted on bio issue rather than request
    completion so that bio based and request based drivers can behave
    the same way. The issue was spotted by Vivek.

    - cfq-iosched implements custom recursive stats and blk-throttle
    implements custom per-cpu stats. This patchset makes blkcg core
    support both by default.

    - cfq-iosched and blk-throttle keep track of the same stats
    multiple times. Unify them"

    * 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits)
    blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
    blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
    blkcg: implement interface for the unified hierarchy
    blkcg: misc preparations for unified hierarchy interface
    blkcg: separate out tg_conf_updated() from tg_set_conf()
    blkcg: move body parsing from blkg_conf_prep() to its callers
    blkcg: mark existing cftypes as legacy
    blkcg: rename subsystem name from blkio to io
    blkcg: refine error codes returned during blkcg configuration
    blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
    blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
    blkcg: remove cfqg_stats->sectors
    blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
    blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
    blkcg: make blkcg_[rw]stat per-cpu
    blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
    blkcg: consolidate blkg creation in blkcg_bio_issue_check()
    blk-throttle: improve queue bypass handling
    blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
    blkcg: inline [__]blkg_lookup()
    ...

    Linus Torvalds
     
  • Merge third patch-bomb from Andrew Morton:

    - even more of the rest of MM

    - lib/ updates

    - checkpatch updates

    - small changes to a few scruffy filesystems

    - kmod fixes/cleanups

    - kexec updates

    - a dma-mapping cleanup series from hch

    * emailed patches from Andrew Morton: (81 commits)
    dma-mapping: consolidate dma_set_mask
    dma-mapping: consolidate dma_supported
    dma-mapping: cosolidate dma_mapping_error
    dma-mapping: consolidate dma_{alloc,free}_noncoherent
    dma-mapping: consolidate dma_{alloc,free}_{attrs,coherent}
    mm: use vma_is_anonymous() in create_huge_pmd() and wp_huge_pmd()
    mm: make sure all file VMAs have ->vm_ops set
    mm, mpx: add "vm_flags_t vm_flags" arg to do_mmap_pgoff()
    mm: mark most vm_operations_struct const
    namei: fix warning while make xmldocs caused by namei.c
    ipc: convert invalid scenarios to use WARN_ON
    zlib_deflate/deftree: remove bi_reverse()
    lib/decompress_unlzma: Do a NULL check for pointer
    lib/decompressors: use real out buf size for gunzip with kernel
    fs/affs: make root lookup from blkdev logical size
    sysctl: fix int -> unsigned long assignments in INT_MIN case
    kexec: export KERNEL_IMAGE_SIZE to vmcoreinfo
    kexec: align crash_notes allocation to make it be inside one physical page
    kexec: remove unnecessary test in kimage_alloc_crash_control_pages()
    kexec: split kexec_load syscall from kexec core code
    ...

    Linus Torvalds
     
  • With two exceptions (drm/qxl and drm/radeon) all vm_operations_struct
    structs should be constant.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Oleg Nesterov
    Cc: "H. Peter Anvin"
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Fix the following warnings:

    Warning(.//fs/namei.c:2422): No description found for parameter 'nd'
    Warning(.//fs/namei.c:2422): Excess function parameter 'nameidata'
    description in 'path_mountpoint'

    Signed-off-by: Masanari Iida
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masanari Iida
     
  • This patch resolves https://bugzilla.kernel.org/show_bug.cgi?id=16531.

    When logical blkdev size > 512 then sector numbers become larger than the
    device can support.

    Make affs start lookup based on the device's logical sector size instead
    of 512.

    Reported-by: Mark
    Suggested-by: Mark
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pranay Kr. Srivastava
     
  • This introduces a new helper and switches current users to use it. All
    patches are compile tested. kmemleak is tested via its own test suite.

    This patch (of 6):

    The new seq_hex_dump() is a complete analogue of print_hex_dump().

    We already have a few users of this functionality. The helper allows
    them to reduce their code.

    Signed-off-by: Andy Shevchenko
    Cc: Alexander Viro
    Cc: Joe Perches
    Cc: Tadeusz Struk
    Cc: Helge Deller
    Cc: Ingo Tuchscherer
    Cc: Catalin Marinas
    Cc: Vladimir Kondratiev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • On a filesystem like vfat, all files are created with the same owner
    and mode independent of who created the file. When a vfat filesystem
    is mounted with root as owner of all files and read access for everyone,
    root's processes left world-readable coredumps on it (but other
    users' processes only left empty corefiles when given write access
    because of the uid mismatch).

    Given that the old behavior was inconsistent and insecure, I don't see
    a problem with changing it. Now, all processes refuse to dump core unless
    the resulting corefile will only be readable by their owner.

    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • It was possible for an attacking user to trick root (or another user) into
    writing his coredumps into an attacker-readable, pre-existing file using
    rename() or link(), causing the disclosure of secret data from the victim
    process' virtual memory. Depending on the configuration, it was also
    possible to trick root into overwriting system files with coredumps. Fix
    that issue by never writing coredumps into existing files.

    Requirements for the attack:
    - The attack only applies if the victim's process has a nonzero
    RLIMIT_CORE and is dumpable.
    - The attacker can trick the victim into coredumping into an
    attacker-writable directory D, either because the core_pattern is
    relative and the victim's cwd is attacker-writable or because an
    absolute core_pattern pointing to a world-writable directory is used.
    - The attacker has one of these:
    A: on a system with protected_hardlinks=0:
    execute access to a folder containing a victim-owned,
    attacker-readable file on the same partition as D, and the
    victim-owned file will be deleted before the main part of the attack
    takes place. (In practice, there are lots of files that fulfill
    this condition, e.g. entries in Debian's /var/lib/dpkg/info/.)
    This does not apply to most Linux systems because most distros set
    protected_hardlinks=1.
    B: on a system with protected_hardlinks=1:
    execute access to a folder containing a victim-owned,
    attacker-readable and attacker-writable file on the same partition
    as D, and the victim-owned file will be deleted before the main part
    of the attack takes place.
    (This seems to be uncommon.)
    C: on any system, independent of protected_hardlinks:
    write access to a non-sticky folder containing a victim-owned,
    attacker-readable file on the same partition as D
    (This seems to be uncommon.)

    The basic idea is that the attacker moves the victim-owned file to where
    he expects the victim process to dump its core. The victim process dumps
    its core into the existing file, and the attacker reads the coredump from
    it.

    If the attacker can't move the file because he does not have write access
    to the containing directory, he can instead link the file to a directory
    he controls, then wait for the original link to the file to be deleted
    (because the kernel checks that the link count of the corefile is 1).

    A less reliable variant that requires D to be non-sticky works with link()
    and does not require deletion of the original link: link() the file into
    D, but then unlink() it directly before the kernel performs the link count
    check.

    On systems with protected_hardlinks=0, this variant allows an attacker to
    not only gain information from coredumps, but also clobber existing,
    victim-writable files with coredumps. (This could theoretically lead to a
    privilege escalation.)
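
    The "never write into an existing file" rule has a direct userspace
    analogue in open(2) with O_CREAT | O_EXCL. The sketch below is not the
    kernel change itself; create_fresh_dump() is a hypothetical helper, and
    the 0600 mode is an assumption chosen to keep the file private to its
    owner.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a dump file that is guaranteed to be new.
 * With O_CREAT | O_EXCL, open() fails with EEXIST if the name already
 * exists, including when it is a symlink or a hard link planted earlier.
 * Returns a file descriptor, or -1 with errno set.
 */
static int create_fresh_dump(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
}
```

    A file moved or linked into place by another user before the dump then
    causes a failure rather than being written to.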

    Signed-off-by: Jann Horn
    Cc: Kees Cook
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Fix B-tree corruption when a new record is inserted at position 0 in the
    node in hfs_brec_insert().

    This is an identical change to the corresponding hfs b-tree code to Sergei
    Antonov's "hfsplus: fix B-tree corruption after insertion at position 0",
    to keep similar code paths in the hfs and hfsplus drivers in sync, where
    appropriate.

    Signed-off-by: Hin-Tak Leung
    Cc: Sergei Antonov
    Cc: Joe Perches
    Reviewed-by: Vyacheslav Dubeyko
    Cc: Anton Altaparmakov
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hin-Tak Leung
     
  • Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
    hfs_bnode_find() for finding or creating pages corresponding to an inode)
    are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
    and should not be page_cache_release()'ed until hfs_bnode_free().

    This patch fixes a problem I first saw in July 2012: merely running "du"
    on a large hfsplus-mounted directory a few times on a reasonably loaded
    system would get the hfsplus driver all confused and complaining about
    B-tree inconsistencies, and generate a "BUG: Bad page state". Most
    recently, I can generate this problem on up-to-date Fedora 22 with shipped
    kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller
    mounts) and "du /mnt" simultaneously on two windows, where /mnt is a
    lightly-used QEMU VM image of the full Mac OS X 10.9:

    $ df -i / /home /mnt
    Filesystem Inodes IUsed IFree IUse% Mounted on
    /dev/mapper/fedora-root 3276800 551665 2725135 17% /
    /dev/mapper/fedora-home 52879360 716221 52163139 2% /home
    /dev/nbd0p2 4294967295 1387818 4293579477 1% /mnt

    After applying the patch, I was able to run "du /" (60+ times) and "du
    /mnt" (150+ times) continuously and simultaneously for 6+ hours.

    There are many reports of the hfsplus driver getting confused under load
    and generating "BUG: Bad page state" or other similar issues over the
    years. [1]

    The unpatched code [2] has always been wrong since it entered the kernel
    tree. The only reason why it gets away with it is that the
    kmap/memcpy/kunmap follow very quickly after the page_cache_release() so
    the kernel has not had a chance to reuse the memory for something else,
    most of the time.

    The current RW driver appears to have followed the design and development
    of the earlier read-only hfsplus driver [3], whereby version 0.1 (Dec
    2001) had a B-tree node-centric approach to
    read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
    migrating towards version 0.2 (June 2002) of caching and releasing pages
    per inode extents. When the current RW code first entered the kernel [2]
    in 2005, there was an REF_PAGES conditional (and "//" commented out code)
    to switch between B-node centric paging and inode-centric paging. There
    was a mistake with the direction of one of the REF_PAGES conditionals in
    __hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the
    read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
    removed, but a page_cache_release() was mistakenly left in (propagating
    the "REF_PAGES !REF_PAGE" mistake), and the commented-out
    page_cache_release() in bnode_release() (which should be spanned by
    !REF_PAGES) was never enabled.

    References:
    [1]:
    Michael Fox, Apr 2013
    http://www.spinics.net/lists/linux-fsdevel/msg63807.html
    ("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")

    Sasha Levin, Feb 2015
    http://lkml.org/lkml/2015/2/20/85 ("use after free")

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
    https://bugzilla.kernel.org/show_bug.cgi?id=42342
    https://bugzilla.kernel.org/show_bug.cgi?id=63841
    https://bugzilla.kernel.org/show_bug.cgi?id=78761

    [2]:
    http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
    fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db
    commit d1081202f1d0ee35ab0beb490da4b65d4bc763db
    Author: Andrew Morton
    Date: Wed Feb 25 16:17:36 2004 -0800

    [PATCH] HFS rewrite

    http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
    fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd

    commit 91556682e0bf004d98a529bf829d339abb98bbbd
    Author: Andrew Morton
    Date: Wed Feb 25 16:17:48 2004 -0800

    [PATCH] HFS+ support

    [3]:
    http://sourceforge.net/projects/linux-hfsplus/

    http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
    http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/

    http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
    fs/hfsplus/bnode.c?r1=1.4&r2=1.5

    Date: Thu Jun 6 09:45:14 2002 +0000
    Use buffer cache instead of page cache in bnode.c. Cache inode extents.

    [4]:
    http://git.kernel.org/cgit/linux/kernel/git/\
    stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d

    commit a5e3985fa014029eb6795664c704953720cc7f7d
    Author: Roman Zippel
    Date: Tue Sep 6 15:18:47 2005 -0700

    [PATCH] hfs: remove debug code

    Signed-off-by: Hin-Tak Leung
    Signed-off-by: Sergei Antonov
    Reviewed-by: Anton Altaparmakov
    Reported-by: Sasha Levin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Vyacheslav Dubeyko
    Cc: Sougata Santra
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hin-Tak Leung
     
  • Dan Carpenter discovered a buffer overflow in the Coda file system
    readlink code. A userspace file system daemon can return a 4096 byte
    result which then triggers a one byte write past the allocated readlink
    result buffer.
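
    The off-by-one can be illustrated with a small userspace sketch.
    LINK_BUF_SIZE and copy_link_target() are hypothetical names for this
    example and do not come from the Coda code; the check shown is one way
    to reject a reply that leaves no room for the terminating NUL.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define LINK_BUF_SIZE 4096  /* hypothetical buffer size for this sketch */

/*
 * Copy a daemon-supplied link target into buf and NUL-terminate it.
 * A reply of exactly buf_size bytes leaves no room for the terminator,
 * so it is rejected instead of writing to buf[buf_size].
 */
static int copy_link_target(char *buf, size_t buf_size,
                            const char *reply, size_t reply_len)
{
    if (reply_len >= buf_size)
        return -ENAMETOOLONG;

    memcpy(buf, reply, reply_len);
    buf[reply_len] = '\0';
    return 0;
}
```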

    This does not trigger with an unmodified Coda implementation because Coda
    has a 1024 byte limit for symbolic links, however other userspace file
    systems using the Coda kernel module could be affected.

    Although this is an obvious overflow, I don't think this has to be handled
    as too sensitive from a security perspective because the overflow is on
    the Coda userspace daemon side which already needs root to open Coda's
    kernel device and to mount the file system before we get to the point that
    links can be read.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Jan Harkes
    Reported-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Harkes
     
  • Convert from manual allocation/copy_from_user/... to kstrto*() family
    which were designed for exactly that.

    One case cannot be converted to kstrto*_from_user() to make the code
    even simpler because of whitespace stripping, oh well...
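
    For readers unfamiliar with that style of helper, here is a userspace
    analogue built on strtol(). parse_int_strict() is a hypothetical name
    and this is not the kernel's implementation; it only mirrors the idea
    of one call that parses the whole string and reports errors through
    its return value.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse the whole string as an int. Returns 0 and stores the value in
 * *res on success, or a negative errno value on failure.
 */
static int parse_int_strict(const char *s, int base, int *res)
{
    char *end;
    long val;

    errno = 0;
    val = strtol(s, &end, base);
    if (end == s)
        return -EINVAL;   /* no digits at all */
    if (*end == '\n')
        end++;            /* tolerate a single trailing newline */
    if (*end != '\0')
        return -EINVAL;   /* trailing junk such as "12abc" */
    if (errno == ERANGE || val < INT_MIN || val > INT_MAX)
        return -ERANGE;   /* value does not fit in an int */

    *res = (int)val;
    return 0;
}
```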

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The proc_subdir_lock spinlock is used to allow only one task to make
    changes to the proc directory structure as well as looking up information
    in it. However, the information lookup part can actually be entered by
    more than one task as the pde_get() and pde_put() reference count update
    calls in the critical sections are atomic increment and decrement
    respectively and so are safe with concurrent updates.

    The x86 architecture already uses qrwlock which is fair and other
    architectures like ARM are in the process of switching to qrwlock. So
    unfairness shouldn't be a concern in that conversion.

    This patch changed the proc_subdir_lock to a rwlock in order to enable
    concurrent lookup. The following functions were modified to take a
    write lock:
    - proc_register()
    - remove_proc_entry()
    - remove_proc_subtree()

    The following functions were modified to take a read lock:
    - xlate_proc_name()
    - proc_lookup_de()
    - proc_readdir_de()

    A parallel /proc filesystem search with the "find" command (1000 threads)
    was run on a 4-socket Haswell-EX box (144 threads). Before the patch, the
    parallel search took about 39s. After the patch, the parallel find took
    only 25s, a saving of about 14s.

    The micro-benchmark that I used was artificial, but it was used to
    reproduce an exit hanging problem that I saw in a real application. In
    fact, allowing only one task to do a lookup seems too limiting to me.

    Signed-off-by: Waiman Long
    Acked-by: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Cc: Nicolas Dichtel
    Cc: Al Viro
    Cc: Scott J Norton
    Cc: Douglas Hatch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Currently, /proc//map_files/ is restricted to CAP_SYS_ADMIN, and is
    only exposed if CONFIG_CHECKPOINT_RESTORE is set.

    Each mapped file region gets a symlink in /proc//map_files/
    corresponding to the virtual address range at which it is mapped. The
    symlinks work like the symlinks in /proc//fd/, so you can follow them
    to the backing file even if that backing file has been unlinked.
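
    A small sketch of how a tool might turn one of those symlink names
    back into an address range. It assumes the entry names have the form
    "<start>-<end>" in hexadecimal, as described in proc(5); that format is
    not spelled out in this message, and parse_map_files_name() is a
    hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse a map_files entry name of the form "<start>-<end>" (hexadecimal)
 * into the address range it describes.
 * Returns 0 on success, -1 if the name does not match or the range is empty.
 */
static int parse_map_files_name(const char *name,
                                unsigned long *start, unsigned long *end)
{
    char trailing;

    /* The extra %c makes names with junk after the range count as 3 items. */
    if (sscanf(name, "%lx-%lx%c", start, end, &trailing) != 2)
        return -1;

    return (*start < *end) ? 0 : -1;
}
```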

    Currently, files which are mapped, unlinked, and closed are impossible to
    stat() from userspace. Exposing /proc//map_files/ closes this
    functionality "hole".

    Not being able to stat() such files makes noticing and explicitly
    accounting for the space they use on the filesystem impossible. You can
    work around this by summing up the space used by every file in the
    filesystem and subtracting that total from what statfs() tells you, but
    that obviously isn't great, and it becomes unworkable once your filesystem
    becomes large enough.

    This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and
    adjusts the permissions enforced on it as follows:

    * proc_map_files_lookup()
    * proc_map_files_readdir()
    * map_files_d_revalidate()

    Remove the CAP_SYS_ADMIN restriction, leaving only the current
    restriction requiring PTRACE_MODE_READ. The information made
    available to userspace by these three functions is already
    available in /proc/PID/maps with MODE_READ, so I don't see any
    reason to limit them any further (see below for more detail).

    * proc_map_files_follow_link()

    This stub has been added, and requires that the user have
    CAP_SYS_ADMIN in order to follow the links in map_files/,
    since there was concern on LKML both about the potential for
    bypassing permissions on ancestor directories in the path to
    files pointed to, and about what happens with more exotic
    memory mappings created by some drivers (ie dma-buf).

    In older versions of this patch, I changed every permission check in
    the four functions above to enforce MODE_ATTACH instead of MODE_READ.
    This was an oversight on my part, and after revisiting the discussion
    it seems that nobody was concerned about anything outside of what is
    made possible by ->follow_link(). So in this version, I've left the
    checks for PTRACE_MODE_READ as-is.

    [akpm@linux-foundation.org: catch up with concurrent proc_pid_follow_link() changes]
    Signed-off-by: Calvin Owens
    Reviewed-by: Kees Cook
    Cc: Andy Lutomirski
    Cc: Cyrill Gorcunov
    Cc: Joe Perches
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Calvin Owens
     
  • Reading/writing a /proc/kpage* file may take a long time on machines
    with a lot of RAM installed.

    Signed-off-by: Vladimir Davydov
    Suggested-by: Andres Lagar-Cavilla
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • As noted by Minchan, a benefit of reading the idle flag from /proc/kpageflags
    is that one can easily filter dirty and/or unevictable pages while
    estimating the size of unused memory.

    Note that idle flag read from /proc/kpageflags may be stale in case the
    page was accessed via a PTE, because it would be too costly to iterate
    over all page mappings on each /proc/kpageflags read to provide an
    up-to-date value. To make sure the flag is up-to-date one has to read
    /sys/kernel/mm/page_idle/bitmap first.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Knowing the portion of memory that is not used by a certain application or
    memory cgroup (idle memory) can be useful for partitioning the system
    efficiently, e.g. by setting memory cgroup limits appropriately.
    Currently, the only means to estimate the amount of idle memory provided
    by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
    access bit for all pages mapped to a particular process by writing 1 to
    clear_refs, wait for some time, and then count smaps:Referenced. However,
    this method has two serious shortcomings:

    - it does not count unmapped file pages
    - it affects the reclaimer logic

    To overcome these drawbacks, this patch introduces two new page flags,
    Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
    A page's Idle flag can only be set from userspace by setting a bit in
    /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
    and it is cleared whenever the page is accessed either through page tables
    (it is cleared in page_referenced() in this case) or using the read(2)
    system call (mark_page_accessed()). Thus by setting the Idle flag for
    pages of a particular workload, which can be found e.g. by reading
    /proc/PID/pagemap, waiting for some time to let the workload access its
    working set, and then reading the bitmap file, one can estimate the amount
    of pages that are not used by the workload.
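
    As a rough illustration of "the offset corresponding to the page", the
    sketch below maps a PFN to a position in the bitmap file. It assumes the
    layout given in the kernel's idle page tracking documentation (an array of
    8-byte words in native byte order, one bit per PFN); that layout is not
    spelled out in this message, and the helper names are hypothetical.

```python
import struct

WORD_BITS = 64  # assumption: the bitmap is an array of 8-byte words


def idle_bitmap_position(pfn):
    """Return (byte offset of the 8-byte word, bit index inside it) for a PFN."""
    return (pfn // WORD_BITS) * 8, pfn % WORD_BITS


def is_idle(word_bytes, bit):
    """Test one bit in an 8-byte word read from the bitmap (native byte order)."""
    (word,) = struct.unpack("=Q", word_bytes)
    return bool((word >> bit) & 1)
```

    A reader would seek to the returned byte offset, read 8 bytes, and pass
    them to is_idle() together with the bit index.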

    The Young page flag is used to avoid interference with the memory
    reclaimer. A page's Young flag is set whenever the Access bit of a page
    table entry pointing to the page is cleared by writing to the bitmap file.
    If page_referenced() is called on a Young page, it will add 1 to its
    return value, therefore concealing the fact that the Access bit was
    cleared.
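
    The interplay of the two flags can be pictured with a toy model. This is
    only a sketch of the behaviour described above, not kernel code: ToyPage
    and its methods are hypothetical, the page has a single mapping, and
    clearing Young inside page_referenced() is an assumption.

```python
class ToyPage:
    """Hypothetical stand-in for a page with one mapping; not kernel code."""

    def __init__(self):
        self.accessed = False  # Access bit of the single page table entry
        self.idle = False
        self.young = False

    def touch(self):
        # The workload accesses the page through its page table entry.
        self.accessed = True

    def mark_idle(self):
        # Writing the page's bit to the bitmap file: set Idle and clear the
        # Access bit, remembering a cleared Access bit in the Young flag.
        self.idle = True
        if self.accessed:
            self.accessed = False
            self.young = True

    def page_referenced(self):
        # Reclaimer-side check: count the Access bit, add 1 for a Young page.
        refs = 0
        if self.accessed:
            self.accessed = False
            self.idle = False
            refs += 1
        if self.young:
            self.young = False  # assumption: Young is consumed here
            refs += 1
        return refs
```

    In this model a page that was touched before being marked idle still
    reports one reference to the reclaimer, so the idle tracking write does
    not make the page look colder than it is.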

    Note, since there is no room for extra page flags on 32 bit, this feature
    uses extended page flags when compiled on 32 bit.

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: kpageidle requires an MMU]
    [akpm@linux-foundation.org: decouple from page-flags rework]
    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • /proc/kpagecgroup contains a 64-bit inode number of the memory cgroup each
    page is charged to, indexed by PFN. Having this information is useful for
    estimating a cgroup working set size.

    The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.
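
    A minimal sketch of a PFN-indexed lookup, assuming one native-endian
    64-bit value per page frame as described above. The helper name is
    hypothetical, and reading the real file normally requires root.

```python
import io
import struct

ENTRY_SIZE = 8  # one 64-bit value per page frame


def read_cgroup_inode(f, pfn):
    """Return the memory cgroup inode number recorded for the given PFN."""
    f.seek(pfn * ENTRY_SIZE)
    data = f.read(ENTRY_SIZE)
    if len(data) != ENTRY_SIZE:
        raise ValueError("PFN is past the end of the file")
    (ino,) = struct.unpack("=Q", data)
    return ino


# Self-contained demonstration on an in-memory stand-in for the proc file.
fake = io.BytesIO(struct.pack("=3Q", 0, 42, 7))
inode_for_pfn_1 = read_cgroup_inode(fake, 1)
```

    With the real file, f would come from open("/proc/kpagecgroup", "rb").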

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

10 Sep, 2015

3 commits

  • Followup to the UFS series - with the way we clear the new blocks (via
    buffer cache, possibly on more than a page worth of file) we really
    should not insert a reference to a new block into the inode block tree
    until after we'd cleared it.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     
  • Pull cifs updates from Steve French:
    "Small cifs fix and a patch for improved debugging"

    * 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
    cifs: Fix use-after-free on mid_q_entry
    Update cifs version number
    Add way to query server fs info for smb3

    Linus Torvalds
     
  • As part of the v4.3 merge window the DAX code was updated by Matthew and
    Kirill to handle PMD pages. Also as part of the v4.3 merge window we
    updated the DAX code to do proper PMEM flushing (commit 2765cfbb342c:
    "dax: update I/O path to do proper PMEM flushing").

    The additional code added by the DAX PMD patches also needs to be
    updated to properly use the PMEM API. This ensures that after a PMD
    fault is handled the zeros written to the newly allocated pages are
    durable on the DIMMs.

    linux/dax.h is included to get rid of a bunch of sparse warnings.

    Signed-off-by: Ross Zwisler
    Cc: Matthew Wilcox
    Cc: Dan Williams
    Cc: Kirill Shutemov
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

09 Sep, 2015

10 commits

  • When readahead encounters file holes, the OSD reply returns -ENOENT and
    finish_read() skips adding pages to the page cache, so readahead does
    not work for file holes. The fix is to add zero pages to the page cache
    when -ENOENT is returned.

    Signed-off-by: Yan, Zheng

    Yan, Zheng
     
  • Signed-off-by: Yan, Zheng

    Yan, Zheng
     
  • Merge second patch-bomb from Andrew Morton:
    "Almost all of the rest of MM. There was an unusually large amount of
    MM material this time"

    * emailed patches from Andrew Morton: (141 commits)
    zpool: remove no-op module init/exit
    mm: zbud: constify the zbud_ops
    mm: zpool: constify the zpool_ops
    mm: swap: zswap: maybe_preload & refactoring
    zram: unify error reporting
    zsmalloc: remove null check from destroy_handle_cache()
    zsmalloc: do not take class lock in zs_shrinker_count()
    zsmalloc: use class->pages_per_zspage
    zsmalloc: consider ZS_ALMOST_FULL as migrate source
    zsmalloc: partial page ordering within a fullness_list
    zsmalloc: use shrinker to trigger auto-compaction
    zsmalloc: account the number of compacted pages
    zsmalloc/zram: introduce zs_pool_stats api
    zsmalloc: cosmetic compaction code adjustments
    zsmalloc: introduce zs_can_compact() function
    zsmalloc: always keep per-class stats
    zsmalloc: drop unused variable `nr_to_migrate'
    mm/memblock.c: fix comment in __next_mem_range()
    mm/page_alloc.c: fix type information of memoryless node
    memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
    ...

    Linus Torvalds
     
  • Pull regmap updates from Mark Brown:
    "This has been a busy release for regmap.

    By far the biggest set of changes here are those from Markus Pargmann
    which implement support for block transfers in smbus devices. This
    required quite a bit of refactoring but leaves us better able to
    handle odd restrictions that controllers may have and with better
    performance on smbus.

    Other new features include:

    - Fix interactions with lockdep for nested regmaps (eg, when a device
    using regmap is connected to a bus where the bus controller has a
    separate regmap). Lockdep's default class identification is too
    crude to work without help.

    - Support for must write bitfield operations, useful for operations
    which require writing a bit to trigger them from Kuninori Morimoto.

    - Support for delaying during register patch application from Nariman
    Poushin.

    - Support for overriding cache state via the debugfs implementation
    from Richard Fitzgerald"

    * tag 'regmap-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: (25 commits)
    regmap: fix a NULL pointer dereference in __regmap_init
    regmap: Support bulk reads for devices without raw formatting
    regmap-i2c: Add smbus i2c block support
    regmap: Add raw_write/read checks for max_raw_write/read sizes
    regmap: regmap max_raw_read/write getter functions
    regmap: Introduce max_raw_read/write for regmap_bulk_read/write
    regmap: Add missing comments about struct regmap_bus
    regmap: No multi_write support if bus->write does not exist
    regmap: Split use_single_rw internally into use_single_read/write
    regmap: Fix regmap_bulk_write for bus writes
    regmap: regmap_raw_read return error on !bus->read
    regulator: core: Print at debug level on debugfs creation failure
    regmap: Fix regmap_can_raw_write check
    regmap: fix typos in regmap.c
    regmap: Fix integertypes for register address and value
    regmap: Move documentation to regmap.h
    regmap: Use different lockdep class for each regmap init call
    thermal: sti: Add parentheses around bridge->ops->regmap_init call
    mfd: vexpress: Add parentheses around bridge->ops->regmap_init call
    regmap: debugfs: Fix misuse of IS_ENABLED
    ...

    Linus Torvalds
     
  • This is based on the shmem version, but it has diverged quite a bit. We
    have no swap to worry about, nor the new file sealing. Add
    synchronization via the fault mutex table to coordinate page faults,
    fallocate allocation and fallocate hole punch.

    What this allows us to do is move physical memory in and out of a
    hugetlbfs file without having it mapped. This also gives us the ability
    to support MADV_REMOVE since it is currently implemented using
    fallocate(). MADV_REMOVE lets madvise() remove pages from the middle of
    a hugetlbfs file, which wasn't possible before.

    hugetlbfs fallocate only operates on whole huge pages.
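
    One way to picture the whole-huge-page restriction for hole punch is to
    shrink the requested byte range inward to huge page boundaries. The
    sketch below assumes that rounding policy; the message itself only says
    that whole huge pages are operated on, and the helper name is
    hypothetical.

```python
def whole_huge_pages(offset, length, hpage_size):
    """Return (start, end) of the whole huge pages inside [offset, offset+length).

    Returns None when the range does not cover even one complete huge page.
    """
    start = -(-offset // hpage_size) * hpage_size          # round up
    end = ((offset + length) // hpage_size) * hpage_size   # round down
    return (start, end) if end > start else None
```

    Under this assumption, a request that starts or ends in the middle of a
    huge page leaves that partially covered huge page untouched.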

    Based on code by Dave Hansen.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Modify truncate_hugepages() to take a range of pages (start, end)
    instead of simply start. If an end value of LLONG_MAX is passed, the
    current "truncate" functionality is maintained. Existing callers are
    modified to pass LLONG_MAX as end of range. By keying off end ==
    LLONG_MAX, the routine behaves differently for truncate and hole punch.
    Page removal is now synchronized with page allocation via faults by
    using the fault mutex table. The hole punch case can experience the
    rare region_del error and must handle accordingly.

    Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in
    the case where region_del returns an error.

    Since the routine handles more than just the truncate case, it is
    renamed to remove_inode_hugepages(). To be consistent, the routine
    truncate_huge_page() is renamed remove_huge_page().

    Downstream of remove_inode_hugepages(), the routine
    hugetlb_unreserve_pages() is also modified to take a range of pages.
    hugetlb_unreserve_pages is modified to detect an error from region_del and
    pass it back to the caller.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • fallocate hole punch will want to unmap a specific range of pages.
    Modify the existing hugetlb_vmtruncate_list() routine to take a
    start/end range. If end is 0, this indicates all pages after start
    should be unmapped. This is the same as the existing truncate
    functionality. Modify existing callers to add 0 as end of range.

    Since the routine will be used in hole punch as well as truncate
    operations, it is more appropriately renamed to hugetlb_vmdelete_list().

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • We want to know the per-process working set size for smart memory
    management in userland, and we use swap (e.g. zram) heavily to maximize
    memory efficiency, so the working set includes swap as well as RSS.

    On such a system, if there are lots of shared anonymous pages (e.g. on
    Android), it is really hard to figure out exactly how much memory each
    process consumes (i.e. rss + swap).

    This patch introduces a SwapPss field in /proc/<pid>/smaps so we can get
    a more exact working set size per process.

    Bongkyu tested it. Result is below.

    1. 50M used swap
    SwapTotal: 461976 kB
    SwapFree: 411192 kB

    $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
    48236
    $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
    141184

    2. 240M used swap
    SwapTotal: 461976 kB
    SwapFree: 216808 kB

    $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
    230315
    $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
    1387744
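
    The grep/awk pipelines above can also be written as a small script. This
    is only a sketch: sum_smaps_field is a hypothetical helper, and the
    sample text uses made-up values rather than output from the tested
    device.

```python
def sum_smaps_field(smaps_text, field):
    """Sum the kB values of every '<field>:' line in smaps-formatted text."""
    total = 0
    for line in smaps_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == field + ":":
            total += int(parts[1])
    return total


# Made-up sample: two mappings, each with a Swap and a SwapPss line.
sample = (
    "Swap:                  8 kB\n"
    "SwapPss:               4 kB\n"
    "Swap:                 16 kB\n"
    "SwapPss:               2 kB\n"
)
swap_kb = sum_smaps_field(sample, "Swap")
swap_pss_kb = sum_smaps_field(sample, "SwapPss")
```

    Matching on the exact field name keeps "Swap:" and "SwapPss:" lines
    apart, just as the two grep patterns do.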

    [akpm@linux-foundation.org: simplify kunmap_atomic() call]
    Signed-off-by: Minchan Kim
    Reported-by: Bongkyu Kim
    Tested-by: Bongkyu Kim
    Cc: Hugh Dickins
    Cc: Sergey Senozhatsky
    Cc: Jonathan Corbet
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch sets bit 56 in pagemap if the page is mapped only once. This
    allows detecting exclusively used pages without exposing the PFN:

    present  file  exclusive  state
    0        0     0          non-present
    1        1     0          file page mapped somewhere else
    1        1     1          file page mapped only here
    1        0     0          anon non-CoWed page (shared with parent/child)
    1        0     1          anon CoWed page (or never forked)

    CoWed pages in (MAP_FILE | MAP_PRIVATE) areas are anon in this context.

    The MMap-exclusive bit doesn't reflect potential page sharing via the
    swap cache: a page could be mapped once but have several swap PTEs
    pointing to it. An application could detect that from the swap bit in the
    pagemap entry and touch that PTE via /proc/pid/mem to get the real
    information.
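
    The state table above can be turned into a small decoder for one 64-bit
    pagemap entry. Only bit 56 comes from this message; the positions of the
    present, swap and file bits (63, 62, 61) are taken from the kernel's
    pagemap documentation and should be treated as assumptions here, and
    describe() is a hypothetical helper.

```python
PM_MMAP_EXCLUSIVE = 1 << 56  # bit 56, as described in this message
PM_FILE = 1 << 61            # assumption: file-page / shared-anon bit
PM_SWAP = 1 << 62            # assumption: swapped bit
PM_PRESENT = 1 << 63         # assumption: present bit


def describe(entry):
    """Map a pagemap entry to the state column of the table above."""
    if not entry & PM_PRESENT:
        return "non-present"
    exclusive = bool(entry & PM_MMAP_EXCLUSIVE)
    if entry & PM_FILE:
        if exclusive:
            return "file page mapped only here"
        return "file page mapped somewhere else"
    if exclusive:
        return "anon CoWed page (or never forked)"
    return "anon non-CoWed page (shared with parent/child)"
```

    An entry would typically be read as 8 bytes from /proc/pid/pagemap at
    offset (virtual address / page size) * 8.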

    See http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com

    Requested by Mark Williamson.

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Mark Williamson
    Tested-by: Mark Williamson
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This patch makes pagemap readable for normal users and hides physical
    addresses from them. For some use-cases PFN isn't required at all.

    See http://lkml.kernel.org/r/1425935472-17949-1-git-send-email-kirill@shutemov.name

    Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace")
    Signed-off-by: Konstantin Khlebnikov
    Cc: Naoya Horiguchi
    Reviewed-by: Mark Williamson
    Tested-by: Mark Williamson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov