20 Dec, 2014

1 commit


19 Dec, 2014

1 commit

  • This patch includes CMA info (CmaTotal, CmaFree) in /proc/meminfo.
    Currently, in a CMA enabled system, if somebody wants to know the total
    CMA size declared, there is no way to tell, other than the dmesg or
    /var/log/messages logs.

    With this patch we are showing the CMA info as part of meminfo, so that it
    can be determined at any point of time. This will be populated only when
    CMA is enabled.

    Below is the sample output from an ARM-based device with RAM: 512MB and CMA: 16MB.

    MemTotal: 471172 kB
    MemFree: 111712 kB
    MemAvailable: 271172 kB
    .
    .
    .
    CmaTotal: 16384 kB
    CmaFree: 6144 kB
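
As an illustration of how a tool might consume the new fields, here is a minimal userspace sketch in C. The helper names `meminfo_kb()` and `read_meminfo()` are hypothetical and not part of the patch; the sketch only assumes the "Key: value kB" line layout shown in the sample output above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: look up "key" (for example "CmaTotal") in a buffer
 * holding the text of /proc/meminfo. Returns the value in kB, or -1 if the
 * key is not present (for example when CMA is not enabled).
 */
long meminfo_kb(const char *text, const char *key)
{
    size_t klen;
    const char *line;

    assert(text != NULL && key != NULL);
    klen = strlen(key);

    for (line = text; line != NULL && *line != '\0'; ) {
        long val;

        /* A match is "<key>:" at the very start of a line. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':' &&
            sscanf(line + klen + 1, "%ld", &val) == 1)
            return val;

        line = strchr(line, '\n');
        if (line != NULL)
            line++;             /* step past the newline */
    }
    return -1;
}

/* Hypothetical helper: read /proc/meminfo into buf as a C string. */
int read_meminfo(char *buf, size_t size)
{
    FILE *f;
    size_t n;

    assert(buf != NULL && size > 0);
    f = fopen("/proc/meminfo", "r");
    if (f == NULL)
        return -1;
    n = fread(buf, 1, size - 1, f);
    buf[n] = '\0';
    fclose(f);
    return 0;
}
```

A caller would fill a buffer with `read_meminfo()` and then ask for `meminfo_kb(buf, "CmaTotal")`; a result of -1 suggests the kernel was built without CMA or predates this patch.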

    This patch also fixes the checkpatch errors below, which were found during these changes.

    ERROR: space required after that ',' (ctx:ExV)
    199: FILE: fs/proc/meminfo.c:199:
    + ,atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10)
    ^

    ERROR: space required after that ',' (ctx:ExV)
    202: FILE: fs/proc/meminfo.c:202:
    + ,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
    ^

    ERROR: space required after that ',' (ctx:ExV)
    206: FILE: fs/proc/meminfo.c:206:
    + ,K(totalcma_pages)
    ^

    total: 3 errors, 0 warnings, 2 checks, 236 lines checked

    Signed-off-by: Pintu Kumar
    Signed-off-by: Vishnu Pratap Singh
    Acked-by: Michal Nazarewicz
    Cc: Rafael Aquini
    Cc: Jerome Marchand
    Cc: Marek Szyprowski
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pintu Kumar
     

18 Dec, 2014

1 commit

  • Pull user namespace related fixes from Eric Biederman:
    "As these are bug fixes almost all of these changes are marked for
    backporting to stable.

    The first change (implicitly adding MNT_NODEV on remount) addresses a
    regression that was created when security issues with unprivileged
    remount were closed. I go on to update the remount test to make it
    easy to detect if this issue reoccurs.

    Then there are a handful of mount and umount related fixes.

    Then half of the changes deal with a recently discovered design
    bug in the permission checks of gid_map. Unix since the beginning has
    allowed setting group permissions on files to less than the user and
    other permissions (aka ---rwx---rwx). As the unix permission checks
    stop as soon as a group matches, and setgroups allows setting groups
    that can not later be dropped, this results in a situation where it is
    possible to legitimately use a group to assign fewer privileges to a
    process. Which means dropping a group can increase a process's
    privileges.

    The fix I have adopted is that gid_map is now no longer writable
    without privilege unless the new file /proc/self/setgroups has been
    set to permanently disable setgroups.

    The bulk of user namespace using applications, even the applications
    using user namespaces without privilege, remain
    unaffected by this change. Unfortunately this fix breaks a couple of user
    space applications that were relying on the problematic behavior (one
    of which was tools/selftests/mount/unprivileged-remount-test.c).

    To hopefully prevent needing a regression fix on top of my security
    fix I rounded up folks who work with the container implementations most
    likely to be affected and encouraged them to test the changes.

    > So far nothing broke on my libvirt-lxc test bed. :-)
    > Tested with openSUSE 13.2 and libvirt 1.2.9.
    > Tested-by: Richard Weinberger

    > Tested on Fedora20 with libvirt 1.2.11, works fine.
    > Tested-by: Chen Hanxiao

    > Ok, thanks - yes, unprivileged lxc is working fine with your kernels.
    > Just to be sure I was testing the right thing I also tested using
    > my unprivileged nsexec testcases, and they failed on setgroup/setgid
    > as now expected, and succeeded there without your patches.
    > Tested-by: Serge Hallyn

    > I tested this with Sandstorm. It breaks as is and it works if I add
    > the setgroups thing.
    > Tested-by: Andy Lutomirski # breaks things as designed :("
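
The gid_map design bug described in the pull message is easier to see with a small model of the classic permission check. The following C sketch is a simplification written for illustration only: `may_access()` is a hypothetical function, and it ignores capabilities, ACLs and security modules. It shows that exactly one class (owner, else group, else other) is consulted, so a process that drops a matching group falls through to the "other" bits.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Simplified model of the classic Unix permission check (an assumption
 * for illustration, not the kernel's code). "want" is a 3-bit rwx mask:
 * 4 = read, 2 = write, 1 = execute.
 */
int may_access(mode_t mode, uid_t file_uid, gid_t file_gid,
               uid_t uid, const gid_t *groups, size_t ngroups,
               unsigned int want)
{
    unsigned int bits;
    size_t i;

    assert(groups != NULL || ngroups == 0);

    if (uid == file_uid) {
        bits = (mode >> 6) & 7;             /* owner class */
    } else {
        bits = mode & 7;                    /* default: "other" class */
        for (i = 0; i < ngroups; i++) {
            if (groups[i] == file_gid) {
                bits = (mode >> 3) & 7;     /* a group match stops the search */
                break;
            }
        }
    }
    return (bits & want) == want;
}
```

With a file of mode 0707 owned by group 200, a process that is a member of group 200 is denied read access, while the same process without that group is allowed. That is why letting an unprivileged process shed groups through setgroups can raise its privileges.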

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Unbreak the unprivileged remount tests
    userns; Correct the comment in map_write
    userns: Allow setting gid_maps without privilege when setgroups is disabled
    userns: Add a knob to disable setgroups on a per user namespace basis
    userns: Rename id_map_mutex to userns_state_mutex
    userns: Only allow the creator of the userns unprivileged mappings
    userns: Check euid no fsuid when establishing an unprivileged uid mapping
    userns: Don't allow unprivileged creation of gid mappings
    userns: Don't allow setgroups until a gid mapping has been setablished
    userns: Document what the invariant required for safe unprivileged mappings.
    groups: Consolidate the setgroups permission checks
    mnt: Clear mnt_expire during pivot_root
    mnt: Carefully set CL_UNPRIVILEGED in clone_mnt
    mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers.
    umount: Do not allow unmounting rootfs.
    umount: Disallow unprivileged mount force
    mnt: Update unprivileged remount test
    mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount

    Linus Torvalds
     

17 Dec, 2014

1 commit

  • Pull vfs pile #2 from Al Viro:
    "Next pile (and there'll be one or two more).

    The large piece in this one is getting rid of /proc/*/ns/* weirdness;
    among other things, it allows to (finally) make nameidata completely
    opaque outside of fs/namei.c, making for easier further cleanups in
    there"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coda_venus_readdir(): use file_inode()
    fs/namei.c: fold link_path_walk() call into path_init()
    path_init(): don't bother with LOOKUP_PARENT in argument
    fs/namei.c: new helper (path_cleanup())
    path_init(): store the "base" pointer to file in nameidata itself
    make default ->i_fop have ->open() fail with ENXIO
    make nameidata completely opaque outside of fs/namei.c
    kill proc_ns completely
    take the targets of /proc/*/ns/* symlinks to separate fs
    bury struct proc_ns in fs/proc
    copy address of proc_ns_ops into ns_common
    new helpers: ns_alloc_inum/ns_free_inum
    make proc_ns_operations work with struct ns_common * instead of void *
    switch the rest of proc_ns_operations to working with &...->ns
    netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
    make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
    common object embedded into various struct ....ns

    Linus Torvalds
     

13 Dec, 2014

1 commit

  • Since the rework of the sparse interrupt code to actually free the
    unused interrupt descriptors there exists a race between the /proc
    interfaces to the irq subsystem and the code which frees the interrupt
    descriptor.

    CPU0                            CPU1
                                    show_interrupts()
                                      desc = irq_to_desc(X);
    free_desc(desc)
      remove_from_radix_tree();
      kfree(desc);
                                      raw_spinlock_irq(&desc->lock);

    /proc/interrupts is the only interface which can actively corrupt
    kernel memory via the lock access. /proc/stat can only read from freed
    memory. Extremely hard to trigger, but possible.

    The interfaces in /proc/irq/N/ are not affected by this because the
    removal of the proc file is serialized in procfs against concurrent
    readers/writers. The removal happens before the descriptor is freed.

    For architectures which have CONFIG_SPARSE_IRQ=n this is a non issue
    as the descriptor is never freed. It's merely cleared out with the irq
    descriptor lock held. So any concurrent proc access will either see
    the old correct value or the cleared out ones.

    Protect the lookup and access to the irq descriptor in
    show_interrupts() with the sparse_irq_lock.

    Provide kstat_irqs_usr() which is protecting the lookup and access
    with sparse_irq_lock and switch /proc/stat to use it.

    Document the existing kstat_irqs interfaces so it's clear that the
    caller needs to take care about protection. The users of these
    interfaces are either not affected due to SPARSE_IRQ=n or already
    protected against removal.
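
The shape of the fix (hold one lock across both the lookup and the access, and take the same lock when freeing) can be sketched in userspace C. This is only an analogy under stated assumptions: the `slots` array and `slots_lock` mutex are hypothetical stand-ins for the descriptor tree and sparse_irq_lock, and none of the function names below exist in the kernel.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define NR_SLOTS 16

struct desc {
    unsigned long count;
};

/* Hypothetical stand-ins for the descriptor tree and sparse_irq_lock. */
static struct desc *slots[NR_SLOTS];
static pthread_mutex_t slots_lock = PTHREAD_MUTEX_INITIALIZER;

/* Publish a descriptor in slot i; returns 0 on success, -1 on failure. */
int desc_alloc(unsigned int i, unsigned long count)
{
    struct desc *d;

    assert(i < NR_SLOTS);
    d = malloc(sizeof(*d));
    if (d == NULL)
        return -1;
    d->count = count;

    pthread_mutex_lock(&slots_lock);
    free(slots[i]);                 /* replace any previous descriptor */
    slots[i] = d;
    pthread_mutex_unlock(&slots_lock);
    return 0;
}

/* Writer side: unpublish and free while holding the lock. */
void desc_free(unsigned int i)
{
    assert(i < NR_SLOTS);
    pthread_mutex_lock(&slots_lock);
    free(slots[i]);
    slots[i] = NULL;
    pthread_mutex_unlock(&slots_lock);
}

/*
 * Reader side: the lookup and the dereference both happen under the
 * lock, so the descriptor cannot be freed in between.
 */
unsigned long desc_count(unsigned int i)
{
    unsigned long sum = 0;

    assert(i < NR_SLOTS);
    pthread_mutex_lock(&slots_lock);
    if (slots[i] != NULL)
        sum = slots[i]->count;
    pthread_mutex_unlock(&slots_lock);
    return sum;
}
```

The racy variant would fetch `slots[i]` first and lock only afterwards, which is the window shown in the CPU0/CPU1 diagram above.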

    Fixes: 1f5a5b87f78f "genirq: Implement a sane sparse_irq allocator"
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     

12 Dec, 2014

1 commit

  • - Expose the knob to user space through a proc file /proc/<pid>/setgroups

    A value of "deny" means the setgroups system call is disabled in the
    current process's user namespace and can not be enabled in the
    future in this user namespace.

    A value of "allow" means the setgroups system call is enabled.

    - Descendant user namespaces inherit the value of setgroups from
    their parents.

    - A proc file is used (instead of a sysctl) as sysctls currently do
    not allow checking the permissions at open time.

    - Writing to the proc file is restricted to before the gid_map
    for the user namespace is set.

    This ensures that disabling setgroups at a user namespace
    level will never remove the ability to call setgroups
    from a process that already has that ability.

    A process may opt in to the setgroups disable for itself by
    creating, entering and configuring a user namespace or by calling
    setns on an existing user namespace with setgroups disabled.
    Processes without privileges already can not call setgroups so this
    is a noop. Processes with privilege become processes without
    privilege when entering a user namespace and as with any other path
    to dropping privilege they would not have the ability to call
    setgroups. So this remains within the bounds of what is possible
    without a knob to disable setgroups permanently in a user namespace.
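
A rough userspace sketch of how a process might use the knob is shown below. It assumes the caller has already entered a fresh user namespace (for example via unshare(CLONE_NEWUSER)) and follows the ordering described above: write "deny" to the setgroups file first, then write gid_map. The helper names `write_str()`, `format_gid_map()` and `map_own_gid()` are hypothetical; the "inside outside count" layout of a gid_map line is the usual one for ID maps.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a string to an existing file; returns 0 on success, -1 on error. */
int write_str(const char *path, const char *s)
{
    size_t len;
    ssize_t n;
    int fd;

    assert(path != NULL && s != NULL);
    len = strlen(s);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    n = write(fd, s, len);
    close(fd);
    return (n == (ssize_t)len) ? 0 : -1;
}

/* Format a single-entry map line: "<inside> <outside> 1". */
int format_gid_map(char *buf, size_t size,
                   unsigned int inside, unsigned int outside)
{
    int n;

    assert(buf != NULL && size > 0);
    n = snprintf(buf, size, "%u %u 1", inside, outside);
    return (n > 0 && (size_t)n < size) ? 0 : -1;
}

/*
 * Intended for a process that has just entered a new user namespace:
 * permanently disable setgroups, then map one gid without privilege.
 */
int map_own_gid(unsigned int inside, unsigned int outside)
{
    char line[64];

    if (write_str("/proc/self/setgroups", "deny") != 0)
        return -1;
    if (format_gid_map(line, sizeof(line), inside, outside) != 0)
        return -1;
    return write_str("/proc/self/gid_map", line);
}
```

Reversing the two writes would not work under the rules above, since the setgroups file can only be written before gid_map is set.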

    Cc: stable@vger.kernel.org
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

11 Dec, 2014

14 commits

  • Merge first patchbomb from Andrew Morton:
    - a few minor cifs fixes
    - dma-debug updates
    - ocfs2
    - slab
    - about half of MM
    - procfs
    - kernel/exit.c
    - panic.c tweaks
    - printk updates
    - lib/ updates
    - checkpatch updates
    - fs/binfmt updates
    - the drivers/rtc tree
    - nilfs
    - kmod fixes
    - more kernel/exit.c
    - various other misc tweaks and fixes

    * emailed patches from Andrew Morton: (190 commits)
    exit: pidns: fix/update the comments in zap_pid_ns_processes()
    exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
    exit: exit_notify: re-use "dead" list to autoreap current
    exit: reparent: call forget_original_parent() under tasklist_lock
    exit: reparent: avoid find_new_reaper() if no children
    exit: reparent: introduce find_alive_thread()
    exit: reparent: introduce find_child_reaper()
    exit: reparent: document the ->has_child_subreaper checks
    exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper()
    exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting
    exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
    exit: proc: don't try to flush /proc/tgid/task/tgid
    exit: release_task: fix the comment about group leader accounting
    exit: wait: drop tasklist_lock before psig->c* accounting
    exit: wait: don't use zombie->real_parent
    exit: wait: cleanup the ptrace_reparented() checks
    usermodehelper: kill the kmod_thread_locker logic
    usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper()
    fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp
    nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
    ...

    Linus Torvalds
     
  • Al Viro
     
  • procfs inodes need only the ns_ops part; nsfs inodes don't need it at all

    Signed-off-by: Al Viro

    Al Viro
     
  • New pseudo-filesystem: nsfs. Targets of /proc/*/ns/* live there now.
    It's not mountable (not even registered, so it's not in /proc/filesystems,
    etc.). Files on it *are* bindable - we explicitly permit that in do_loopback().

    This stuff lives in fs/nsfs.c now; proc_ns_fget() moved there as well.
    get_proc_ns() is a macro now (it's simply returning ->i_private; would
    have been an inline, if not for header ordering headache).
    proc_ns_inode() is an ex-parrot. The interface used in procfs is
    ns_get_path(path, task, ops) and ns_get_name(buf, size, task, ops).

    Dentries and inodes are never hashed; a non-counting reference to dentry
    is stashed in ns_common (removed by ->d_prune()) and reused by ns_get_path()
    if present. See ns_get_path()/ns_prune_dentry/nsfs_evict() for details
    of that mechanism.

    As the result, proc_ns_follow_link() has stopped poking in nd->path.mnt;
    it does nd_jump_link() on a consistent pair it gets
    from ns_get_path().
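
From userspace, the nsfs targets are usually observed by reading the /proc/<pid>/ns/* symlinks. The C sketch below assumes the commonly seen link text of the form "type:[inum]" (for example "net:[4026531956]"); that format is an assumption here, since the commit message does not spell it out. Both helper names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Split a link target of the assumed form "type:[inum]" into its parts.
 * Returns 0 on success, -1 if the text does not match.
 */
int parse_ns_target(const char *target, char *type, size_t type_size,
                    unsigned long *inum)
{
    const char *colon;
    size_t len;
    char end;

    assert(target != NULL && type != NULL && inum != NULL);
    colon = strchr(target, ':');
    if (colon == NULL)
        return -1;
    len = (size_t)(colon - target);
    if (len == 0 || len >= type_size)
        return -1;
    /* Expect ":[<number>]" right after the type name. */
    if (sscanf(colon, ":[%lu%c", inum, &end) != 2 || end != ']')
        return -1;
    memcpy(type, target, len);
    type[len] = '\0';
    return 0;
}

/* Read the namespace link "name" (e.g. "net") of the calling process. */
int read_ns_target(const char *name, char *buf, size_t size)
{
    char path[64];
    ssize_t n;

    assert(name != NULL && buf != NULL && size > 0);
    if (snprintf(path, sizeof(path), "/proc/self/ns/%s", name) >=
        (int)sizeof(path))
        return -1;
    n = readlink(path, buf, size - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}
```

Two processes share a namespace when both the type and the inode number of their links match.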

    Signed-off-by: Al Viro

    Al Viro
     
  • proc_flush_task_mnt() always tries to flush task/pid, but this is
    pointless if we reap the leader. d_invalidate() is recursive, and
    if nothing else the next d_hash_and_lookup(tgid) should fail anyway.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Rik van Riel
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • p->ptrace != 0 means that release_task(p) was not called, so pid_alive()
    buys nothing and we can remove this check. Other callers already use it
    directly without additional checks.

    Note: with or without this patch ptrace_parent() can return the pointer to
    the freed task, this will be explained/fixed later.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • task_state() does seq_printf() under rcu_read_lock(), but this is only
    needed for task_tgid_nr_ns() and task_numa_group_id(). We can calculate
    tgid/ngid and drop rcu lock.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • 1. The usage of fdt looks very ugly, it can't be NULL if ->files is
    not NULL. We can use "unsigned int max_fds" instead.

    2. This also allows to move seq_printf(max_fds) outside of task_lock()
    and join it with the previous seq_printf(). See also the next patch.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • task_state() reads cred->group_info under task_lock() because long ago
    it was task_struct->group_info and it was actually protected by
    task->alloc_lock. Today this task_unlock() after rcu_read_unlock() just
    adds confusion; move task_unlock() up.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Better to use the existing macros than to rewrite them.

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Dichtel
     
  • proc_register() error paths are leaking inodes and directory refcounts.

    Signed-off-by: Debabrata Banerjee
    Cc: Alexander Viro
    Acked-by: Nicolas Dichtel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Debabrata Banerjee
     
  • When a lot of netdevices are created, one of the bottlenecks is the
    creation of proc entries. This series aims to accelerate this part.

    The current implementation for the directories in /proc is using a single
    linked list. This is slow when handling directories with large numbers of
    entries (eg netdevice-related entries when lots of tunnels are opened).

    This patch replaces this linked list by a red-black tree.

    Here are some numbers:

    dummy30000.batch contains 30 000 times 'link add type dummy'.

    Before the patch:
    $ time ip -b dummy30000.batch
    real 2m31.950s
    user 0m0.440s
    sys 2m21.440s
    $ time rmmod dummy
    real 1m35.764s
    user 0m0.000s
    sys 1m24.088s

    After the patch:
    $ time ip -b dummy30000.batch
    real 2m0.874s
    user 0m0.448s
    sys 1m49.720s
    $ time rmmod dummy
    real 1m13.988s
    user 0m0.000s
    sys 1m1.008s

    The idea of improving this part was suggested by Thierry Herbelot.
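
To illustrate the data-structure change in userspace terms, the sketch below keeps directory entry names in a search tree through the POSIX `tsearch()`/`tfind()` interface instead of walking a linked list. This is not the kernel's rbtree code: POSIX only promises a binary search tree, and whether it is balanced depends on the C library (glibc's implementation is, to my knowledge, a red-black tree). The wrappers `dir_add()` and `dir_has()` are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <search.h>
#include <string.h>

/* Order entries by name; both arguments are NUL-terminated strings. */
static int name_cmp(const void *a, const void *b)
{
    return strcmp((const char *)a, (const char *)b);
}

/*
 * Insert "name" into the tree rooted at *root. The tree stores the
 * pointer itself, so the string must outlive the tree.
 * Returns 0 on success, -1 if the node could not be allocated.
 */
int dir_add(void **root, const char *name)
{
    assert(root != NULL && name != NULL);
    return tsearch(name, root, name_cmp) != NULL ? 0 : -1;
}

/* Return 1 if "name" is present in the tree, 0 otherwise. */
int dir_has(void *const *root, const char *name)
{
    assert(root != NULL && name != NULL);
    return tfind(name, root, name_cmp) != NULL;
}
```

With a balanced tree, each lookup or insert among n entries costs on the order of log n comparisons rather than n, which is the effect the timings above measure for 30 000 dummy devices.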

    [akpm@linux-foundation.org: initialise proc_root.subdir at compile time]
    Signed-off-by: Nicolas Dichtel
    Acked-by: David S. Miller
    Cc: Thierry Herbelot
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Dichtel
     
  • Like the small zero page, the huge zero page should not be accounted in
    the smaps report as a normal page.

    For small pages we rely on vm_normal_page() to filter out the zero page, but
    vm_normal_page() is not designed to handle pmds. We only get here due to a
    hackish cast of pmd to pte in smaps_pte_range() -- pte and pmd formats are not
    necessarily compatible on each and every architecture.

    Let's add a separate codepath to handle pmds. follow_trans_huge_pmd() will
    detect the huge zero page for us.

    We would need a pmd_dirty() helper to do this properly. The patch adds it
    to THP-enabled architectures which don't yet have one.

    [akpm@linux-foundation.org: use do_div to fix 32-bit build]
    Signed-off-by: "Kirill A. Shutemov"
    Reported-by: Fengguang Wu
    Tested-by: Fengwei Yin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Pull VFS changes from Al Viro:
    "First pile out of several (there _definitely_ will be more). Stuff in
    this one:

    - unification of d_splice_alias()/d_materialize_unique()

    - iov_iter rewrite

    - killing a bunch of ->f_path.dentry users (and f_dentry macro).

    Getting that completed will make life much simpler for
    unionmount/overlayfs, since then we'll be able to limit the places
    sensitive to file _dentry_ to reasonably few. Which allows to have
    file_inode(file) pointing to inode in a covered layer, with dentry
    pointing to (negative) dentry in union one.

    Still not complete, but much closer now.

    - crapectomy in lustre (dead code removal, mostly)

    - "let's make seq_printf return nothing" preparations

    - assorted cleanups and fixes

    There _definitely_ will be more piles"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    copy_from_iter_nocache()
    new helper: iov_iter_kvec()
    csum_and_copy_..._iter()
    iov_iter.c: handle ITER_KVEC directly
    iov_iter.c: convert copy_to_iter() to iterate_and_advance
    iov_iter.c: convert copy_from_iter() to iterate_and_advance
    iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
    iov_iter.c: convert iov_iter_zero() to iterate_and_advance
    iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
    iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
    iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
    iov_iter.c: iterate_and_advance
    iov_iter.c: macros for iterating over iov_iter
    kill f_dentry macro
    dcache: fix kmemcheck warning in switch_names
    new helper: audit_file()
    nfsd_vfs_write(): use file_inode()
    ncpfs: use file_inode()
    kill f_dentry uses
    lockd: get rid of ->f_path.dentry->d_sb
    ...

    Linus Torvalds
     

05 Dec, 2014

3 commits


20 Nov, 2014

2 commits

  • …git/rostedt/linux-trace into for-next

    Pull the beginning of seq_file cleanup from Steven:
    "I'm looking to clean up the seq_file code and to eventually merge the
    trace_seq code with seq_file as well, since they basically do the same thing.

    Part of this process is to remove the return code of seq_printf() and friends
    as they are rather inconsistent. It is better to use the new function
    seq_has_overflowed() if you want to stop processing when the buffer
    is full. Note, if the buffer is full, the seq_file code will throw away
    the contents, allocate a bigger buffer, and then call your code again
    to fill in the data. The only thing that breaking out of the function
    early does is to save a little time which is probably never noticed.

    I started with patches from Joe Perches and modified them as well.
    There's many more places that need to be updated before we can convert
    seq_printf() and friends to return void. But this patch set introduces
    the seq_has_overflowed() and does some initial updates."
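
The retry behaviour described in the pull message can be modelled in userspace C. The sketch below is an analogy only: `struct sbuf`, `sbuf_printf()`, `sbuf_has_overflowed()`, `sbuf_render()` and `show_numbers()` are hypothetical names that mimic the idea of a print function returning void, an explicit overflow query, and a caller that discards the output and retries with a bigger buffer.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

struct sbuf {
    char *buf;
    size_t size;
    size_t count;       /* bytes used; count == size means "overflowed" */
};

/* Append formatted text; returns nothing, like the planned seq_printf(). */
void sbuf_printf(struct sbuf *s, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(s != NULL && fmt != NULL);
    if (s->count >= s->size)
        return;                         /* already overflowed: do nothing */
    va_start(ap, fmt);
    n = vsnprintf(s->buf + s->count, s->size - s->count, fmt, ap);
    va_end(ap);
    if (n < 0 || (size_t)n >= s->size - s->count)
        s->count = s->size;             /* mark the buffer as overflowed */
    else
        s->count += (size_t)n;
}

int sbuf_has_overflowed(const struct sbuf *s)
{
    assert(s != NULL);
    return s->count == s->size;
}

/*
 * Call show() with a growing buffer until the output fits. Returns a
 * malloc()ed string the caller must free, or NULL on allocation failure.
 */
char *sbuf_render(void (*show)(struct sbuf *), size_t initial)
{
    struct sbuf s = { NULL, initial, 0 };

    assert(show != NULL && initial > 0);
    for (;;) {
        s.buf = malloc(s.size);
        if (s.buf == NULL)
            return NULL;
        s.buf[0] = '\0';
        s.count = 0;
        show(&s);
        if (!sbuf_has_overflowed(&s))
            return s.buf;
        free(s.buf);                    /* throw the contents away ... */
        s.size *= 2;                    /* ... and retry with more room */
    }
}

/* Example show() callback: prints the digits 0..9, one per line. */
void show_numbers(struct sbuf *s)
{
    int i;

    for (i = 0; i < 10; i++)
        sbuf_printf(s, "%d\n", i);
}
```

A show() callback could test `sbuf_has_overflowed()` inside its loop and return early, but as the message notes, that only saves a little time because the whole output is regenerated anyway.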

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     

18 Nov, 2014

1 commit

  • MPX-enabled applications using large swaths of memory can
    potentially have large numbers of bounds tables in process
    address space to save bounds information. These tables can take
    up huge swaths of memory (as much as 80% of the memory on the
    system) even if we clean them up aggressively. In the worst-case
    scenario, the tables can be 4x the size of the data structure
    being tracked. IOW, a 1-page structure can require 4 bounds-table
    pages.

    Being this huge, our expectation is that folks using MPX are
    going to be keen on figuring out how much memory is being
    dedicated to it. So we need a way to track memory use for MPX.

    If we want to specifically track MPX VMAs we need to be able to
    distinguish them from normal VMAs, and keep them from getting
    merged with normal VMAs. A new VM_ flag set only on MPX VMAs does
    both of those things. With this flag, MPX bounds-table VMAs can
    be distinguished from other VMAs, and userspace can also walk
    /proc/$pid/smaps to get memory usage for MPX.

    In addition to this flag, we also introduce a special ->vm_ops
    specific to MPX VMAs (see the patch "add MPX specific mmap
    interface"), but currently different ->vm_ops do not by
    themselves prevent VMA merging, so we still need this flag.

    We understand that VM_ flags are scarce and are open to other
    options.
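
As a sketch of the userspace side, the C function below sums the "Size:" of every VMA in smaps text whose "VmFlags:" line carries a given two-letter flag. The function name is hypothetical, and the mnemonic used for MPX VMAs (believed to be "mp") is an assumption to verify against the kernel's proc documentation; the parser only relies on the usual smaps layout, where each VMA's "Size:" line precedes its "VmFlags:" line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sum the Size (in kB) of all VMAs in "smaps" whose VmFlags line
 * contains "flag" as a whole space-separated token.
 */
long smaps_kb_with_flag(const char *smaps, const char *flag)
{
    const char *line = smaps;
    size_t flen;
    long size_kb = 0, total = 0;

    assert(smaps != NULL && flag != NULL);
    flen = strlen(flag);
    assert(flen > 0);

    while (*line != '\0') {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        if (strncmp(line, "Size:", 5) == 0) {
            /* Remember the size of the VMA currently being described. */
            if (sscanf(line + 5, "%ld", &size_kb) != 1)
                size_kb = 0;
        } else if (strncmp(line, "VmFlags:", 8) == 0) {
            const char *p = line + 8;
            const char *end = line + len;

            while (p < end) {
                const char *tok;

                while (p < end && *p == ' ')
                    p++;                /* skip separators */
                tok = p;
                while (p < end && *p != ' ')
                    p++;                /* find the end of the token */
                if ((size_t)(p - tok) == flen &&
                    strncmp(tok, flag, flen) == 0) {
                    total += size_kb;
                    break;
                }
            }
        }
        line = eol ? eol + 1 : line + len;
    }
    return total;
}
```

Feeding it the contents of /proc/$pid/smaps and the MPX flag would give the total virtual size of the bounds-table VMAs; summing "Rss:" instead would be the natural variation for resident memory.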

    Signed-off-by: Qiaowei Ren
    Signed-off-by: Dave Hansen
    Cc: linux-mm@kvack.org
    Cc: linux-mips@linux-mips.org
    Cc: Dave Hansen
    Link: http://lkml.kernel.org/r/20141114151825.565625B3@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Qiaowei Ren
     

06 Nov, 2014

1 commit

  • Callers of the seq_printf functions shouldn't really check the return
    value. An occasional seq_has_overflowed() check is used instead.

    Update vfs documentation.

    Link: http://lkml.kernel.org/p/e37e6e7b76acbdcc3bb4ab2a57c8f8ca1ae11b9a.1412031505.git.joe@perches.com

    Cc: David S. Miller
    Cc: Al Viro
    Signed-off-by: Joe Perches
    [ did a few clean ups ]
    Signed-off-by: Steven Rostedt

    Joe Perches
     

14 Oct, 2014

1 commit

  • For VMAs that don't want write notifications, PTEs created for read faults
    have their write bit set. If the read fault happens after VM_SOFTDIRTY is
    cleared, then the PTE's softdirty bit will remain clear after subsequent
    writes.

    Here's a simple code snippet to demonstrate the bug:

    char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
    assert(*m == '\0'); /* new PTE allows write access */
    assert(!soft_dirty(m));
    *m = 'x'; /* should dirty the page */
    assert(soft_dirty(m)); /* fails */

    With this patch, write notifications are enabled when VM_SOFTDIRTY is
    cleared. Furthermore, to avoid unnecessary faults, write notifications
    are disabled when VM_SOFTDIRTY is set.

    As a side effect of enabling and disabling write notifications with
    care, this patch fixes a bug in mprotect where vm_page_prot bits set by
    drivers were zapped on mprotect. An analogous bug was fixed in mmap by
    commit c9d0bf241451 ("mm: uncached vma support with writenotify").
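
The snippet in this message calls a soft_dirty() helper that it does not define. One possible sketch, assuming the soft-dirty state is exposed as bit 55 of the 64-bit /proc/self/pagemap entry for the page, is given below. The names `pagemap_offset()`, `pme_soft_dirty()` and this particular `soft_dirty()` are illustrative and not taken from the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Assumed position of the soft-dirty flag in a pagemap entry. */
#define PAGEMAP_SOFTDIRTY (UINT64_C(1) << 55)

/* Byte offset in pagemap of the 64-bit entry that describes addr. */
off_t pagemap_offset(uintptr_t addr, long page_size)
{
    assert(page_size > 0);
    return (off_t)(addr / (uintptr_t)page_size) * (off_t)sizeof(uint64_t);
}

/* Test the soft-dirty bit of one pagemap entry. */
int pme_soft_dirty(uint64_t pme)
{
    return (pme & PAGEMAP_SOFTDIRTY) != 0;
}

/* Returns 1 if the page at addr is soft-dirty, 0 if clean, -1 on error. */
int soft_dirty(const void *addr)
{
    uint64_t pme;
    ssize_t n;
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if (fd < 0)
        return -1;
    n = pread(fd, &pme, sizeof(pme),
              pagemap_offset((uintptr_t)addr, sysconf(_SC_PAGESIZE)));
    close(fd);
    if (n != (ssize_t)sizeof(pme))
        return -1;
    return pme_soft_dirty(pme);
}
```

Because the helper returns -1 on error, a test built on it should compare against 0 and 1 explicitly rather than treating the result as a boolean.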

    Signed-off-by: Peter Feiner
    Reported-by: Peter Feiner
    Suggested-by: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Jamie Liu
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Bjorn Helgaas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     

13 Oct, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "The big thing in this pile is Eric's unmount-on-rmdir series; we
    finally have everything we need for that. The final piece of prereqs
    is delayed mntput() - now filesystem shutdown always happens on
    shallow stack.

    Other than that, we have several new primitives for iov_iter (Matt
    Wilcox, culled from his XIP-related series) pushing the conversion to
    ->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c
    cleanups and fixes (including the external name refcounting, which
    gives consistent behaviour of d_move() wrt procfs symlinks for long
    and short names alike) and assorted cleanups and fixes all over the
    place.

    This is just the first pile; there's a lot of stuff from various
    people that ought to go in this window. Starting with
    unionmount/overlayfs mess... ;-/"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits)
    fs/file_table.c: Update alloc_file() comment
    vfs: Deduplicate code shared by xattr system calls operating on paths
    reiserfs: remove pointless forward declaration of struct nameidata
    don't need that forward declaration of struct nameidata in dcache.h anymore
    take dname_external() into fs/dcache.c
    let path_init() failures treated the same way as subsequent link_path_walk()
    fix misuses of f_count() in ppp and netlink
    ncpfs: use list_for_each_entry() for d_subdirs walk
    vfs: move getname() from callers to do_mount()
    gfs2_atomic_open(): skip lookups on hashed dentry
    [infiniband] remove pointless assignments
    gadgetfs: saner API for gadgetfs_create_file()
    f_fs: saner API for ffs_sb_create_file()
    jfs: don't hash direct inode
    [s390] remove pointless assignment of ->f_op in vmlogrdr ->open()
    ecryptfs: ->f_op is never NULL
    android: ->f_op is never NULL
    nouveau: __iomem misannotations
    missing annotation in fs/file.c
    fs: namespace: suppress 'may be used uninitialized' warnings
    ...

    Linus Torvalds
     

10 Oct, 2014

11 commits

  • Pull cgroup updates from Tejun Heo:
    "Nothing too interesting. Just a handful of cleanup patches"

    * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    Revert "cgroup: remove redundant variable in cgroup_mount()"
    cgroup: remove redundant variable in cgroup_mount()
    cgroup: fix missing unlock in cgroup_release_agent()
    cgroup: remove CGRP_RELEASABLE flag
    perf/cgroup: Remove perf_put_cgroup()
    cgroup: remove redundant check in cgroup_ino()
    cpuset: simplify proc_cpuset_show()
    cgroup: simplify proc_cgroup_show()
    cgroup: use a per-cgroup work for release agent
    cgroup: remove bogus comments
    cgroup: remove redundant code in cgroup_rmdir()
    cgroup: remove some useless forward declarations
    cgroup: fix a typo in comment.

    Linus Torvalds
     
  • Always mark pages with PageBalloon even if balloon compaction is disabled
    and expose this mark in /proc/kpageflags as KPF_BALLOON.

    Also this patch adds three counters into /proc/vmstat: "balloon_inflate",
    "balloon_deflate" and "balloon_migrate". They accumulate balloon
    activity. Current size of balloon is (balloon_inflate - balloon_deflate)
    pages.

    All generic balloon code is now gathered under the option CONFIG_MEMORY_BALLOON.
    It should be selected by a ballooning driver which wants to use this feature.
    Currently virtio-balloon is the only user.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Rafael Aquini
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • If a /proc/pid/pagemap read spans a [VMA, an unmapped region, then a
    VM_SOFTDIRTY VMA], the virtual pages in the unmapped region are reported
    as softdirty. Here's a program to demonstrate the bug:

    #include <assert.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
        const uint64_t PAGEMAP_SOFTDIRTY = 1ul << 55;
        uint64_t pme[3];
        int fd = open("/proc/self/pagemap", O_RDONLY);
        char *m = mmap(NULL, 3 * getpagesize(), PROT_READ,
                       MAP_ANONYMOUS | MAP_SHARED, -1, 0);
        munmap(m + getpagesize(), getpagesize());
        pread(fd, pme, 24, (unsigned long) m / getpagesize() * 8);
        assert(pme[0] & PAGEMAP_SOFTDIRTY); /* passes */
        assert(!(pme[1] & PAGEMAP_SOFTDIRTY)); /* fails */
        assert(pme[2] & PAGEMAP_SOFTDIRTY); /* passes */
        return 0;
    }

    (Note that all pages in new VMAs are softdirty until cleared).

    Tested:
    Used the program given above. I'm going to include this code in
    a selftest in the future.

    [n-horiguchi@ah.jp.nec.com: prevent pagemap_pte_range() from overrunning]
    Signed-off-by: Peter Feiner
    Cc: "Kirill A. Shutemov"
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Jamie Liu
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     
  • 9e7814404b77 "hold task->mempolicy while numa_maps scans." fixed the
    race with the exiting task but this is not enough.

    The current code assumes that get_vma_policy(task) should either see
    task->mempolicy == NULL or it should be equal to ->task_mempolicy saved
    by hold_task_mempolicy(), so we can never race with __mpol_put(). But
    this can only work if we can't race with do_set_mempolicy(), and thus
    we can't race with another do_set_mempolicy() or do_exit() after that.

    However, do_set_mempolicy()->down_write(mmap_sem) can not prevent this
    race. This task can exec, change its ->mm, and call do_set_mempolicy()
    after that; in this case they take 2 different locks.

    Change hold_task_mempolicy() to use get_task_policy(), it never returns
    NULL, and change show_numa_map() to use __get_vma_policy() or fall back
    to proc_priv->task_mempolicy.

    Note: this is the minimal fix, we will cleanup this code later. I think
    hold_task_mempolicy() and release_task_mempolicy() should die, we can
    move this logic into show_numa_map(). Or we can move get_task_policy()
    outside of ->mmap_sem and !CONFIG_NUMA code at least.

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • On some ARCHs the modules range is equal to the vmalloc range. E.g. on i686

    "#define MODULES_VADDR VMALLOC_START"
    "#define MODULES_END VMALLOC_END"

    This will cause 2 duplicate program segments in /proc/kcore, and no flag
    to indicate they are different. This is confusing. And usually people
    who need to check the ELF header or read the content of kcore will check
    memory ranges. Two program segments which are the same are unnecessary.

    So check if the modules range is equal to vmalloc range. If so, just skip
    adding the modules range.
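
The idea of skipping a range that duplicates one already registered can be sketched generically in C. This is not the kcore code: the patch compares the modules range against the vmalloc range directly, whereas the hypothetical `add_segment()` below checks a new range against every range already in a small list. All type and function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SEGMENTS 8

struct addr_range {
    unsigned long start;
    unsigned long end;
};

struct segment_list {
    struct addr_range seg[MAX_SEGMENTS];
    size_t count;
};

/* Two ranges are duplicates when both bounds match exactly. */
int same_range(const struct addr_range *a, const struct addr_range *b)
{
    assert(a != NULL && b != NULL);
    return a->start == b->start && a->end == b->end;
}

/*
 * Append r unless an identical range is already listed.
 * Returns 1 if added, 0 if skipped as a duplicate, -1 if the list is full.
 */
int add_segment(struct segment_list *list, struct addr_range r)
{
    size_t i;

    assert(list != NULL && r.start <= r.end);
    for (i = 0; i < list->count; i++)
        if (same_range(&list->seg[i], &r))
            return 0;
    if (list->count == MAX_SEGMENTS)
        return -1;
    list->seg[list->count++] = r;
    return 1;
}
```

On a configuration where the modules range equals the vmalloc range, the second registration returns 0 and only one program segment would be emitted.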

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Baoquan He
    Cc: Xishi Qiu
    Cc: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • - Rename vm_is_stack() to task_of_stack() and change it to return
    "struct task_struct *" rather than the global (and thus wrong in
    general) pid_t.

    - Add the new pid_of_stack() helper which calls task_of_stack() and
    uses the right namespace to report the correct pid_t.

    Unfortunately we need to define this helper twice, in task_mmu.c
    and in task_nommu.c. Perhaps it makes sense to add fs/proc/util.c
    and move at least pid_of_stack/task_of_stack there to avoid the
    code duplication.

    - Change show_map_vma() and show_numa_map() to use the new helper.

    Signed-off-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: Greg Ungerer
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • m_start() can use get_proc_task() instead, and "struct inode *"
    provides more potentially useful info, see the next changes.

    Signed-off-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: Greg Ungerer
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • I do not know if CONFIG_PREEMPT/SMP is possible without CONFIG_MMU
    but the usage of task->mm in m_stop() looks wrong. The task can exit/exec before
    we take mmap_sem, in this case m_stop() can hit NULL or unlock the
    wrong rw_semaphore.

    Also, this code uses priv->task != NULL to decide whether we need
    up_read/mmput. This is correct, but we will probably kill priv->task.
    Change m_start/m_stop to rely on IS_ERR_OR_NULL() like task_mmu.c does.

    Signed-off-by: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Acked-by: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Copy-and-paste the changes from "fs/proc/task_mmu.c: shift mm_access()
    from m_start() to proc_maps_open()" into task_nommu.c.

    Change maps_open() to initialize priv->mm using proc_mem_open(), m_start()
    can rely on atomic_inc_not_zero(mm_users) like task_mmu.c does.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Acked-by: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Cleanup and preparation. maps_open() can use __seq_open_private()
    like proc_maps_open() does.

    [akpm@linux-foundation.org: deuglify]
    Signed-off-by: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Acked-by: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change the main loop in m_start() to update m->version. Mostly for
    consistency, but this can help to avoid the same loop if the very
    1st ->show() fails due to seq_overflow().

    Signed-off-by: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov