13 Apr, 2014

2 commits

  • Pull vfs updates from Al Viro:
    "The first vfs pile, with deep apologies for being very late in this
    window.

    Assorted cleanups and fixes, plus a large preparatory part of iov_iter
    work. There's a lot more of that, but it'll probably go into the next
    merge window - it *does* shape up nicely, removes a lot of
    boilerplate, gets rid of locking inconsistencies between aio_write and
    splice_write and I hope to get Kent's direct-io rewrite merged into
    the same queue, but some of the stuff after this point is having
    (mostly trivial) conflicts with the things already merged into
    mainline and with some I want more testing.

    This one passes LTP and xfstests without regressions, in addition to
    usual beating. BTW, readahead02 in ltp syscalls testsuite has started
    giving failures since "mm/readahead.c: fix readahead failure for
    memoryless NUMA nodes and limit readahead pages" - might be a false
    positive, might be a real regression..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    missing bits of "splice: fix racy pipe->buffers uses"
    cifs: fix the race in cifs_writev()
    ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
    kill generic_file_buffered_write()
    ocfs2_file_aio_write(): switch to generic_perform_write()
    ceph_aio_write(): switch to generic_perform_write()
    xfs_file_buffered_aio_write(): switch to generic_perform_write()
    export generic_perform_write(), start getting rid of generic_file_buffer_write()
    generic_file_direct_write(): get rid of ppos argument
    btrfs_file_aio_write(): get rid of ppos
    kill the 5th argument of generic_file_buffered_write()
    kill the 4th argument of __generic_file_aio_write()
    lustre: don't open-code kernel_recvmsg()
    ocfs2: don't open-code kernel_recvmsg()
    drbd: don't open-code kernel_recvmsg()
    constify blk_rq_map_user_iov() and friends
    lustre: switch to kernel_sendmsg()
    ocfs2: don't open-code kernel_sendmsg()
    take iov_iter stuff to mm/iov_iter.c
    process_vm_access: tidy up a bit
    ...

    Linus Torvalds
     
  • Pull audit updates from Eric Paris.

    * git://git.infradead.org/users/eparis/audit: (28 commits)
    AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC
    audit: renumber AUDIT_FEATURE_CHANGE into the 1300 range
    audit: do not cast audit_rule_data pointers pointlesly
    AUDIT: Allow login in non-init namespaces
    audit: define audit_is_compat in kernel internal header
    kernel: Use RCU_INIT_POINTER(x, NULL) in audit.c
    sched: declare pid_alive as inline
    audit: use uapi/linux/audit.h for AUDIT_ARCH declarations
    syscall_get_arch: remove useless function arguments
    audit: remove stray newline from audit_log_execve_info() audit_panic() call
    audit: remove stray newlines from audit_log_lost messages
    audit: include subject in login records
    audit: remove superfluous new- prefix in AUDIT_LOGIN messages
    audit: allow user processes to log from another PID namespace
    audit: anchor all pid references in the initial pid namespace
    audit: convert PPIDs to the inital PID namespace.
    pid: get pid_t ppid of task in init_pid_ns
    audit: rename the misleading audit_get_context() to audit_take_context()
    audit: Add generic compat syscall support
    audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALL
    ...

    Linus Torvalds
     

08 Apr, 2014

10 commits

  • /proc/self/make-it-fail is a boolean, but accepts any number, including
    negative ones. Change variable to unsigned, and cap upper bound at 1.

    [akpm@linux-foundation.org: don't make make_it_fail unsigned]
    Signed-off-by: Dave Jones
    Reviewed-by: Akinobu Mita
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
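
    A minimal userspace sketch of the range check described above, assuming
    the value stays signed (per the akpm note) and anything outside 0..1 is
    rejected. parse_fail_flag() is a hypothetical name used for illustration,
    not a kernel function.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical analogue of the make-it-fail write handler: parse a number
 * and accept it only if it is a valid boolean (0 or 1).
 */
int parse_fail_flag(const char *buf, int *out)
{
    char *end;
    long val;

    errno = 0;
    val = strtol(buf, &end, 0);
    if (end == buf || errno != 0)
        return -EINVAL;     /* not a number, or out of range for long */
    if (val < 0 || val > 1)
        return -EINVAL;     /* a boolean must be exactly 0 or 1 */

    *out = (int)val;
    return 0;
}
```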
     
  • Currently, when an empty PT_NOTE is detected, vmcore initialization
    fails. That is too harsh, because a PT_NOTE can legitimately be empty:
    for example, if a CPU was offlined but the kdump service was never
    restarted, then after a crash the PT_NOTE program header is present but
    contains no data. It's better to warn about the empty PT_NOTE and
    continue to initialise vmcore.

    Ultimately the multiple PT_NOTE entries are merged into a single one, and
    all empty PT_NOTE entries are discarded naturally during the merge. So an
    empty PT_NOTE is not visible to user space and vmcore is as good as
    expected.

    Signed-off-by: WANG Chao
    Cc: Vivek Goyal
    Cc: HATAYAMA Daisuke
    Cc: Greg Pearson
    Cc: Baoquan He
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    WANG Chao
     
  • Eliminate the following warning in proc/vmcore.c:

    fs/proc/vmcore.c:1088:6: warning: no previous prototype for `vmcore_cleanup' [-Wmissing-prototypes]

    [akpm@linux-foundation.org: clean up powerpc, remove unneeded EXPORT_SYMBOL]
    Signed-off-by: Rashika Kheria
    Reviewed-by: Josh Triplett
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rashika Kheria
     
  • get_task_state() uses the most significant bit to report the state to
    user-space. This means that the EXIT_ZOMBIE->EXIT_TRACE->EXIT_DEAD
    transition can be noticed via /proc as a Z -> X -> Z change. Note that
    this was possible even before EXIT_TRACE was introduced.

    This is not really bad, but imho it makes sense to hide EXIT_TRACE from
    user-space completely. So the patch simply swaps EXIT_ZOMBIE and
    EXIT_DEAD; this way EXIT_TRACE will be seen as EXIT_ZOMBIE by user-space.

    Signed-off-by: Oleg Nesterov
    Cc: Jan Kratochvil
    Cc: Michal Schmidt
    Cc: Al Viro
    Cc: Lennart Poettering
    Cc: Roland McGrath
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
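
    The effect of the swap can be sketched in userspace as follows. The bit
    values, the name table and the sk_ identifiers are assumptions chosen for
    illustration; the only properties taken from the commit text are that the
    most significant set bit selects the reported state and that, after the
    swap, EXIT_ZOMBIE is the higher of the two exit bits.

```c
#include <stddef.h>

/* Assumed bit layout, with EXIT_DEAD and EXIT_ZOMBIE already swapped. */
enum {
    SK_TASK_INTERRUPTIBLE   = 0x01,
    SK_TASK_UNINTERRUPTIBLE = 0x02,
    SK_TASK_STOPPED         = 0x04,
    SK_TASK_TRACED          = 0x08,
    SK_EXIT_DEAD            = 0x10,
    SK_EXIT_ZOMBIE          = 0x20,
    SK_EXIT_TRACE           = SK_EXIT_ZOMBIE | SK_EXIT_DEAD,
    SK_STATE_MASK           = 0x3f,
};

/* Index 0 is "no bit set"; index n names bit n-1. */
static const char *const sk_state_names[] = {
    "R (running)",
    "S (sleeping)",
    "D (disk sleep)",
    "T (stopped)",
    "t (tracing stop)",
    "X (dead)",
    "Z (zombie)",
};

/* Portable "find last set": 1-based position of the highest set bit. */
static int sk_fls(unsigned int x)
{
    int n = 0;

    while (x) {
        n++;
        x >>= 1;
    }
    return n;
}

/* The most significant reportable bit decides what user-space sees. */
const char *sk_task_state_name(unsigned int state)
{
    return sk_state_names[sk_fls(state & SK_STATE_MASK)];
}
```

    With this layout the highest bit of the combined EXIT_TRACE value is the
    zombie bit, so the intermediate state is reported as a zombie rather than
    as dead.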
     
  • The /proc/*/pagemap files contain sensitive information and currently
    their mode is 0444. Change this to 0400, so the VFS will prevent
    unprivileged processes from getting file descriptors on arbitrary
    privileged /proc/*/pagemap files.

    This reduces the scope of address space leaking and bypasses by protecting
    already running processes.

    Signed-off-by: Djalal Harouni
    Acked-by: Kees Cook
    Acked-by: Andy Lutomirski
    Cc: Eric W. Biederman
    Cc: Al Viro
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Djalal Harouni
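
    A deliberately simplified sketch of why the mode change matters at open
    time. sk_may_open_read() is a hypothetical helper; it ignores group bits,
    capabilities and any further checks procfs performs, and only models the
    owner versus other read bits.

```c
#include <stdbool.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Simplified permission model: the owner is checked against the owner read
 * bit, everyone else against the "other" read bit.
 */
bool sk_may_open_read(mode_t mode, uid_t owner, uid_t caller)
{
    if (caller == owner)
        return (mode & S_IRUSR) != 0;
    return (mode & S_IROTH) != 0;
}
```

    Under this model a caller with a different uid can open a 0444 file but
    not a 0400 one, while the owner can open both.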
     
  • These procfs files contain sensitive information and currently their
    mode is 0444. Change this to 0400, so the VFS will be able to block
    unprivileged processes from getting file descriptors on arbitrary
    privileged /proc/*/{stack,syscall,personality} files.

    This reduces the scope of ASLR leaking and bypasses by protecting already
    running processes.

    Signed-off-by: Djalal Harouni
    Acked-by: Kees Cook
    Acked-by: Andy Lutomirski
    Cc: Eric W. Biederman
    Cc: Al Viro
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Djalal Harouni
     
  • Replace rcu_assign_pointer(x, NULL) with RCU_INIT_POINTER(x, NULL)

    The rcu_assign_pointer() ensures that the initialization of a structure
    is carried out before storing a pointer to that structure. And in the
    case of the NULL pointer, there is no structure to initialize. So,
    rcu_assign_pointer(p, NULL) can be safely converted to
    RCU_INIT_POINTER(p, NULL)

    Signed-off-by: Monam Agarwal
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Monam Agarwal
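
    The reasoning above can be illustrated with C11 atomics in userspace.
    This is an analogy, not the kernel implementation: a release store stands
    in for rcu_assign_pointer(), a relaxed store for RCU_INIT_POINTER(), and
    the reader uses an acquire load. All sk_ names are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

struct sk_config {
    int value;
};

static _Atomic(struct sk_config *) sk_current;

/* Publishing a structure: its fields must be visible before the pointer. */
void sk_publish(struct sk_config *cfg, int value)
{
    cfg->value = value;
    atomic_store_explicit(&sk_current, cfg, memory_order_release);
}

/* Storing NULL: there is no structure to order against, so no barrier. */
void sk_clear(void)
{
    atomic_store_explicit(&sk_current, (struct sk_config *)NULL,
                          memory_order_relaxed);
}

/* Reader side: pairs with the release store in sk_publish(). */
struct sk_config *sk_read(void)
{
    return atomic_load_explicit(&sk_current, memory_order_acquire);
}
```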
     
  • Currently we have no way to determine from which mount point a file
    has been opened. This information is required for properly dumping
    and restoring file descriptors in the presence of mount namespaces. It's
    possible that two file descriptors are opened using the same path, but
    one fd references a mount point from one namespace while the other fd
    references a mount point from another namespace.

    $ ls -l /proc/1/fd/1
    lrwx------ 1 root root 64 Mar 19 23:54 /proc/1/fd/1 -> /dev/null

    $ cat /proc/1/fdinfo/1
    pos: 0
    flags: 0100002
    mnt_id: 16

    $ cat /proc/1/mountinfo | grep ^16
    16 32 0:4 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs rw,size=1013356k,nr_inodes=253339,mode=755

    Signed-off-by: Andrey Vagin
    Acked-by: Pavel Emelyanov
    Acked-by: Cyrill Gorcunov
    Cc: Rob Landley
    Cc: Al Viro
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Vagin
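
    A small sketch of how a tool might pull the mnt_id value out of fdinfo
    text like the output shown above. sk_parse_mnt_id() is a hypothetical
    helper; it works on a buffer the caller has already read and assumes the
    "mnt_id:" label followed by whitespace and a decimal number.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan the buffer line by line for "mnt_id: <number>".
 * Returns 0 and stores the id on success, -1 if no such line exists.
 */
int sk_parse_mnt_id(const char *fdinfo, int *mnt_id)
{
    const char *line = fdinfo;

    while (line && *line) {
        if (sscanf(line, "mnt_id: %d", mnt_id) == 1)
            return 0;
        line = strchr(line, '\n');  /* advance to the next line, if any */
        if (line)
            line++;
    }
    return -1;
}
```

    The resulting id can then be matched against the first column of
    /proc/<pid>/mountinfo, as the grep in the example does.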
     
  • It should read "reclaimable slab" and not "reclaimable swap".

    Signed-off-by: Luiz Capitulino
    Reviewed-by: Rik van Riel
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random,
    thus further comparisons with other approaches were needed. There are
    two things to consider when dealing with this: the cache hit rate and
    the latency of find_vma(). Improving the hit rate does not necessarily
    translate into finding the vma any faster, as the overhead of any fancy
    caching scheme can be too high to consider.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question. Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme improves on the ~50% baseline hit rate just by adding a few more
    slots to the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 50.61%   | 19.90            |
    | patched        | 73.45%   | 13.58            |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 75.28%   | 11.03            |
    | patched        | 88.09%   | 9.31             |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 70.66%   | 17.14            |
    | patched        | 91.15%   | 12.57            |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while the baseline hit
    rate is just about non-existent. The number of cycles can fluctuate
    anywhere from ~60 to ~116 billion for the baseline scheme, but this
    approach reduces it considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 1.06%    | 91.54            |
    | patched        | 99.97%   | 14.18            |
    +----------------+----------+------------------+

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
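
    The scheme described above (a small per-thread cache, invalidated by
    bumping a sequence number, with the replacement slot chosen from the page
    number of the address) can be sketched in userspace as follows. All sk_
    names, the slot count and the page shift are assumptions for
    illustration, and the rare sequence-number overflow case, where every
    cache sharing the address space is flushed, is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SHIFT  12   /* assumed 4 KiB pages */
#define SK_CACHE_SLOTS 4    /* assumed size; must be a power of two */

struct sk_region {
    uintptr_t start, end;   /* covers [start, end) */
};

struct sk_mm {
    uint32_t seqnum;        /* bumped whenever the mappings change */
};

struct sk_thread_cache {
    uint32_t seqnum;        /* value of mm->seqnum the slots are valid for */
    const struct sk_region *slot[SK_CACHE_SLOTS];
};

/* Drop every slot if the address space changed since the last use. */
static void sk_cache_sync(struct sk_thread_cache *tc, const struct sk_mm *mm)
{
    size_t i;

    if (tc->seqnum == mm->seqnum)
        return;
    tc->seqnum = mm->seqnum;
    for (i = 0; i < SK_CACHE_SLOTS; i++)
        tc->slot[i] = NULL;
}

/* Invalidation is just a counter bump; no per-thread work is needed here. */
void sk_cache_invalidate(struct sk_mm *mm)
{
    mm->seqnum++;
}

/* Return the cached region containing addr, or NULL on a miss. */
const struct sk_region *sk_cache_find(struct sk_thread_cache *tc,
                                      const struct sk_mm *mm, uintptr_t addr)
{
    size_t i;

    sk_cache_sync(tc, mm);
    for (i = 0; i < SK_CACHE_SLOTS; i++) {
        const struct sk_region *r = tc->slot[i];

        if (r && r->start <= addr && addr < r->end)
            return r;
    }
    return NULL;
}

/* After a miss and a slow lookup, remember the region by page number. */
void sk_cache_update(struct sk_thread_cache *tc, const struct sk_mm *mm,
                     uintptr_t addr, const struct sk_region *r)
{
    sk_cache_sync(tc, mm);
    tc->slot[(addr >> SK_PAGE_SHIFT) & (SK_CACHE_SLOTS - 1)] = r;
}
```

    Because invalidation touches only the shared counter, its cost does not
    depend on the number of threads; each thread pays for a flush lazily on
    its next lookup.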
     

05 Apr, 2014

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Major changes for 3.14 include support for the newly added ZERO_RANGE
    and COLLAPSE_RANGE fallocate operations, and scalability improvements
    in the jbd2 layer and in xattr handling when the extended attributes
    spill over into an external block.

    Other than that, the usual clean ups and minor bug fixes"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
    ext4: fix premature freeing of partial clusters split across leaf blocks
    ext4: remove unneeded test of ret variable
    ext4: fix comment typo
    ext4: make ext4_block_zero_page_range static
    ext4: atomically set inode->i_flags in ext4_set_inode_flags()
    ext4: optimize Hurd tests when reading/writing inodes
    ext4: kill i_version support for Hurd-castrated file systems
    ext4: each filesystem creates and uses its own mb_cache
    fs/mbcache.c: doucple the locking of local from global data
    fs/mbcache.c: change block and index hash chain to hlist_bl_node
    ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
    ext4: refactor ext4_fallocate code
    ext4: Update inode i_size after the preallocation
    ext4: fix partial cluster handling for bigalloc file systems
    ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
    ext4: only call sync_filesystm() when remounting read-only
    fs: push sync_filesystem() down to the file system's remount_fs()
    jbd2: improve error messages for inconsistent journal heads
    jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
    jbd2: minimize region locked by j_list_lock in journal_get_create_access()
    ...

    Linus Torvalds
     

04 Apr, 2014

1 commit

  • Reclaim will be leaving shadow entries in the page cache radix tree upon
    evicting the real page. As those pages are found from the LRU, an
    iput() can lead to the inode being freed concurrently. At this point,
    reclaim must no longer install shadow pages because the inode freeing
    code needs to ensure the page tree is really empty.

    Add an address_space flag, AS_EXITING, that the inode freeing code sets
    under the tree lock before doing the final truncate. Reclaim will check
    for this flag before installing shadow pages.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

03 Apr, 2014

1 commit

  • Pull devicetree changes from Grant Likely:
    "Updates to devicetree core code. This branch contains the following
    notable changes:

    - add reserved memory binding
    - make struct device_node a kobject and remove legacy
    /proc/device-tree
    - ePAPR conformance fixes
    - update in-kernel DTC copy to version v1.4.0
    - preparatory changes for dynamic device tree overlays
    - minor bug fixes and documentation changes

    The most significant change in this branch is the conversion of struct
    device_node to be a kobject that is exposed via sysfs and removal of
    the old /proc/device-tree code. This simplifies the device tree
    handling code and tightens up the lifecycle on device tree nodes.

    [updated: added fix for dangling select PROC_DEVICETREE]"

    * tag 'dt-for-linus' of git://git.secretlab.ca/git/linux: (29 commits)
    dt: Remove dangling "select PROC_DEVICETREE"
    of: Add support for ePAPR "stdout-path" property
    of: device_node kobject lifecycle fixes
    of: only scan for reserved mem when fdt present
    powerpc: add support for reserved memory defined by device tree
    arm64: add support for reserved memory defined by device tree
    of: add missing major vendors
    of: add vendor prefix for SMSC
    of: remove /proc/device-tree
    of/selftest: Add self tests for manipulation of properties
    of: Make device nodes kobjects so they show up in sysfs
    arm: add support for reserved memory defined by device tree
    drivers: of: add support for custom reserved memory drivers
    drivers: of: add initialization code for dynamic reserved memory
    drivers: of: add initialization code for static reserved memory
    of: document bindings for reserved-memory nodes
    Revert "of: fix of_update_property()"
    kbuild: dtbs_install: new make target
    ARM: mvebu: Allows to get the SoC ID even without PCI enabled
    of: Allows to use the PCI translator without the PCI core
    ...

    Linus Torvalds
     

02 Apr, 2014

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • Pull timer updates from Ingo Molnar:
    "The main purpose is to fix a full dynticks bug related to
    virtualization, where steal time accounting appears to be zero in
    /proc/stat even after a few seconds of competing guests running busy
    loops on the same host CPU. It's not a regression though as it was
    there since the beginning.

    The other commits are preparatory work to fix the bug and various
    cleanups"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    arch: Remove stub cputime.h headers
    sched: Remove needless round trip nsecs tick conversion of steal time
    cputime: Fix jiffies based cputime assumption on steal accounting
    cputime: Bring cputime -> nsecs conversion
    cputime: Default implementation of nsecs -> cputime conversion
    cputime: Fix nsecs_to_cputime() return type cast

    Linus Torvalds
     

31 Mar, 2014

1 commit


20 Mar, 2014

1 commit


13 Mar, 2014

2 commits

  • The architectures that override cputime_t (s390, ppc) don't provide
    any version of nsecs_to_cputime(). Indeed, this backend-specific
    cputime_t implementation only exists when
    CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y, under which the core code doesn't
    make any use of nsecs_to_cputime().

    At least for now.

    We are going to make broader use of it, so let's provide a default
    version with microsecond granularity. It should be good enough for most
    use cases.

    Cc: Ingo Molnar
    Cc: Marcelo Tosatti
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Acked-by: Rik van Riel
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
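
    A sketch of what a default built on microsecond granularity could look
    like: the nanosecond value is first truncated to microseconds and then
    handed to the existing microsecond conversion. The sk_ names and the
    choice of a 10 ms cputime unit are assumptions made so the example can
    run in userspace.

```c
#include <stdint.h>

#define SK_NSEC_PER_USEC 1000ULL
#define SK_USEC_PER_UNIT 10000ULL   /* assumed: one cputime unit is 10 ms */

typedef uint64_t sk_cputime_t;

/* Stand-in for a backend-provided usecs -> cputime conversion. */
static sk_cputime_t sk_usecs_to_cputime(uint64_t usecs)
{
    return usecs / SK_USEC_PER_UNIT;
}

/* Default nsecs -> cputime conversion layered on the usecs one. */
sk_cputime_t sk_nsecs_to_cputime(uint64_t nsecs)
{
    return sk_usecs_to_cputime(nsecs / SK_NSEC_PER_USEC);
}
```

    Anything below one microsecond is lost in the first division, which is
    the granularity trade-off the message accepts.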
     
  • Previously, the no-op "mount -o remount /dev/xxx" operation, when the
    file system is already mounted read-write, caused an implied,
    unconditional syncfs(). This seems pretty stupid; it's certainly not
    documented or guaranteed to do this, nor is it particularly useful,
    except in the case where the file system was mounted rw and is getting
    remounted read-only.

    However, it's possible that there might be some file systems that are
    actually depending on this behavior. In most file systems, it's
    probably fine to only call sync_filesystem() when transitioning from
    read-write to read-only, and there are some file systems where this is
    not needed at all (for example, for a pseudo-filesystem or something
    like romfs).

    Signed-off-by: "Theodore Ts'o"
    Cc: linux-fsdevel@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: Artem Bityutskiy
    Cc: Adrian Hunter
    Cc: Evgeniy Dushistov
    Cc: Jan Kara
    Cc: OGAWA Hirofumi
    Cc: Anders Larsen
    Cc: Phillip Lougher
    Cc: Kees Cook
    Cc: Mikulas Patocka
    Cc: Petr Vandrovec
    Cc: xfs@oss.sgi.com
    Cc: linux-btrfs@vger.kernel.org
    Cc: linux-cifs@vger.kernel.org
    Cc: samba-technical@lists.samba.org
    Cc: codalist@coda.cs.cmu.edu
    Cc: linux-ext4@vger.kernel.org
    Cc: linux-f2fs-devel@lists.sourceforge.net
    Cc: fuse-devel@lists.sourceforge.net
    Cc: cluster-devel@redhat.com
    Cc: linux-mtd@lists.infradead.org
    Cc: jfs-discussion@lists.sourceforge.net
    Cc: linux-nfs@vger.kernel.org
    Cc: linux-nilfs@vger.kernel.org
    Cc: linux-ntfs-dev@lists.sourceforge.net
    Cc: ocfs2-devel@oss.oracle.com
    Cc: reiserfs-devel@vger.kernel.org

    Theodore Ts'o
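
    The "only sync when going from read-write to read-only" policy mentioned
    above can be expressed as a small predicate. This is a sketch of the
    decision a file system's remount handler might make, not code from the
    patch; SK_RDONLY and sk_remount_needs_sync() are hypothetical names.

```c
#include <stdbool.h>

#define SK_RDONLY 0x1u  /* stand-in for the read-only mount flag */

/* True only for the read-write -> read-only transition. */
bool sk_remount_needs_sync(unsigned int old_flags, unsigned int new_flags)
{
    return !(old_flags & SK_RDONLY) && (new_flags & SK_RDONLY);
}
```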
     

12 Mar, 2014

1 commit

  • The same data is now available in sysfs, so we can remove the code
    that exports it in /proc and replace it with a symlink to the sysfs
    version.

    Tested on versatile qemu model and mpc5200 eval board. More testing
    would be appreciated.

    v5: Fixed up conflicts with mainline changes

    Signed-off-by: Grant Likely
    Cc: Rob Herring
    Cc: Benjamin Herrenschmidt
    Cc: David S. Miller
    Cc: Nathan Fontenot
    Cc: Pantelis Antoniou

    Grant Likely
     

11 Mar, 2014

1 commit

  • The expected logic of proc_map_files_get_link() is either to return 0
    and initialize 'path' or return an error and leave 'path' uninitialized.

    By the time dname_to_vma_addr() returns 0, the corresponding vma may
    already be gone. In this case the path is not initialized but the
    return value is still 0. This results in a 'general protection fault'
    inside d_path().

    Steps to reproduce:

    CONFIG_CHECKPOINT_RESTORE=y

    fd = open(...);
    while (1) {
        mmap(fd, ...);
        munmap(fd, ...);
    }

    ls -la /proc/$PID/map_files

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=68991

    Signed-off-by: Artem Fetishev
    Signed-off-by: Aleksandr Terekhov
    Reported-by:
    Acked-by: Pavel Emelyanov
    Acked-by: Cyrill Gorcunov
    Reviewed-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Artem Fetishev
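
    The contract stated at the top of this message (return 0 and initialize
    'path', or return an error and leave 'path' untouched) is easiest to keep
    when the return code starts out as an error and only becomes 0 at the
    point where 'path' is written. A userspace sketch with hypothetical sk_
    names, using a plain array in place of the vma lookup:

```c
#include <errno.h>
#include <stddef.h>

struct sk_mapping {
    unsigned long start, end;
    const char *path;
};

/*
 * Returns 0 and sets *path if a mapping with exactly this range exists;
 * otherwise returns -ENOENT and does not touch *path.
 */
int sk_get_link(const struct sk_mapping *maps, size_t n,
                unsigned long start, unsigned long end, const char **path)
{
    int rc = -ENOENT;   /* assume failure until the mapping is found */
    size_t i;

    for (i = 0; i < n; i++) {
        if (maps[i].start == start && maps[i].end == end) {
            *path = maps[i].path;
            rc = 0;     /* success is set together with *path */
            break;
        }
    }
    return rc;
}
```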
     

04 Mar, 2014

1 commit

  • Commit bf6bddf1924e ("mm: introduce compaction and migration for
    ballooned pages") introduces page_count(page) into memory compaction
    which dereferences page->first_page if PageTail(page).

    This results in a very rare NULL pointer dereference on the
    aforementioned page_count(page). Indeed, anything that does
    compound_head(), including page_count() is susceptible to racing with
    prep_compound_page() and seeing a NULL or dangling page->first_page
    pointer.

    This patch uses Andrea's implementation of compound_trans_head() that
    deals with such a race and makes it the default compound_head()
    implementation. This includes a read memory barrier that ensures that
    if PageTail(head) is true that we return a head page that is neither
    NULL nor dangling. The patch then adds a store memory barrier to
    prep_compound_page() to ensure page->first_page is set.

    This is the safest way to ensure we see the head page that we are
    expecting; PageTail(page) is already in the unlikely() path and the
    memory barriers are unfortunately required.

    Hugetlbfs is the exception: we don't enforce a store memory barrier
    during init since no race is possible.

    Signed-off-by: David Rientjes
    Cc: Holger Kiehl
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
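
    The barrier pairing described above can be approximated with C11 atomics
    in userspace. This is an analogy under stated assumptions, not the kernel
    code: a release store plays the role of the store barrier in
    prep_compound_page(), an acquire fence plays the role of the read
    barrier, and struct sk_page and the sk_ functions are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct sk_page {
    atomic_bool tail;                       /* models PageTail() */
    _Atomic(struct sk_page *) first_page;   /* models page->first_page */
};

/* Writer: make first_page visible before the page is marked as a tail. */
void sk_prep_tail(struct sk_page *page, struct sk_page *head)
{
    atomic_store_explicit(&page->first_page, head, memory_order_relaxed);
    atomic_store_explicit(&page->tail, true, memory_order_release);
}

/*
 * Reader: read first_page, then re-check the tail flag after a read
 * barrier, so the head is only returned if the page was still a tail.
 */
struct sk_page *sk_compound_head(struct sk_page *page)
{
    if (atomic_load_explicit(&page->tail, memory_order_acquire)) {
        struct sk_page *head =
            atomic_load_explicit(&page->first_page, memory_order_relaxed);

        atomic_thread_fence(memory_order_acquire);
        if (atomic_load_explicit(&page->tail, memory_order_relaxed))
            return head;
    }
    return page;
}
```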
     

11 Feb, 2014

1 commit

  • Currently, update_note_header_size_elf64() and
    update_note_header_size_elf32() will add the size of a PT_NOTE entry to
    real_sz even if that causes real_sz to exceed max_sz. This patch
    corrects the while loop logic in those routines to ensure that does not
    happen and prints a warning if a PT_NOTE entry is dropped. If zero
    PT_NOTE entries are found or this condition is encountered because the
    only entry was dropped, a warning is printed and an error is returned.

    One possible negative side effect of exceeding the max_sz limit is an
    allocation failure in merge_note_headers_elf64() or
    merge_note_headers_elf32() which would produce console output such as
    the following while booting the crash kernel.

    vmalloc: allocation failure: 14076997632 bytes
    swapper/0: page allocation failure: order:0, mode:0x80d2
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.10.0-gbp1 #7
    Call Trace:
    dump_stack+0x19/0x1b
    warn_alloc_failed+0xf0/0x160
    __vmalloc_node_range+0x19e/0x250
    vmalloc_user+0x4c/0x70
    merge_note_headers_elf64.constprop.9+0x116/0x24a
    vmcore_init+0x2d4/0x76c
    do_one_initcall+0xe2/0x190
    kernel_init_freeable+0x17c/0x207
    kernel_init+0xe/0x180
    ret_from_fork+0x7c/0xb0

    Kdump: vmcore not initialized

    kdump: dump target is /dev/sda4
    kdump: saving to /sysroot//var/crash/127.0.0.1-2014.01.28-13:58:52/
    kdump: saving vmcore-dmesg.txt
    Cannot open /proc/vmcore: No such file or directory
    kdump: saving vmcore-dmesg.txt failed
    kdump: saving vmcore
    kdump: saving vmcore failed

    This type of failure has been seen on a four socket prototype system
    with certain memory configurations. Most PT_NOTE sections have a single
    entry similar to:

    n_namesz = 0x5
    n_descsz = 0x150
    n_type = 0x1

    Occasionally, a second entry is encountered with very large n_namesz and
    n_descsz sizes:

    n_namesz = 0x80000008
    n_descsz = 0x510ae163
    n_type = 0x80000008

    The source of these extra entries is not yet known; they seem bogus, but
    they shouldn't cause the crash dump to fail.

    Signed-off-by: Greg Pearson
    Acked-by: Vivek Goyal
    Cc: HATAYAMA Daisuke
    Cc: Michael Holzheu
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Pearson
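
    A userspace sketch of the corrected accumulation loop: a note's size is
    added to real_sz only if the total stays within max_sz, so a bogus entry
    like the second one shown above is dropped instead of inflating the
    size. The struct layout mirrors the three 32-bit fields printed in the
    message; sk_bounded_note_size() and the extra "header must fit" check are
    assumptions of this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sk_note_hdr {
    uint32_t n_namesz;
    uint32_t n_descsz;
    uint32_t n_type;
};

/* Name and descriptor are each padded to a 4-byte boundary. */
static uint64_t sk_align4(uint64_t v)
{
    return (v + 3) & ~(uint64_t)3;
}

/*
 * Walk the notes in buf and return the number of bytes occupied by the
 * entries that fit within max_sz. *dropped is set to 1 if an entry had to
 * be discarded because it would exceed max_sz.
 */
uint64_t sk_bounded_note_size(const unsigned char *buf, uint64_t max_sz,
                              unsigned int *dropped)
{
    uint64_t real_sz = 0;

    *dropped = 0;
    while (real_sz + sizeof(struct sk_note_hdr) <= max_sz) {
        struct sk_note_hdr hdr;
        uint64_t sz;

        memcpy(&hdr, buf + real_sz, sizeof(hdr));
        if (hdr.n_namesz == 0)
            break;                      /* end of the note list */

        sz = sizeof(hdr) + sk_align4(hdr.n_namesz) + sk_align4(hdr.n_descsz);
        if (real_sz + sz > max_sz) {
            *dropped = 1;               /* entry would overflow: drop it */
            break;
        }
        real_sz += sz;
    }
    return real_sz;
}
```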
     

24 Jan, 2014

10 commits

  • Change the remaining next_thread (ab)users to use while_each_thread().

    The last user which should be changed is next_tid(), but we can't do this
    now.

    __exit_signal() and complete_signal() are fine, they actually need
    next_thread() logic.

    This patch (of 3):

    do_task_stat() can use while_each_thread(), no changes in
    the compiled code.

    Signed-off-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Al Viro
    Cc: Kees Cook
    Reviewed-by: Sameer Nanda
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • PROC_FS is a bool, so this code is either present or absent. It will
    never be modular, so using module_init as an alias for __initcall is
    rather misleading.

    Fix this up now, so that we can relocate module_init from init.h into
    module.h in the future. If we don't do this, we'd have to add module.h to
    obviously non-modular code, and that would be ugly at best.

    Note that direct use of __initcall is discouraged, vs. one of the
    priority categorized subgroups. As __initcall gets mapped onto
    device_initcall, our use of fs_initcall (which makes sense for fs code)
    will thus change these registrations from level 6-device to level 5-fs
    (i.e. slightly earlier). However, no impact of that small difference
    has been observed during testing, nor is any expected.

    Also note that this change uncovers a missing semicolon bug in the
    registration of vmcore_init as an initcall.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     
  • Distribution kernels might want to build in support for /proc/device-tree
    for kernels that might end up running on hardware that doesn't support
    openfirmware. This results in an empty /proc/device-tree existing.
    Remove it if the OFW root node doesn't exist.

    This situation actually confuses grub2, resulting in install failures.
    grub2 sees the /proc/device-tree and picks the wrong install target cf.
    http://bzr.savannah.gnu.org/lh/grub/trunk/grub/annotate/4300/util/grub-install.in#L311
    grub should be more robust, but still, leaving an empty proc dir seems
    pointless.

    Addresses https://bugzilla.redhat.com/show_bug.cgi?id=818378.

    Signed-off-by: Dave Jones
    Cc: Al Viro
    Cc: Paul Mackerras
    Cc: Josh Boyer
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • Use existing accessors proc_set_user() and proc_set_size() to set
    attributes. Just a cleanup.

    Signed-off-by: Rui Xiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rui Xiang
     
  • 1. proc_task_readdir()->first_tid() path truncates f_pos to int; this
    is wrong even on 64bit.

    We could check that f_pos < PID_MAX or even INT_MAX in
    proc_task_readdir(), but this patch simply checks for the potential
    overflow in first_tid(); this check is a nop on 64bit. We do not care if
    it was negative and the new unsigned value is huge; all we need to
    ensure is that we never wrongly return !NULL.

    2. Remove the 2nd "nr != 0" check before get_nr_threads(),
    nr_threads == 0 is not distinguishable from !pid_task() above.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Michal Hocko
    Cc: Sameer Nanda
    Cc: Sergey Dyasly
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
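
    The overflow check in point 1 amounts to narrowing the position and
    verifying that nothing was lost. The sketch below emulates a 32-bit
    unsigned long with uint32_t so the behaviour is the same on any host;
    sk_pos_fits() is a hypothetical name and the types are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Narrow a 64-bit file position to a 32-bit counter. Returns false when
 * the value does not survive the round trip, so the caller can bail out
 * instead of using a truncated position.
 */
bool sk_pos_fits(int64_t f_pos, uint32_t *nr)
{
    uint32_t narrowed = (uint32_t)f_pos;

    if ((int64_t)narrowed != f_pos)
        return false;
    *nr = narrowed;
    return true;
}
```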
     
  • proc_task_readdir() does not really need "leader", first_tid() has to
    revalidate it anyway. Just pass proc_pid(inode) to first_tid() instead,
    it can do pid_task(PIDTYPE_PID) itself and read ->group_leader only if
    necessary.

    The patch also extracts the "inode is dead" code from
    pid_delete_dentry(dentry) into the new trivial helper,
    proc_inode_is_dead(inode), proc_task_readdir() uses it to return -ENOENT
    if this dir was removed.

    This is a bit racy, but the race is very unlikely and the getdents() after
    opendir() can see the empty "." + ".." dir only once.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Michal Hocko
    Cc: Sameer Nanda
    Cc: Sergey Dyasly
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Rewrite the main loop to use while_each_thread() instead of
    next_thread(). We are going to fix or replace while_each_thread();
    next_thread() should be avoided whenever possible.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Michal Hocko
    Cc: Sameer Nanda
    Cc: Sergey Dyasly
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • proc_task_readdir() verifies that the result of get_proc_task() is
    pid_alive() and thus its ->group_leader is fine too. However this is not
    necessarily true after rcu_read_unlock(), we need to recheck this again
    after first_tid() does rcu_read_lock(). Otherwise
    leader->thread_group.next (used by next_thread()) can be invalid if the
    rcu grace period expires in between.

    The race is subtle and unlikely, but still it is possible afaics. To
    simplify, let's ignore the "likely" case when tid != 0; f_version can be
    cleared by proc_task_operations->llseek().

    Suppose we have a main thread M and its subthread T. Suppose that f_pos
    == 3, iow first_tid() should return T. Now suppose that the following
    happens between rcu_read_unlock() and rcu_read_lock():

    1. T execs and becomes the new leader. This removes M from
    ->thread_group but next_thread(M) is still T.

    2. T creates another thread X which does exec as well, T
    goes away.

    3. X creates another subthread, this increments nr_threads.

    4. first_tid() does next_thread(M) and returns the already
    dead T.

    Note also that we need 2. and 3. only because of get_nr_threads() check,
    and this check was supposed to be optimization only.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Michal Hocko
    Cc: Sameer Nanda
    Cc: Sergey Dyasly
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • get_task_state() and task_state_array[] look confusing and suboptimal:
    it is not clear what they can actually report to user-space, and
    task_state_array[] bloats .data for no reason.

    1. state = (tsk->state & TASK_REPORT) | tsk->exit_state is not
    clear. TASK_REPORT is self-documenting but it is not clear
    what ->exit_state can add.

    Move the potential exit_state's (EXIT_ZOMBIE and EXIT_DEAD)
    into TASK_REPORT and use it to calculate the final result.

    2. With the change above it is obvious that task_state_array[]
    has the unused entries just to make BUILD_BUG_ON() happy.

    Change this BUILD_BUG_ON() to use TASK_REPORT rather than
    TASK_STATE_MAX and shrink task_state_array[].

    3. Turn the "while (state)" loop into fls(state).
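
    Point 3 can be illustrated with a small user-space sketch. The bit values,
    the sk_ names and the portable sk_fls() helper below are assumptions made
    for this example only; they are not taken from the patch. The idea is that
    once the state is masked down to the reportable bits, the 1-based position
    of its highest set bit indexes the name table directly, so no loop is
    needed in the lookup itself.

```c
/* Hypothetical bit values for illustration; the kernel's masks may differ. */
#define SK_SLEEPING 0x01u
#define SK_DISK     0x02u
#define SK_STOPPED  0x04u
#define SK_TRACED   0x08u
#define SK_DEAD     0x10u
#define SK_ZOMBIE   0x20u
#define SK_REPORT   (SK_SLEEPING | SK_DISK | SK_STOPPED | \
                     SK_TRACED | SK_DEAD | SK_ZOMBIE)

/* One entry for "no bit set" plus one entry per reportable bit. */
static const char *const sk_state_names[] = {
    "R (running)",      /* no bit set */
    "S (sleeping)",     /* 0x01 */
    "D (disk sleep)",   /* 0x02 */
    "T (stopped)",      /* 0x04 */
    "t (tracing stop)", /* 0x08 */
    "X (dead)",         /* 0x10 */
    "Z (zombie)",       /* 0x20 */
};

/* Portable stand-in for fls(): 1-based index of the highest set bit,
 * or 0 when x == 0. */
static unsigned int sk_fls(unsigned int x)
{
    unsigned int pos = 0;

    while (x) {
        pos++;
        x >>= 1;
    }
    return pos;
}

/* Loop-style lookup: advance through the table once per shifted bit. */
static const char *sk_state_name_loop(unsigned int state)
{
    const char *const *p = &sk_state_names[0];

    state &= SK_REPORT;
    while (state) {
        p++;
        state >>= 1;
    }
    return *p;
}

/* fls-style lookup: index the table by the highest reportable bit. */
static const char *sk_state_name(unsigned int state)
{
    return sk_state_names[sk_fls(state & SK_REPORT)];
}
```

    Both lookups return the same entry for any input. Masking with SK_REPORT
    bounds the index, which is why the table needs no spare entries.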

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: David Laight
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • stable_page_flags() checks !PageHuge && PageTransCompound && PageLRU to
    determine whether a given page is a thp. But sometimes that is not enough,
    and we fail to detect a thp while it is on a pagevec. This happens only
    for a few seconds after LRU list operations, but it makes it difficult
    to control applications that depend on this flag.

    So this patch adds another check, PageAnon, to detect thps on a pagevec.
    It might not extend to a future thp pagecache, but it's OK at least for
    now.
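
    The change in the predicate can be sketched as below. The struct sk_page
    type and both function names are hypothetical, and combining the new
    PageAnon test with PageLRU by a logical OR is this sketch's reading of
    the description, not a quote from the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the page flag tests named in the message. */
struct sk_page {
    bool huge;           /* PageHuge */
    bool trans_compound; /* PageTransCompound */
    bool lru;            /* PageLRU */
    bool anon;           /* PageAnon */
};

/* Check as described before the patch: misses a thp sitting on a pagevec,
 * because such a page does not have the LRU flag set yet. */
static bool sk_is_thp_old(const struct sk_page *p)
{
    return !p->huge && p->trans_compound && p->lru;
}

/* Check with the extra PageAnon test (assumed to be OR-ed with PageLRU). */
static bool sk_is_thp_new(const struct sk_page *p)
{
    return !p->huge && p->trans_compound && (p->lru || p->anon);
}
```

    An anonymous thp that is temporarily off the LRU is rejected by the first
    function and accepted by the second.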

    Signed-off-by: Naoya Horiguchi
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

22 Jan, 2014

1 commit

  • Many load balancing and workload placing programs check /proc/meminfo to
    estimate how much free memory is available. They generally do this by
    adding up "free" and "cached", which was fine ten years ago, but is
    pretty much guaranteed to be wrong today.

    It is wrong because Cached includes memory that is not freeable as page
    cache, for example shared memory segments, tmpfs, and ramfs, and it does
    not include reclaimable slab memory, which can take up a large fraction
    of system memory on mostly idle systems with lots of files.

    Currently, the amount of memory that is available for a new workload,
    without pushing the system into swap, can be estimated from MemFree,
    Active(file), Inactive(file), and SReclaimable, as well as the "low"
    watermarks from /proc/zoneinfo.

    However, this may change in the future, and user space really should not
    be expected to know kernel internals to come up with an estimate for the
    amount of free memory.

    It is more convenient to provide such an estimate in /proc/meminfo. If
    things change in the future, we only have to change it in one place.
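
    As a rough illustration of how the inputs listed above could be combined,
    here is a user-space sketch. The struct, the function names and the rule
    of treating at most half of the page cache and half of the reclaimable
    slab (capped at the low watermark) as not freeable are assumptions of
    this example; the message itself only names the inputs.

```c
/* Hypothetical sample of the inputs named in the message, all in kB. */
struct sk_mem_sample {
    long free_kb;          /* MemFree */
    long active_file_kb;   /* Active(file) */
    long inactive_file_kb; /* Inactive(file) */
    long sreclaimable_kb;  /* SReclaimable */
    long low_watermark_kb; /* sum of the "low" watermarks over all zones */
};

static long sk_min(long a, long b)
{
    return a < b ? a : b;
}

/* Estimate how much memory a new workload could use without swapping. */
static long sk_estimate_available_kb(const struct sk_mem_sample *m)
{
    long pagecache = m->active_file_kb + m->inactive_file_kb;
    long slab = m->sreclaimable_kb;
    long available;

    /* Free memory below the low watermark is not usable by the workload. */
    available = m->free_kb - m->low_watermark_kb;

    /* Assume part of the page cache has to stay to avoid thrashing. */
    pagecache -= sk_min(pagecache / 2, m->low_watermark_kb);
    available += pagecache;

    /* Same assumption for reclaimable slab. */
    slab -= sk_min(slab / 2, m->low_watermark_kb);
    available += slab;

    return available < 0 ? 0 : available;
}
```

    With free = 1000, Active(file) = 400, Inactive(file) = 600,
    SReclaimable = 200 and a low watermark of 100 (all kB), the sketch yields
    900 + 900 + 100 = 1900 kB.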

    Signed-off-by: Rik van Riel
    Reported-by: Erik Mouw
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

13 Dec, 2013

1 commit

  • Commit fad1a86e25e0 ("procfs: call default get_unmapped_area on
    MMU-present architectures"), as its title says, took care of only the
    MMU case, leaving the !MMU side still in the regressed state (returning
    -EIO in all cases where pde->proc_fops->get_unmapped_area is NULL).

    From the fad1a86e25e0 changelog:

    "Commit c4fe24485729 ("sparc: fix PCI device proc file mmap(2)") added
    proc_reg_get_unmapped_area in proc_reg_file_ops and
    proc_reg_file_ops_no_compat, by which now mmap always returns EIO if
    get_unmapped_area method is not defined for the target procfs file, which
    causes regression of mmap on /proc/vmcore.

    To address this issue, like get_unmapped_area(), call default
    current->mm->get_unmapped_area on MMU-present architectures if
    pde->proc_fops->get_unmapped_area, i.e. the one in actual file operation
    in the procfs file, is not defined"
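
    The fallback described in the quoted changelog can be sketched in user
    space as follows. All sk_ names and the simplified two-argument callback
    signature are hypothetical; passing NULL as the default merely models a
    configuration where no default method exists, and returning -EIO in that
    branch mirrors the regressed behaviour described above rather than any
    particular fix.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified, hypothetical callback type for this sketch. */
typedef unsigned long (*sk_get_area_fn)(unsigned long addr, unsigned long len);

/* Stand-in for a default method: accept the caller's hint as-is. */
static unsigned long sk_default_get_area(unsigned long addr, unsigned long len)
{
    (void)len;
    return addr;
}

/* Prefer the file's own method, then the default, else report -EIO. */
static unsigned long sk_pick_area(sk_get_area_fn file_specific,
                                  sk_get_area_fn mm_default,
                                  unsigned long addr, unsigned long len)
{
    sk_get_area_fn fn = file_specific != NULL ? file_specific : mm_default;

    if (fn == NULL)
        return (unsigned long)-EIO;
    return fn(addr, len);
}
```

    Calling sk_pick_area(NULL, sk_default_get_area, addr, len) succeeds where
    a lookup without the fallback would have produced the error value.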

    Signed-off-by: Jan Beulich
    Cc: HATAYAMA Daisuke
    Cc: Alexey Dobriyan
    Cc: David S. Miller
    Cc: [3.12.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     

22 Nov, 2013

1 commit

  • Pull audit updates from Eric Paris:
    "Nothing amazing. Formatting, small bug fixes, couple of fixes where
    we didn't get records due to some old VFS changes, and a change to how
    we collect execve info..."

    Fixed conflict in fs/exec.c as per Eric and linux-next.

    * git://git.infradead.org/users/eparis/audit: (28 commits)
    audit: fix type of sessionid in audit_set_loginuid()
    audit: call audit_bprm() only once to add AUDIT_EXECVE information
    audit: move audit_aux_data_execve contents into audit_context union
    audit: remove unused envc member of audit_aux_data_execve
    audit: Kill the unused struct audit_aux_data_capset
    audit: do not reject all AUDIT_INODE filter types
    audit: suppress stock memalloc failure warnings since already managed
    audit: log the audit_names record type
    audit: add child record before the create to handle case where create fails
    audit: use given values in tty_audit enable api
    audit: use nlmsg_len() to get message payload length
    audit: use memset instead of trying to initialize field by field
    audit: fix info leak in AUDIT_GET requests
    audit: update AUDIT_INODE filter rule to comparator function
    audit: audit feature to set loginuid immutable
    audit: audit feature to only allow unsetting the loginuid
    audit: allow unsetting the loginuid (with priv)
    audit: remove CONFIG_AUDIT_LOGINUID_IMMUTABLE
    audit: loginuid functions coding style
    selinux: apply selinux checks on new audit message types
    ...

    Linus Torvalds
     

16 Nov, 2013

1 commit


15 Nov, 2013

1 commit