03 Apr, 2019

1 commit

  • commit 23da9588037ecdd4901db76a5b79a42b529c4ec3 upstream.

    Syzkaller reports:

    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] SMP KASAN PTI
    CPU: 1 PID: 5373 Comm: syz-executor.0 Not tainted 5.0.0-rc8+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
    RIP: 0010:put_links+0x101/0x440 fs/proc/proc_sysctl.c:1599
    Code: 00 0f 85 3a 03 00 00 48 8b 43 38 48 89 44 24 20 48 83 c0 38 48 89 c2 48 89 44 24 28 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 3c 02 00 0f 85 fe 02 00 00 48 8b 74 24 20 48 c7 c7 60 2a 9d 91
    RSP: 0018:ffff8881d828f238 EFLAGS: 00010202
    RAX: dffffc0000000000 RBX: ffff8881e01b1140 RCX: ffffffff8ee98267
    RDX: 0000000000000007 RSI: ffffc90001479000 RDI: ffff8881e01b1178
    RBP: dffffc0000000000 R08: ffffed103ee27259 R09: ffffed103ee27259
    R10: 0000000000000001 R11: ffffed103ee27258 R12: fffffffffffffff4
    R13: 0000000000000006 R14: ffff8881f59838c0 R15: dffffc0000000000
    FS: 00007f072254f700(0000) GS:ffff8881f7100000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fff8b286668 CR3: 00000001f0542002 CR4: 00000000007606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    PKRU: 55555554
    Call Trace:
    drop_sysctl_table+0x152/0x9f0 fs/proc/proc_sysctl.c:1629
    get_subdir fs/proc/proc_sysctl.c:1022 [inline]
    __register_sysctl_table+0xd65/0x1090 fs/proc/proc_sysctl.c:1335
    br_netfilter_init+0xbc/0x1000 [br_netfilter]
    do_one_initcall+0xfa/0x5ca init/main.c:887
    do_init_module+0x204/0x5f6 kernel/module.c:3460
    load_module+0x66b2/0x8570 kernel/module.c:3808
    __do_sys_finit_module+0x238/0x2a0 kernel/module.c:3902
    do_syscall_64+0x147/0x600 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x462e99
    Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007f072254ec58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
    RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99
    RDX: 0000000000000000 RSI: 0000000020000280 RDI: 0000000000000003
    RBP: 00007f072254ec70 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00007f072254f6bc
    R13: 00000000004bcefa R14: 00000000006f6fb0 R15: 0000000000000004
    Modules linked in: br_netfilter(+) dvb_usb_dibusb_mc_common dib3000mc dibx000_common dvb_usb_dibusb_common dvb_usb_dw2102 dvb_usb classmate_laptop palmas_regulator cn videobuf2_v4l2 v4l2_common snd_soc_bd28623 mptbase snd_usb_usx2y snd_usbmidi_lib snd_rawmidi wmi libnvdimm lockd sunrpc grace rc_kworld_pc150u rc_core rtc_da9063 sha1_ssse3 i2c_cros_ec_tunnel adxl34x_spi adxl34x nfnetlink lib80211 i5500_temp dvb_as102 dvb_core videobuf2_common videodev media videobuf2_vmalloc videobuf2_memops udc_core lnbp22 leds_lp3952 hid_roccat_ryos s1d13xxxfb mtd vport_geneve openvswitch nf_conncount nf_nat_ipv6 nsh geneve udp_tunnel ip6_udp_tunnel snd_soc_mt6351 sis_agp phylink snd_soc_adau1761_spi snd_soc_adau1761 snd_soc_adau17x1 snd_soc_core snd_pcm_dmaengine ac97_bus snd_compress snd_soc_adau_utils snd_soc_sigmadsp_regmap snd_soc_sigmadsp raid_class hid_roccat_konepure hid_roccat_common hid_roccat c2port_duramar2150 core mdio_bcm_unimac iptable_security iptable_raw iptable_mangle
    iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel hsr veth netdevsim devlink vxcan batman_adv cfg80211 rfkill chnl_net caif nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel joydev mousedev ide_pci_generic piix aesni_intel aes_x86_64 ide_core crypto_simd atkbd cryptd glue_helper serio_raw ata_generic pata_acpi i2c_piix4 floppy sch_fq_codel ip_tables x_tables ipv6 [last unloaded: lm73]
    Dumping ftrace buffer:
    (ftrace buffer empty)
    ---[ end trace 770020de38961fd0 ]---

    A new dir entry can be created in get_subdir and its 'header->parent' is
    set to NULL. It is set to 'dir' only after insert_header succeeds;
    otherwise 'header->parent' stays NULL and drop_sysctl_table is called.
    However, the error handling path of get_subdir also calls
    drop_sysctl_table on 'new->header' regardless of the value of its parent
    pointer. put_links is then called, which triggers a NULL-ptr deref when
    it accesses members of header->parent.

    In fact we have multiple error paths which call drop_sysctl_table()
    there: upon failure in insert_links() we also call drop_sysctl_table(),
    and even in the successful case of __register_sysctl_table() we still
    always call drop_sysctl_table(). This patch fixes it.

    Link: http://lkml.kernel.org/r/20190314085527.13244-1-yuehaibing@huawei.com
    Fixes: 0e47c99d7fe25 ("sysctl: Replace root_list with links between sysctl_table_sets")
    Signed-off-by: YueHaibing
    Reported-by: Hulk Robot
    Acked-by: Luis Chamberlain
    Cc: Kees Cook
    Cc: Alexey Dobriyan
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Cc: Al Viro
    Cc: Eric W. Biederman
    Cc: [3.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    YueHaibing
     

14 Mar, 2019

1 commit

  • [ Upstream commit 1fde6f21d90f8ba5da3cb9c54ca991ed72696c43 ]

    /proc entries under /proc/net/* can't be cached into dcache because
    setns(2) can change current net namespace.
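    As I read the upstream commit, the mechanism is a dummy ->d_revalidate
    hook that always fails, so every path walk re-resolves /proc/net/*
    against the current net namespace. A kernel-side sketch (fragment, not
    standalone-buildable):

```c
/* Returning 0 marks the dentry invalid on every lookup, forcing a
 * fresh resolution against the opener's current net namespace. */
static int proc_net_d_revalidate(struct dentry *dentry, unsigned int flags)
{
	return 0;
}

const struct dentry_operations proc_net_dentry_ops = {
	.d_revalidate	= proc_net_d_revalidate,
};
```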

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: avoid vim miscolorization]
    [adobriyan@gmail.com: write test, add dummy ->d_revalidate hook: necessary if /proc/net/* is pinned at setns time]
    Link: http://lkml.kernel.org/r/20190108192350.GA12034@avx2
    Link: http://lkml.kernel.org/r/20190107162336.GA9239@avx2
    Fixes: 1da4d377f943fe4194ffb9fb9c26cc58fad4dd24 ("proc: revalidate misc dentries")
    Signed-off-by: Alexey Dobriyan
    Reported-by: Mateusz Stępień
    Reported-by: Ahmad Fatoum
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Alexey Dobriyan
     

27 Feb, 2019

1 commit

  • commit b2b469939e93458753cfbf8282ad52636495965e upstream.

    Tetsuo has reported that creating thousands of processes sharing an MM
    without SIGHAND (aka alien threads) and setting
    /proc/<pid>/oom_score_adj will swamp the kernel log and take ages [1]
    to finish. It is especially worrisome that all that printing is done
    under an RCU lock, which can potentially trigger the RCU stall or
    softlockup detector.

    The primary reason for the printk was to catch potential users who might
    depend on the behavior prior to 44a70adec910 ("mm, oom_adj: make sure
    processes sharing mm have same view of oom_score_adj") but after more
    than 2 years without a single report I guess it is safe to simply remove
    the printk altogether.

    The next step should be moving oom_score_adj over to the mm struct and
    removing all the task crawling, as suggested in [2].

    [1] http://lkml.kernel.org/r/97fce864-6f75-bca5-14bc-12c9f890e740@i-love.sakura.ne.jp
    [2] http://lkml.kernel.org/r/20190117155159.GA4087@dhcp22.suse.cz

    Link: http://lkml.kernel.org/r/20190212102129.26288-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Acked-by: Johannes Weiner
    Cc: David Rientjes
    Cc: Yong-Taek Lee
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

20 Feb, 2019

1 commit

  • commit 27dd768ed8db48beefc4d9e006c58e7a00342bde upstream.

    The 'pss_locked' field of smaps_rollup was being calculated incorrectly.
    It accumulated the current pss total every time a locked VMA was found.
    Fix that by adding to 'pss_locked' at the same time as 'pss', if the vma
    being walked is locked.

    Link: http://lkml.kernel.org/r/20190203065425.14650-1-sspatil@android.com
    Fixes: 493b0e9d945f ("mm: add /proc/pid/smaps_rollup")
    Signed-off-by: Sandeep Patil
    Acked-by: Vlastimil Babka
    Reviewed-by: Joel Fernandes (Google)
    Cc: Alexey Dobriyan
    Cc: Daniel Colascione
    Cc: [4.14.x, 4.19.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Sandeep Patil
     

29 Dec, 2018

1 commit

  • commit ea5751ccd665a2fd1b24f9af81f6167f0718c5f6 upstream.

    proc_sys_lookup can fail with ENOMEM instead of ENOENT when the
    corresponding sysctl table is being unregistered. In our case we see
    this upon opening /proc/sys/net/*/conf files while network interfaces
    are being deleted, which confuses our configuration daemon.

    The problem was successfully reproduced and this fix tested on v4.9.122
    and v4.20-rc6.

    v2: return ERR_PTRs in all cases when proc_sys_make_inode fails instead
    of mixing them with NULL. Thanks Al Viro for the feedback.

    Fixes: ace0c791e6c3 ("proc/sysctl: Don't grab i_lock under sysctl_lock.")
    Cc: stable@vger.kernel.org
    Signed-off-by: Ivan Delalande
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Ivan Delalande
     

14 Nov, 2018

1 commit

  • commit fa76da461bb0be13c8339d984dcf179151167c8f upstream.

    Leonardo reports an apparent regression in 4.19-rc7:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000000f0
    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 3 PID: 6032 Comm: python Not tainted 4.19.0-041900rc7-lowlatency #201810071631
    Hardware name: LENOVO 80UG/Toronto 4A2, BIOS 0XCN45WW 08/09/2018
    RIP: 0010:smaps_pte_range+0x32d/0x540
    Code: 80 00 00 00 00 74 a9 48 89 de 41 f6 40 52 40 0f 85 04 02 00 00 49 2b 30 48 c1 ee 0c 49 03 b0 98 00 00 00 49 8b 80 a0 00 00 00 8b b8 f0 00 00 00 e8 b7 ef ec ff 48 85 c0 0f 84 71 ff ff ff a8
    RSP: 0018:ffffb0cbc484fb88 EFLAGS: 00010202
    RAX: 0000000000000000 RBX: 0000560ddb9e9000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000560ddb9e9 RDI: 0000000000000001
    RBP: ffffb0cbc484fbc0 R08: ffff94a5a227a578 R09: ffff94a5a227a578
    R10: 0000000000000000 R11: 0000560ddbbe7000 R12: ffffe903098ba728
    R13: ffffb0cbc484fc78 R14: ffffb0cbc484fcf8 R15: ffff94a5a2e9cf48
    FS: 00007f6dfb683740(0000) GS:ffff94a5aaf80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000000000f0 CR3: 000000011c118001 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    __walk_page_range+0x3c2/0x6f0
    walk_page_vma+0x42/0x60
    smap_gather_stats+0x79/0xe0
    ? gather_pte_stats+0x320/0x320
    ? gather_hugetlb_stats+0x70/0x70
    show_smaps_rollup+0xcd/0x1c0
    seq_read+0x157/0x400
    __vfs_read+0x3a/0x180
    ? security_file_permission+0x93/0xc0
    ? security_file_permission+0x93/0xc0
    vfs_read+0x8f/0x140
    ksys_read+0x55/0xc0
    __x64_sys_read+0x1a/0x20
    do_syscall_64+0x5a/0x110
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Decoded code matched to local compilation+disassembly points to
    smaps_pte_entry():

    } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
                        && pte_none(*pte))) {
            page = find_get_entry(vma->vm_file->f_mapping,
                                  linear_page_index(vma, addr));

    Here, vma->vm_file is NULL. mss->check_shmem_swap should be false in that
    case, however for smaps_rollup, smap_gather_stats() can set the flag true
    for one vma and leave it true for subsequent vma's where it should be
    false.

    To fix, reset the check_shmem_swap flag to false. There's also a related
    bug which sets mss->swap to shmem_swapped, which in the context of
    smaps_rollup overwrites any value accumulated from previous vma's. Fix
    that as well.

    Note that the report suggests a regression between 4.17.19 and 4.19-rc7,
    which makes the 4.19 series ending with commit 258f669e7e88 ("mm:
    /proc/pid/smaps_rollup: convert to single value seq_file") suspicious.
    But the mss was reused for rollup since 493b0e9d945f ("mm: add
    /proc/pid/smaps_rollup") so let's play it safe with the stable backport.

    Link: http://lkml.kernel.org/r/555fbd1f-4ac9-0b58-dcd4-5dc4380ff7ca@suse.cz
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=201377
    Fixes: 493b0e9d945f ("mm: add /proc/pid/smaps_rollup")
    Signed-off-by: Vlastimil Babka
    Reported-by: Leonardo Soares Müller
    Tested-by: Leonardo Soares Müller
    Cc: Greg Kroah-Hartman
    Cc: Daniel Colascione
    Cc: Alexey Dobriyan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

06 Oct, 2018

1 commit

  • Currently, you can use /proc/self/task/*/stack to cause a stack walk on
    a task you control while it is running on another CPU. That means that
    the stack can change under the stack walker. The stack walker does
    have guards against going completely off the rails and into random
    kernel memory, but it can interpret random data from your kernel stack
    as instruction pointers and stack pointers. This can cause exposure of
    kernel stack contents to userspace.

    Restrict the ability to inspect kernel stacks of arbitrary tasks to root
    in order to prevent a local attacker from exploiting racy stack unwinding
    to leak kernel task stack contents. See the added comment for a longer
    rationale.

    There don't seem to be any users of this userspace API that can't
    gracefully bail out if reading from the file fails. Therefore, I believe
    that this change is unlikely to break things. In the case that this patch
    does end up needing a revert, the next-best solution might be to fake a
    single-entry stack based on wchan.

    Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com
    Fixes: 2ec220e27f50 ("proc: add /proc/*/stack")
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Alexey Dobriyan
    Cc: Ken Chen
    Cc: Will Deacon
    Cc: Laura Abbott
    Cc: Andy Lutomirski
    Cc: Catalin Marinas
    Cc: Josh Poimboeuf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H . Peter Anvin"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

21 Sep, 2018

1 commit

  • The 'm' kcore_list item could point to kclist_head, and it is incorrect to
    look at m->addr / m->size in this case.

    There is no choice but to run through the list of entries for every
    address if we did not find any entry in the previous iteration.
    Reset 'm' to NULL in that case at Omar Sandoval's suggestion.

    [akpm@linux-foundation.org: add comment]
    Link: http://lkml.kernel.org/r/1536100702-28706-1-git-send-email-asmadeus@codewreck.org
    Fixes: bf991c2231117 ("proc/kcore: optimize multiple page reads")
    Signed-off-by: Dominique Martinet
    Reviewed-by: Andrew Morton
    Cc: Omar Sandoval
    Cc: Alexey Dobriyan
    Cc: Eric Biederman
    Cc: James Morse
    Cc: Bhupesh Sharma
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Dominique Martinet
     

27 Aug, 2018

1 commit

  • Pull perf updates from Thomas Gleixner:
    "Kernel:
    - Improve kallsyms coverage
    - Add x86 entry trampolines to kcore
    - Fix ARM SPE handling
    - Correct PPC event post processing

    Tools:
    - Make the build system more robust
    - Small fixes and enhancements all over the place
    - Update kernel ABI header copies
    - Preparatory work for converting libtraceevent to a shared library
    - License cleanups"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
    tools arch: Update arch/x86/lib/memcpy_64.S copy used in 'perf bench mem memcpy'
    tools arch x86: Update tools's copy of cpufeatures.h
    perf python: Fix pyrf_evlist__read_on_cpu() interface
    perf mmap: Store real cpu number in 'struct perf_mmap'
    perf tools: Remove ext from struct kmod_path
    perf tools: Add gzip_is_compressed function
    perf tools: Add lzma_is_compressed function
    perf tools: Add is_compressed callback to compressions array
    perf tools: Move the temp file processing into decompress_kmodule
    perf tools: Use compression id in decompress_kmodule()
    perf tools: Store compression id into struct dso
    perf tools: Add compression id into 'struct kmod_path'
    perf tools: Make is_supported_compression() static
    perf tools: Make decompress_to_file() function static
    perf tools: Get rid of dso__needs_decompress() call in __open_dso()
    perf tools: Get rid of dso__needs_decompress() call in symbol__disassemble()
    perf tools: Get rid of dso__needs_decompress() call in read_object_code()
    tools lib traceevent: Change to SPDX License format
    perf llvm: Allow passing options to llc in addition to clang
    perf parser: Improve error message for PMU address filters
    ...

    Linus Torvalds
     

24 Aug, 2018

1 commit

  • Without CONFIG_MMU, we get a build warning:

    fs/proc/vmcore.c:228:12: error: 'vmcoredd_mmap_dumps' defined but not used [-Werror=unused-function]
    static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,

    The function is only referenced from an #ifdef'ed caller, so
    this uses the same #ifdef around it.

    Link: http://lkml.kernel.org/r/20180525213526.2117790-1-arnd@arndb.de
    Fixes: 7efe48df8a3d ("vmcore: append device dumps to vmcore as elf notes")
    Signed-off-by: Arnd Bergmann
    Cc: Ganesh Goudar
    Cc: "David S. Miller"
    Cc: Rahul Lakkireddy
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

23 Aug, 2018

23 commits

  • The vmcoreinfo information is useful for runtime debugging tools, not just
    for crash dumps. A lot of this information can be determined by other
    means, but this is much more convenient, and it only adds a page at most
    to the file.

    Link: http://lkml.kernel.org/r/fddbcd08eed76344863303878b12de1c1e2a04b6.1531953780.git.osandov@fb.com
    Signed-off-by: Omar Sandoval
    Cc: Alexey Dobriyan
    Cc: Bhupesh Sharma
    Cc: Eric Biederman
    Cc: James Morse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     
  • The current code does a full search of the segment list every time for
    every page. This is wasteful, since it's almost certain that the next
    page will be in the same segment. Instead, check if the previous segment
    covers the current page before doing the list search.

    Link: http://lkml.kernel.org/r/fd346c11090cf93d867e01b8d73a6567c5ac6361.1531953780.git.osandov@fb.com
    Signed-off-by: Omar Sandoval
    Cc: Alexey Dobriyan
    Cc: Bhupesh Sharma
    Cc: Eric Biederman
    Cc: James Morse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     
  • Currently, the ELF file header, program headers, and note segment are
    allocated all at once, in some icky code dating back to 2.3. Programs
    tend to read the file header, then the program headers, then the note
    segment, all separately, so this is a waste of effort. It's cleaner and
    more efficient to handle the three separately.

    Link: http://lkml.kernel.org/r/19c92cbad0e11f6103ff3274b2e7a7e51a1eb74b.1531953780.git.osandov@fb.com
    Signed-off-by: Omar Sandoval
    Cc: Alexey Dobriyan
    Cc: Bhupesh Sharma
    Cc: Eric Biederman
    Cc: James Morse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     
  • Now that we're using an rwsem, we can hold it during the entirety of
    read_kcore() and have a common return path. This is preparation for the
    next change.

    [akpm@linux-foundation.org: fix locking bug reported by Tetsuo Handa]
    Link: http://lkml.kernel.org/r/d7cfbc1e8a76616f3b699eaff9df0a2730380534.1531953780.git.osandov@fb.com
    Signed-off-by: Omar Sandoval
    Cc: Alexey Dobriyan
    Cc: Bhupesh Sharma
    Cc: Eric Biederman
    Cc: James Morse
    Cc: Tetsuo Handa
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     
  • There's a theoretical race condition that will cause /proc/kcore to miss
    a memory hotplug event:

    CPU0                              CPU1
    // hotplug event 1
    kcore_need_update = 1

    open_kcore()                      open_kcore()
        kcore_update_ram()                kcore_update_ram()
            // Walk RAM                       // Walk RAM
            __kcore_update_ram()              __kcore_update_ram()
                kcore_need_update = 0

    // hotplug event 2
    kcore_need_update = 1
                                      kcore_need_update = 0

    Note that CPU1 set up the RAM kcore entries with the state after hotplug
    event 1 but cleared the flag for hotplug event 2. The RAM entries will
    therefore be stale until there is another hotplug event.

    This is an extremely unlikely sequence of events, but the fix makes the
    synchronization saner, anyways: we serialize the entire update sequence,
    which means that whoever clears the flag will always succeed in replacing
    the kcore list.

    Link: http://lkml.kernel.org/r/6106c509998779730c12400c1b996425df7d7089.1531953780.git.osandov@fb.com
    Signed-off-by: Omar Sandoval
    Cc: Alexey Dobriyan
    Cc: Bhupesh Sharma
    Cc: Eric Biederman
    Cc: James Morse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     
  • Now we only need kclist_lock from user context and at fs init time, and
    the following changes need to sleep while holding the kclist_lock.

    Link: http://lkml.kernel.org/r/521ba449ebe921d905177410fee9222d07882f0d.1531953780.git.osandov@fb.com
    Signed-off-by: Omar Sandoval
    Reviewed-by: Andrew Morton
    Cc: Alexey Dobriyan
    Cc: Bhupesh Sharma
    Cc: Eric Biederman
    Cc: James Morse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     
  • The memory hotplug notifier kcore_callback() only needs kclist_lock to
    prevent races with __kcore_update_ram(), but we can easily eliminate that
    race by using an atomic xchg() in __kcore_update_ram(). This is
    preparation for converting kclist_lock to an rwsem.

    Link: http://lkml.kernel.org/r/0a4bc89f4dbde8b5b2ea309f7b4fb6a85fe29df2.1531953780.git.osandov@fb.com
    Signed-off-by: Omar Sandoval
    Reviewed-by: Andrew Morton
    Cc: Alexey Dobriyan
    Cc: Bhupesh Sharma
    Cc: Eric Biederman
    Cc: James Morse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     
  • Patch series "/proc/kcore improvements", v4.

    This series makes a few improvements to /proc/kcore. It fixes a couple of
    small issues in v3 but is otherwise the same. Patches 1, 2, and 3 are
    prep patches. Patch 4 is a fix/cleanup. Patch 5 is another prep patch.
    Patches 6 and 7 are optimizations to ->read(). Patch 8 makes it possible
    to enable CRASH_CORE on any architecture, which is needed for patch 9.
    Patch 9 adds vmcoreinfo to /proc/kcore.

    This patch (of 9):

    kclist_add() is only called at init time, so there's no point in grabbing
    any locks. We're also going to replace the rwlock with a rwsem, which we
    don't want to try grabbing during early boot.

    While we're here, mark kclist_add() with __init so that we'll get a
    warning if it's called from non-init code.

    Link: http://lkml.kernel.org/r/98208db1faf167aa8b08eebfa968d95c70527739.1531953780.git.osandov@fb.com
    Signed-off-by: Omar Sandoval
    Reviewed-by: Andrew Morton
    Reviewed-by: Bhupesh Sharma
    Tested-by: Bhupesh Sharma
    Cc: Alexey Dobriyan
    Cc: Bhupesh Sharma
    Cc: Eric Biederman
    Cc: James Morse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     
  • elf_kcore_store_hdr() uses __pa() to find the physical address of
    KCORE_RAM or KCORE_TEXT entries exported as program headers.

    This trips CONFIG_DEBUG_VIRTUAL's checks, as the KCORE_TEXT entries are
    not in the linear map.

    Handle these two cases separately, using __pa_symbol() for the KCORE_TEXT
    entries.

    Link: http://lkml.kernel.org/r/20180711131944.15252-1-james.morse@arm.com
    Signed-off-by: James Morse
    Cc: Alexey Dobriyan
    Cc: Omar Sandoval
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morse
     
  • Use new return type vm_fault_t for fault handler in struct
    vm_operations_struct. For now, this is just documenting that the function
    returns a VM_FAULT value rather than an errno. Once all instances are
    converted, vm_fault_t will become a distinct type.

    See 1c8f422059ae ("mm: change return type to vm_fault_t") for reference.

    Link: http://lkml.kernel.org/r/20180702153325.GA3875@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Cc: Ganesh Goudar
    Cc: Rahul Lakkireddy
    Cc: David S. Miller
    Cc: Alexey Dobriyan
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
  • Number of CPUs is never high enough to force 64-bit arithmetic.
    Save a couple of bytes on x86_64.

    Link: http://lkml.kernel.org/r/20180627200710.GC18434@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Link: http://lkml.kernel.org/r/20180627200614.GB18434@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • ->latency_record is defined as

    struct latency_record[LT_SAVECOUNT];

    so use the same macro while iterating.

    Link: http://lkml.kernel.org/r/20180627200534.GA18434@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
    The code checks whether the write is done by current to its own
    attributes. For that case the get/put pair is unnecessary, as the check
    can be done under RCU.

    Note: rcu_read_unlock() can be done even earlier since pointer to a task
    is not dereferenced. It depends if /proc code should look scary or not:

    rcu_read_lock();
    task = pid_task(...);
    rcu_read_unlock();
    if (!task)
            return -ESRCH;
    if (task != current)
            return -EACCES;

    P.S.: rename "length" variable. Code like this

    length = -EINVAL;

    should not exist.

    Link: http://lkml.kernel.org/r/20180627200218.GF18113@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Link: http://lkml.kernel.org/r/20180627195427.GE18113@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
    Readdir context is thread local, so ->pos is thread local too; move it
    out of the read lock.

    Link: http://lkml.kernel.org/r/20180627195339.GD18113@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • get_monotonic_boottime() is deprecated and uses the old timespec type.
    Let's convert /proc/uptime to use ktime_get_boottime_ts64().

    Link: http://lkml.kernel.org/r/20180620081746.282742-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Acked-by: Thomas Gleixner
    Cc: Al Viro
    Cc: Deepa Dinamani
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
    24074a35c5c975 ("proc: Make inline name size calculation automatic")
    started to put PDE allocations into the kmalloc-256 cache, which is
    unnecessary as ~40 character names are very rare.

    Put the allocation back into the kmalloc-192 cache for 64-bit non-debug
    builds.

    Add a BUILD_BUG_ON so we know when the PDE size has gotten out of
    control.

    [adobriyan@gmail.com: fix BUILD_BUG_ON breakage on powerpc64]
    Link: http://lkml.kernel.org/r/20180703191602.GA25521@avx2
    Link: http://lkml.kernel.org/r/20180617215732.GA24688@avx2
    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: Al Viro
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Currently, percpu memory only exposes allocation and utilization
    information via debugfs. This more or less is only really useful for
    understanding the fragmentation and allocation information at a per-chunk
    level with a few global counters. This is also gated behind a config.
    BPF and cgroup, for example, have seen an increase in use causing
    increased use of percpu memory. Let's make it easier for someone to
    identify how much memory is being used.

    This patch adds the "Percpu" stat to meminfo to more easily look up how
    much percpu memory is in use. This number includes the cost of all
    allocated backing pages, not just insight at the per-unit, per-chunk
    level. Metadata is excluded. I think excluding metadata is fair because
    the backing memory scales with the number of CPUs and can quickly
    outweigh the metadata. It also keeps the calculation light.

    Link: http://lkml.kernel.org/r/20180807184723.74919-1-dennisszhou@gmail.com
    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Acked-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dennis Zhou (Facebook)
     
    The /proc/pid/smaps_rollup file is currently implemented via the
    m_start/m_next/m_stop seq_file iterators shared with the other maps
    files, which iterate over vma's. However, the rollup file doesn't print
    anything for each vma; it only accumulates the stats.

    There are some issues with the current code as reported in [1] - the
    accumulated stats can get skewed if seq_file start()/stop() op is called
    multiple times, if show() is called multiple times, and after seeks to
    non-zero position.

    Patch [1] fixed those within existing design, but I believe it is
    fundamentally wrong to expose the vma iterators to the seq_file mechanism
    when smaps_rollup shows logically a single set of values for the whole
    address space.

    This patch thus refactors the code to provide a single "value" at offset
    0, with vma iteration to gather the stats done internally. This fixes the
    situations where results are skewed, and simplifies the code, especially
    in show_smap(), at the expense of somewhat less code reuse.
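    Conceptually, the refactored rollup behaves like a single-record
    seq_file: only offset 0 yields a record, and the vma walk happens
    inside show(). A hedged kernel-style sketch (simplified, not the
    literal upstream code):

```c
static void *smaps_rollup_start(struct seq_file *m, loff_t *ppos)
{
	/* a single logical record lives at offset 0 */
	return *ppos == 0 ? SEQ_START_TOKEN : NULL;
}

static void *smaps_rollup_next(struct seq_file *m, void *v, loff_t *ppos)
{
	++*ppos;
	return NULL;		/* nothing after the single record */
}

static int smaps_rollup_show(struct seq_file *m, void *v)
{
	/* walk every vma here, accumulating into one mem_size_stats,
	 * then print the totals once */
	return 0;
}
```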

    [1] https://marc.info/?l=linux-mm&m=151927723128134&w=2

    [vbabka@suse.c: use seq_file infrastructure]
    Link: http://lkml.kernel.org/r/bf4525b0-fd5b-4c4c-2cb3-adee3dd95a48@suse.cz
    Link: http://lkml.kernel.org/r/20180723111933.15443-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reported-by: Daniel Colascione
    Reviewed-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • To prepare for handling /proc/pid/smaps_rollup differently from
    /proc/pid/smaps factor out from show_smap() printing the parts of output
    that are common for both variants, which is the bulk of the gathered
    memory stats.

    [vbabka@suse.cz: add const, per Alexey]
    Link: http://lkml.kernel.org/r/b45f319f-cd04-337b-37f8-77f99786aa8a@suse.cz
    Link: http://lkml.kernel.org/r/20180723111933.15443-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Alexey Dobriyan
    Cc: Daniel Colascione
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • To prepare for handling /proc/pid/smaps_rollup differently from
    /proc/pid/smaps, factor out the vma memory stats gathering from
    show_smap(); it will be used by both.

    Link: http://lkml.kernel.org/r/20180723111933.15443-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Alexey Dobriyan
    Cc: Daniel Colascione
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "cleanups and refactor of /proc/pid/smaps*".

    The recent regression in /proc/pid/smaps made me look more into the code,
    especially the issues with smaps_rollup reported in [1], which Patch 4
    fixes by refactoring the code. Patches 2 and 3 are preparations for that.
    Patch 1 is me realizing that there's a lot of boilerplate left from times
    when we tried (unsuccessfully) to mark thread stacks in the output.

    Originally I had also plans to rework the translation from
    /proc/pid/*maps* file offsets to the internal structures. Now the offset
    means "vma number", which is not really stable (vma's can come and go
    between read() calls) and there's an extra caching of last vma's address.
    My idea was that offsets would be interpreted directly as addresses, which
    would also allow meaningful seeks (see the ugly seek_to_smaps_entry() in
    tools/testing/selftests/vm/mlock2.h). However, loff_t is (signed) long
    long, so that might be insufficient somewhere for the unsigned long
    addresses.

    So the result is fixed issues with skewed /proc/pid/smaps_rollup results,
    simpler smaps code, and a lot of unused code removed.

    [1] https://marc.info/?l=linux-mm&m=151927723128134&w=2

    This patch (of 4):

    Commit b76437579d13 ("procfs: mark thread stack correctly in
    proc/<pid>/maps") introduced differences between /proc/PID/maps and
    /proc/PID/task/TID/maps to mark thread stacks properly, and this was
    also done for smaps and numa_maps. However, it didn't work properly and
    was ultimately removed by commit b18cb64ead40 ("fs/proc: Stop trying to
    report thread stacks").

    Now the is_pid parameter for the related show_*() functions is unused
    and we can remove it together with wrapper functions and ops structures
    that differ for PID and TID cases only in this parameter.

    Link: http://lkml.kernel.org/r/20180723111933.15443-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Alexey Dobriyan
    Cc: Daniel Colascione
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

15 Aug, 2018

1 commit

  • Without program headers for PTI entry trampoline pages, the trampoline
    virtual addresses do not map to anything.

    Example before:

    sudo gdb --quiet vmlinux /proc/kcore
    Reading symbols from vmlinux...done.
    [New process 1]
    Core was generated by `BOOT_IMAGE=/boot/vmlinuz-4.16.0 root=UUID=a6096b83-b763-4101-807e-f33daff63233'.
    #0 0x0000000000000000 in irq_stack_union ()
    (gdb) x /21ib 0xfffffe0000006000
    0xfffffe0000006000: Cannot access memory at address 0xfffffe0000006000
    (gdb) quit

    After:

    sudo gdb --quiet vmlinux /proc/kcore
    [sudo] password for ahunter:
    Reading symbols from vmlinux...done.
    [New process 1]
    Core was generated by `BOOT_IMAGE=/boot/vmlinuz-4.16.0-fix-4-00005-gd6e65a8b4072 root=UUID=a6096b83-b7'.
    #0 0x0000000000000000 in irq_stack_union ()
    (gdb) x /21ib 0xfffffe0000006000
    0xfffffe0000006000: swapgs
    0xfffffe0000006003: mov %rsp,-0x3e12(%rip) # 0xfffffe00000021f8
    0xfffffe000000600a: xchg %ax,%ax
    0xfffffe000000600c: mov %cr3,%rsp
    0xfffffe000000600f: bts $0x3f,%rsp
    0xfffffe0000006014: and $0xffffffffffffe7ff,%rsp
    0xfffffe000000601b: mov %rsp,%cr3
    0xfffffe000000601e: mov -0x3019(%rip),%rsp # 0xfffffe000000300c
    0xfffffe0000006025: pushq $0x2b
    0xfffffe0000006027: pushq -0x3e35(%rip) # 0xfffffe00000021f8
    0xfffffe000000602d: push %r11
    0xfffffe000000602f: pushq $0x33
    0xfffffe0000006031: push %rcx
    0xfffffe0000006032: push %rdi
    0xfffffe0000006033: mov $0xffffffff91a00010,%rdi
    0xfffffe000000603a: callq 0xfffffe0000006046
    0xfffffe000000603f: pause
    0xfffffe0000006041: lfence
    0xfffffe0000006044: jmp 0xfffffe000000603f
    0xfffffe0000006046: mov %rdi,(%rsp)
    0xfffffe000000604a: retq
    (gdb) quit

    In addition, entry trampolines all map to the same page. Represent that
    by giving the corresponding program headers in kcore the same offset.

    This has the benefit that, when perf tools uses /proc/kcore as a source
    for kernel object code, samples from different CPU trampolines are
    aggregated together. Note that such aggregation is normal for profiling,
    i.e. people want to profile the object code, not every different virtual
    address the object code might be mapped to (across different processes,
    for example).

    Notes by PeterZ:

    This also adds the KCORE_REMAP functionality.

    Signed-off-by: Adrian Hunter
    Acked-by: Andi Kleen
    Acked-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Joerg Roedel
    Cc: Thomas Gleixner
    Cc: x86@kernel.org
    Link: http://lkml.kernel.org/r/1528289651-4113-4-git-send-email-adrian.hunter@intel.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Adrian Hunter
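
The "same page, same offset" idea above can be sketched with a minimal address-to-offset translation over program-header-like records: two per-CPU trampoline mappings at different virtual addresses share one file offset, so samples from either resolve to the same object code. The addresses and offsets below are made-up illustrative values, not taken from a real kcore.

```c
#include <assert.h>

/* Minimal sketch of the KCORE_REMAP idea: two program headers cover
 * per-CPU entry trampolines at different virtual addresses, but both
 * carry the same file offset because the pages map the same code. */
struct phdr { unsigned long vaddr, offset, filesz; };

static const struct phdr trampolines[2] = {
    { 0xfffffe0000006000UL, 0x1000UL, 0x1000UL },  /* CPU 0 */
    { 0xfffffe0000106000UL, 0x1000UL, 0x1000UL },  /* CPU 1: same offset */
};

/* Translate a virtual address to its file offset via the headers,
 * returning (unsigned long)-1 when nothing maps it. */
static unsigned long vaddr_to_offset(unsigned long vaddr)
{
    for (int i = 0; i < 2; i++) {
        const struct phdr *p = &trampolines[i];
        if (vaddr >= p->vaddr && vaddr < p->vaddr + p->filesz)
            return p->offset + (vaddr - p->vaddr);
    }
    return (unsigned long)-1;
}
```

Because both headers translate into the same offset range, a profiler keying samples by file offset aggregates hits from every CPU's trampoline.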
     

15 Jul, 2018

1 commit

  • Thomas reports:
    "While looking around in /proc on my v4.14.52 system I noticed that all
    processes got a lot of "Locked" memory in /proc/*/smaps. A lot more
    memory than a regular user can usually lock with mlock().

    Commit 493b0e9d945f (in v4.14-rc1) seems to have changed the behavior
    of "Locked".

    Before that commit the code was like this. Notice the VM_LOCKED check.

    (vma->vm_flags & VM_LOCKED) ?
    (unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);

    After that commit Locked is now the same as Pss:

    (unsigned long)(mss->pss >> (10 + PSS_SHIFT)));

    This looks like a mistake."

    Indeed, the commit has added mss->pss_locked with the correct value that
    depends on VM_LOCKED, but forgot to actually use it. Fix it.

    Link: http://lkml.kernel.org/r/ebf6c7fb-fec3-6a26-544f-710ed193c154@suse.cz
    Fixes: 493b0e9d945f ("mm: add /proc/pid/smaps_rollup")
    Signed-off-by: Vlastimil Babka
    Reported-by: Thomas Lindroth
    Cc: Alexey Dobriyan
    Cc: Daniel Colascione
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
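
The fix keeps the accumulated pss_locked separate from pss and prints the former, restoring the VM_LOCKED dependence. A user-space analogue of the arithmetic is sketched below; PSS_SHIFT and the VM_LOCKED value mirror the kernel's definitions, while `smaps_account` and `locked_kb` are simplified illustrative helpers, not the real fs/proc functions.

```c
#include <assert.h>

#define VM_LOCKED  0x2000UL   /* matches the kernel's vm_flags bit */
#define PSS_SHIFT  12         /* pss is kept in fixed point */

struct mem_size_stats { unsigned long long pss, pss_locked; };

/* Accumulate one vma's pss; only VM_LOCKED vmas feed pss_locked,
 * which commit 493b0e9d945f computed correctly but forgot to print. */
static void smaps_account(struct mem_size_stats *mss,
                          unsigned long vm_flags,
                          unsigned long long pss)
{
    mss->pss += pss;
    if (vm_flags & VM_LOCKED)
        mss->pss_locked += pss;
}

/* "Locked:" line value in kB: derived from pss_locked, not pss. */
static unsigned long locked_kb(const struct mem_size_stats *mss)
{
    return (unsigned long)(mss->pss_locked >> (10 + PSS_SHIFT));
}
```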
     

02 Jul, 2018

1 commit


28 Jun, 2018

1 commit

  • kmemleak reported some memory leaks on reading proc files. After adding
    some debug lines, I found that proc_seq_fops was using seq_release as its
    release handler, which won't free the 'private' field of the seq_file,
    while in fact the open handler proc_seq_open can create that private data
    with __seq_open_private when state_size is greater than zero. So after
    reading files created with proc_create_seq_private, such as
    /proc/timer_list and /proc/vmallocinfo, the private memory of a seq_file
    is not freed. Fix it by making the paired proc_seq_release the default
    release handler of proc_seq_ops instead of seq_release.

    Fixes: 44414d82cfe0 ("proc: introduce proc_create_seq_private")
    Reviewed-by: Christoph Hellwig
    CC: Christoph Hellwig
    Signed-off-by: Chunyu Hu
    Signed-off-by: Al Viro

    Chunyu Hu
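
The open/release pairing the fix restores can be modeled in user space: when open allocates extra private state, the matching release must free both the iterator and that state. The `sim_*` names and the `alive_private` counter are illustrative scaffolding to make the leak visible, not kernel APIs.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts outstanding private allocations, to expose the leak. */
static int alive_private;

struct seq_file_sim { void *private; };

/* Analogue of proc_seq_open: allocate private state only when
 * state_size > 0 (the __seq_open_private path). */
static struct seq_file_sim *sim_seq_open(size_t state_size)
{
    struct seq_file_sim *m = calloc(1, sizeof(*m));
    if (state_size) {
        m->private = calloc(1, state_size);
        alive_private++;
    }
    return m;
}

/* Buggy release (plain seq_release): frees the seq_file only,
 * so m->private leaks. */
static void sim_seq_release(struct seq_file_sim *m)
{
    free(m);
}

/* Fixed release (proc_seq_release analogue): free private first. */
static void sim_seq_release_private(struct seq_file_sim *m)
{
    if (m->private) {
        free(m->private);
        alive_private--;
    }
    free(m);
}
```

Running both paths once leaves exactly one outstanding allocation: the one dropped by the unpaired release.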
     

20 Jun, 2018

1 commit

  • The rewrite of the cmdline fetching missed the fact that we used to also
    return the final terminating NUL character of the last argument. I
    hadn't noticed, and none of the tools I tested cared, but something
    obviously must care, because Michal Kubecek noticed the change in
    behavior.

    Tweak the "find the end" logic to actually include the NUL character,
    and once past the end of argv, always start the strnlen() at the
    expected (original) argument end.

    This whole "allow people to rewrite their arguments in place" is a nasty
    hack and requires that odd slop handling at the end of the argv array,
    but it's our traditional model, so we continue to support it.

    Reported-and-bisected-by: Michal Kubecek
    Reviewed-and-tested-by: Michal Kubecek
    Cc: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Linus Torvalds
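
The terminating-NUL accounting described above can be sketched in user space: the argv strings sit back to back in one buffer, and the reported length must run past the last argument's NUL rather than stop before it. This is a simplified illustration of only that accounting; it does not model the kernel's handling of arguments rewritten in place or the end-of-argv slop.

```c
#include <assert.h>
#include <string.h>

/* Length of a packed argv buffer, including the final terminating
 * NUL of the last argument (the byte the rewrite initially dropped). */
static size_t cmdline_len(const char *buf, size_t argc)
{
    const char *p = buf;
    for (size_t i = 0; i < argc; i++)
        p += strlen(p) + 1;        /* step past each arg and its NUL */
    return (size_t)(p - buf);      /* end includes the last NUL */
}
```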
     

16 Jun, 2018

1 commit

  • Pull AFS updates from Al Viro:
    "Assorted AFS stuff - ended up in vfs.git since most of that consists
    of David's AFS-related followups to Christoph's procfs series"

    * 'afs-proc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    afs: Optimise callback breaking by not repeating volume lookup
    afs: Display manually added cells in dynamic root mount
    afs: Enable IPv6 DNS lookups
    afs: Show all of a server's addresses in /proc/fs/afs/servers
    afs: Handle CONFIG_PROC_FS=n
    proc: Make inline name size calculation automatic
    afs: Implement network namespacing
    afs: Mark afs_net::ws_cell as __rcu and set using rcu functions
    afs: Fix a Sparse warning in xdr_decode_AFSFetchStatus()
    proc: Add a way to make network proc files writable
    afs: Rearrange fs/afs/proc.c to remove remaining predeclarations.
    afs: Rearrange fs/afs/proc.c to move the show routines up
    afs: Rearrange fs/afs/proc.c by moving fops and open functions down
    afs: Move /proc management functions to the end of the file

    Linus Torvalds
     

15 Jun, 2018

1 commit

  • Make calculation of the size of the inline name in struct proc_dir_entry
    automatic, rather than having to manually encode the numbers and failing to
    allow for lockdep.

    Require a minimum inline name size of 33+1 to allow for names that look
    like two hex numbers with a dash between them.

    Reported-by: Al Viro
    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
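
The automatic sizing can be sketched as: round the entry up to the next slab-friendly slot and let whatever is left over hold the inline name, instead of hand-coding the buffer length. The `pde_core` layout and the slot sizes below are illustrative, not the kernel's actual struct proc_dir_entry; the 34-byte floor is the 33+1 minimum from the commit (two 16-digit hex numbers, a dash, and a NUL).

```c
#include <assert.h>

/* Stand-in for the non-name part of a proc_dir_entry. */
struct pde_core { void *parent; unsigned mode; char rest[70]; };

/* Pick the smallest slot that still leaves >= 34 bytes of slack. */
#define SIZEOF_PDE                                      \
    (sizeof(struct pde_core) <= 128 - 34 ? 128 :        \
     sizeof(struct pde_core) <= 192 - 34 ? 192 : 256)

/* The leftover bytes become the inline name buffer automatically;
 * growing pde_core (e.g. for lockdep) shrinks or re-slots it without
 * anyone re-encoding a magic number. */
#define SIZEOF_PDE_INLINE_NAME (SIZEOF_PDE - sizeof(struct pde_core))
```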