07 Dec, 2020

1 commit

  • 'format_corename()' will splite 'core_pattern' on spaces when it is in
    pipe mode, and take helper_argv[0] as the path to usermode executable.
    It works fine in most cases.

    However, if there is a space between '|' and '/file/path', such as
    '| /usr/lib/systemd/systemd-coredump %P %u %g', then helper_argv[0] will
    be parsed as '', and users will get a 'Core dump to | disabled'.

    It is not friendly to users, as the pattern above was valid previously.
    Fix this by ignoring the spaces between '|' and '/file/path'.

    Fixes: 315c69261dd3 ("coredump: split pipe command whitespace before expanding template")
    Signed-off-by: Menglong Dong
    Signed-off-by: Andrew Morton
    Cc: Paul Wise
    Cc: Jakub Wilk [https://bugs.debian.org/924398]
    Cc: Neil Horman
    Cc:
    Link: https://lkml.kernel.org/r/5fb62870.1c69fb81.8ef5d.af76@mx.google.com
    Signed-off-by: Linus Torvalds

    Menglong Dong
     

17 Oct, 2020

4 commits

  • In both binfmt_elf and binfmt_elf_fdpic, use a new helper
    dump_vma_snapshot() to take a snapshot of the VMA list (including the gate
    VMA, if we have one) while protected by the mmap_lock, and then use that
    snapshot instead of walking the VMA list without locking.

    An alternative approach would be to keep the mmap_lock held across the
    entire core dumping operation; however, keeping the mmap_lock locked while
    we may be blocked for an unbounded amount of time (e.g. because we're
    dumping to a FUSE filesystem or so) isn't really optimal; the mmap_lock
    blocks things like the ->release handler of userfaultfd, and we don't
    really want critical system daemons to grind to a halt just because
    someone "gifted" them SCM_RIGHTS to an eternally-locked userfaultfd, or
    something like that.

    Since both the normal ELF code and the FDPIC ELF code need this
    functionality (and if any other binfmt wants to add coredump support in
    the future, they'd probably need it, too), implement this with a common
    helper in fs/coredump.c.

    A downside of this approach is that we now need a bigger amount of kernel
    memory per userspace VMA in the normal ELF case, and that we need O(n)
    kernel memory in the FDPIC ELF case at all; but 40 bytes per VMA shouldn't
    be terribly bad.

    There currently is a data race between stack expansion and anything that
    reads ->vm_start or ->vm_end under the mmap_lock held in read mode; to
    mitigate that for core dumping, take the mmap_lock in write mode when
    taking a snapshot of the VMA hierarchy. (If we only took the mmap_lock in
    read mode, we could end up with a corrupted core dump if someone does
    get_user_pages_remote() concurrently. Not really a major problem, but
    taking the mmap_lock either way works here, so we might as well avoid the
    issue.) (This doesn't do anything about the existing data races with stack
    expansion in other mm code.)

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-6-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • At the moment, the binfmt_elf and binfmt_elf_fdpic code have slightly
    different code to figure out which VMAs should be dumped, and if so,
    whether the dump should contain the entire VMA or just its first page.

    Eliminate duplicate code by reworking the binfmt_elf version into a
    generic core dumping helper in coredump.c.

    As part of that, change the heuristic for detecting executable/library
    header pages to check whether the inode is executable instead of looking
    at the file mode.

    This is less problematic in terms of locking because it lets us avoid
    get_user() under the mmap_sem. (And arguably it looks nicer and makes
    more sense in generic code.)

    Adjust a little bit based on the binfmt_elf_fdpic version: ->anon_vma is
    only meaningful under CONFIG_MMU, otherwise we have to assume that the VMA
    has been written to.

    Suggested-by: Linus Torvalds
    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-5-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Both fs/binfmt_elf.c and fs/binfmt_elf_fdpic.c need to dump ranges of
    pages into the coredump file. Extract that logic into a common helper.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-4-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • dump_emit() has a retry loop, but there seems to be no way for that retry
    logic to actually be used; and it was also buggy, writing the same data
    repeatedly after a short write.

    Let's just bail out on a short write.

    Suggested-by: Linus Torvalds
    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-3-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     

13 Aug, 2020

1 commit

  • The document reads "%e" should be "executable filename" while actually it
    could be changed by things like pr_ctl PR_SET_NAME. People who uses "%e"
    in core_pattern get surprised when they find out they get thread name
    instead of executable filename.

    This is either a bug of document or a bug of code. Since the behavior of
    "%e" is there for long time, it could bring another surprise for users if
    we "fix" the code.

    So we just "fix" the document. And more, for users who really need the
    "executable filename" in core_pattern, we introduce a new "%f" for the
    real executable filename. We already have "%E" for executable path in
    kernel, so just reuse most of its code for the new added "%f" format.

    Signed-off-by: Lepton Wu
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200701031432.2978761-1-ytht.net@gmail.com
    Signed-off-by: Linus Torvalds

    Lepton Wu
     

10 Jun, 2020

2 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

28 Apr, 2020

1 commit

  • Commit 64e90a8acb859 ("Introduce STATIC_USERMODEHELPER to mediate
    call_usermodehelper()") added the optiont to disable all
    call_usermodehelper() calls by setting STATIC_USERMODEHELPER_PATH to
    an empty string. When this is done, and crashdump is triggered, it
    will crash on null pointer dereference, since we make assumptions
    over what call_usermodehelper_exec() did.

    This has been reported by Sergey when one triggers a a coredump
    with the following configuration:

    ```
    CONFIG_STATIC_USERMODEHELPER=y
    CONFIG_STATIC_USERMODEHELPER_PATH=""
    kernel.core_pattern = |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e
    ```

    The way disabling the umh was designed was that call_usermodehelper_exec()
    would just return early, without an error. But coredump assumes
    certain variables are set up for us when this happens, and calls
    ile_start_write(cprm.file) with a NULL file.

    [ 2.819676] BUG: kernel NULL pointer dereference, address: 0000000000000020
    [ 2.819859] #PF: supervisor read access in kernel mode
    [ 2.820035] #PF: error_code(0x0000) - not-present page
    [ 2.820188] PGD 0 P4D 0
    [ 2.820305] Oops: 0000 [#1] SMP PTI
    [ 2.820436] CPU: 2 PID: 89 Comm: a Not tainted 5.7.0-rc1+ #7
    [ 2.820680] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014
    [ 2.821150] RIP: 0010:do_coredump+0xd80/0x1060
    [ 2.821385] Code: e8 95 11 ed ff 48 c7 c6 cc a7 b4 81 48 8d bd 28 ff
    ff ff 89 c2 e8 70 f1 ff ff 41 89 c2 85 c0 0f 84 72 f7 ff ff e9 b4 fe ff
    ff 8b 57 20 0f b7 02 66 25 00 f0 66 3d 00 8
    0 0f 84 9c 01 00 00 44
    [ 2.822014] RSP: 0000:ffffc9000029bcb8 EFLAGS: 00010246
    [ 2.822339] RAX: 0000000000000000 RBX: ffff88803f860000 RCX: 000000000000000a
    [ 2.822746] RDX: 0000000000000009 RSI: 0000000000000282 RDI: 0000000000000000
    [ 2.823141] RBP: ffffc9000029bde8 R08: 0000000000000000 R09: ffffc9000029bc00
    [ 2.823508] R10: 0000000000000001 R11: ffff88803dec90be R12: ffffffff81c39da0
    [ 2.823902] R13: ffff88803de84400 R14: 0000000000000000 R15: 0000000000000000
    [ 2.824285] FS: 00007fee08183540(0000) GS:ffff88803e480000(0000) knlGS:0000000000000000
    [ 2.824767] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 2.825111] CR2: 0000000000000020 CR3: 000000003f856005 CR4: 0000000000060ea0
    [ 2.825479] Call Trace:
    [ 2.825790] get_signal+0x11e/0x720
    [ 2.826087] do_signal+0x1d/0x670
    [ 2.826361] ? force_sig_info_to_task+0xc1/0xf0
    [ 2.826691] ? force_sig_fault+0x3c/0x40
    [ 2.826996] ? do_trap+0xc9/0x100
    [ 2.827179] exit_to_usermode_loop+0x49/0x90
    [ 2.827359] prepare_exit_to_usermode+0x77/0xb0
    [ 2.827559] ? invalid_op+0xa/0x30
    [ 2.827747] ret_from_intr+0x20/0x20
    [ 2.827921] RIP: 0033:0x55e2c76d2129
    [ 2.828107] Code: 2d ff ff ff e8 68 ff ff ff 5d c6 05 18 2f 00 00 01
    c3 0f 1f 80 00 00 00 00 c3 0f 1f 80 00 00 00 00 e9 7b ff ff ff 55 48 89
    e5 0b b8 00 00 00 00 5d c3 66 2e 0f 1f 84 0
    0 00 00 00 00 0f 1f 40
    [ 2.828603] RSP: 002b:00007fffeba5e080 EFLAGS: 00010246
    [ 2.828801] RAX: 000055e2c76d2125 RBX: 0000000000000000 RCX: 00007fee0817c718
    [ 2.829034] RDX: 00007fffeba5e188 RSI: 00007fffeba5e178 RDI: 0000000000000001
    [ 2.829257] RBP: 00007fffeba5e080 R08: 0000000000000000 R09: 00007fee08193c00
    [ 2.829482] R10: 0000000000000009 R11: 0000000000000000 R12: 000055e2c76d2040
    [ 2.829727] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    [ 2.829964] CR2: 0000000000000020
    [ 2.830149] ---[ end trace ceed83d8c68a1bf1 ]---
    ```

    Cc: # v4.11+
    Fixes: 64e90a8acb85 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()")
    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199795
    Reported-by: Tony Vroon
    Reported-by: Sergey Kvachonok
    Tested-by: Sergei Trofimovich
    Signed-off-by: Luis Chamberlain
    Link: https://lore.kernel.org/r/20200416162859.26518-1-mcgrof@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Luis Chamberlain
     

22 Apr, 2020

1 commit

  • If the core_pattern is set to "|" and any process segfaults then we get
    a null pointer derefernce while trying to coredump. The call stack shows:

    RIP: do_coredump+0x628/0x11c0

    When the core_pattern has only "|" there is no use of trying the
    coredump and we can check that while formating the corename and exit
    with an error.

    After this change I get:

    format_corename failed
    Aborting core

    Fixes: 315c69261dd3 ("coredump: split pipe command whitespace before expanding template")
    Reported-by: Matthew Ruffell
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Andrew Morton
    Cc: Paul Wise
    Cc: Alexander Viro
    Cc: Neil Horman
    Cc:
    Link: http://lkml.kernel.org/r/20200416194612.21418-1-sudipm.mukherjee@gmail.com
    Signed-off-by: Linus Torvalds

    Sudip Mukherjee
     

09 Feb, 2020

1 commit

  • This makes the pipe code use separate wait-queues and exclusive waiting
    for readers and writers, avoiding a nasty thundering herd problem when
    there are lots of readers waiting for data on a pipe (or, less commonly,
    lots of writers waiting for a pipe to have space).

    While this isn't a common occurrence in the traditional "use a pipe as a
    data transport" case, where you typically only have a single reader and
    a single writer process, there is one common special case: using a pipe
    as a source of "locking tokens" rather than for data communication.

    In particular, the GNU make jobserver code ends up using a pipe as a way
    to limit parallelism, where each job consumes a token by reading a byte
    from the jobserver pipe, and releases the token by writing a byte back
    to the pipe.

    This pattern is fairly traditional on Unix, and works very well, but
    will waste a lot of time waking up a lot of processes when only a single
    reader needs to be woken up when a writer releases a new token.

    A simplified test-case of just this pipe interaction is to create 64
    processes, and then pass a single token around between them (this
    test-case also intentionally passes another token that gets ignored to
    test the "wake up next" logic too, in case anybody wonders about it):

    #include

    int main(int argc, char **argv)
    {
    int fd[2], counters[2];

    pipe(fd);
    counters[0] = 0;
    counters[1] = -1;
    write(fd[1], counters, sizeof(counters));

    /* 64 processes */
    fork(); fork(); fork(); fork(); fork(); fork();

    do {
    int i;
    read(fd[0], &i, sizeof(i));
    if (i < 0)
    continue;
    counters[0] = i+1;
    write(fd[1], counters, (1+(i & 1)) *sizeof(int));
    } while (counters[0] < 1000000);
    return 0;
    }

    and in a perfect world, passing that token around should only cause one
    context switch per transfer, when the writer of a token causes a
    directed wakeup of just a single reader.

    But with the "writer wakes all readers" model we traditionally had, on
    my test box the above case causes more than an order of magnitude more
    scheduling: instead of the expected ~1M context switches, "perf stat"
    shows

    231,852.37 msec task-clock # 15.857 CPUs utilized
    11,250,961 context-switches # 0.049 M/sec
    616,304 cpu-migrations # 0.003 M/sec
    1,648 page-faults # 0.007 K/sec
    1,097,903,998,514 cycles # 4.735 GHz
    120,781,778,352 instructions # 0.11 insn per cycle
    27,997,056,043 branches # 120.754 M/sec
    283,581,233 branch-misses # 1.01% of all branches

    14.621273891 seconds time elapsed

    0.018243000 seconds user
    3.611468000 seconds sys

    before this commit.

    After this commit, I get

    5,229.55 msec task-clock # 3.072 CPUs utilized
    1,212,233 context-switches # 0.232 M/sec
    103,951 cpu-migrations # 0.020 M/sec
    1,328 page-faults # 0.254 K/sec
    21,307,456,166 cycles # 4.074 GHz
    12,947,819,999 instructions # 0.61 insn per cycle
    2,881,985,678 branches # 551.096 M/sec
    64,267,015 branch-misses # 2.23% of all branches

    1.702148350 seconds time elapsed

    0.004868000 seconds user
    0.110786000 seconds sys

    instead. Much better.

    [ Note! This kernel improvement seems to be very good at triggering a
    race condition in the make jobserver (in GNU make 4.2.1) for me. It's
    a long known bug that was fixed back in June 2017 by GNU make commit
    b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
    avoid hangs.").

    But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
    so a number of distributions may still have the buggy version. Some
    have backported the fix to their 4.2.1 release, though, and even
    without the fix it's quite timing-dependent whether the bug actually
    is hit. ]

    Josh Triplett says:
    "I've been hammering on your pipe fix patch (switching to exclusive
    wait queues) for a month or so, on several different systems, and I've
    run into no issues with it. The patch *substantially* improves
    parallel build times on large (~100 CPU) systems, both with parallel
    make and with other things that use make's pipe-based jobserver.

    All current distributions (including stable and long-term stable
    distributions) have versions of GNU make that no longer have the
    jobserver bug"

    Tested-by: Josh Triplett
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Aug, 2019

1 commit

  • Save the offsets of the start of each argument to avoid having to update
    pointers to each argument after every corename krealloc and to avoid
    having to duplicate the memory for the dump command.

    Executable names containing spaces were previously being expanded from
    %e or %E and then split in the middle of the filename. This is
    incorrect behaviour since an argument list can represent arguments with
    spaces.

    The splitting could lead to extra arguments being passed to the core
    dump handler that it might have interpreted as options or ignored
    completely.

    Core dump handlers that are not aware of this Linux kernel issue will be
    using %e or %E without considering that it may be split and so they will
    be vulnerable to processes with spaces in their names breaking their
    argument list. If their internals are otherwise well written, such as
    if they are written in shell but quote arguments, they will work better
    after this change than before. If they are not well written, then there
    is a slight chance of breakage depending on the details of the code but
    they will already be fairly broken by the split filenames.

    Core dump handlers that are aware of this Linux kernel issue will be
    placing %e or %E as the last item in their core_pattern and then
    aggregating all of the remaining arguments into one, separated by
    spaces. Alternatively they will be obtaining the filename via other
    methods. Both of these will be compatible with the new arrangement.

    A side effect from this change is that unknown template types (for
    example %z) result in an empty argument to the dump handler instead of
    the argument being dropped. This is a desired change as:

    It is easier for dump handlers to process empty arguments than dropped
    ones, especially if they are written in shell or don't pass each
    template item with a preceding command-line option in order to
    differentiate between individual template types. Most core_patterns in
    the wild do not use options so they can confuse different template types
    (especially numeric ones) if an earlier one gets dropped in old kernels.
    If the kernel introduces a new template type and a core_pattern uses it,
    the core dump handler might not expect that the argument can be dropped
    in old kernels.

    For example, this can result in security issues when %d is dropped in
    old kernels. This happened with the corekeeper package in Debian and
    resulted in the interface between corekeeper and Linux having to be
    rewritten to use command-line options to differentiate between template
    types.

    The core_pattern for most core dump handlers is written by the handler
    author who would generally not insert unknown template types so this
    change should be compatible with all the core dump handlers that exist.

    Link: http://lkml.kernel.org/r/20190528051142.24939-1-pabs3@bonedaddy.net
    Fixes: 74aadce98605 ("core_pattern: allow passing of arguments to user mode helper when core_pattern is a pipe")
    Signed-off-by: Paul Wise
    Reported-by: Jakub Wilk [https://bugs.debian.org/924398]
    Reported-by: Paul Wise [https://lore.kernel.org/linux-fsdevel/c8b7ecb8508895bf4adb62a748e2ea2c71854597.camel@bonedaddy.net/]
    Suggested-by: Jakub Wilk
    Acked-by: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Wise
     

03 Oct, 2018

1 commit

  • Linus recently observed that if we did not worry about the padding
    member in struct siginfo it is only about 48 bytes, and 48 bytes is
    much nicer than 128 bytes for allocating on the stack and copying
    around in the kernel.

    The obvious thing of only adding the padding when userspace is
    including siginfo.h won't work as there are sigframe definitions in
    the kernel that embed struct siginfo.

    So split siginfo in two; kernel_siginfo and siginfo. Keeping the
    traditional name for the userspace definition. While the version that
    is used internally to the kernel and ultimately will not be padded to
    128 bytes is called kernel_siginfo.

    The definition of struct kernel_siginfo I have put in include/signal_types.h

    A set of buildtime checks has been added to verify the two structures have
    the same field offsets.

    To make it easy to verify the change kernel_siginfo retains the same
    size as siginfo. The reduction in size comes in a following change.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

18 Nov, 2017

1 commit

  • Pull compat and uaccess updates from Al Viro:

    - {get,put}_compat_sigset() series

    - assorted compat ioctl stuff

    - more set_fs() elimination

    - a few more timespec64 conversions

    - several removals of pointless access_ok() in places where it was
    followed only by non-__ variants of primitives

    * 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (24 commits)
    coredump: call do_unlinkat directly instead of sys_unlink
    fs: expose do_unlinkat for built-in callers
    ext4: take handling of EXT4_IOC_GROUP_ADD into a helper, get rid of set_fs()
    ipmi: get rid of pointless access_ok()
    pi433: sanitize ioctl
    cxlflash: get rid of pointless access_ok()
    mtdchar: get rid of pointless access_ok()
    r128: switch compat ioctls to drm_ioctl_kernel()
    selection: get rid of field-by-field copyin
    VT_RESIZEX: get rid of field-by-field copyin
    i2c compat ioctls: move to ->compat_ioctl()
    sched_rr_get_interval(): move compat to native, get rid of set_fs()
    mips: switch to {get,put}_compat_sigset()
    sparc: switch to {get,put}_compat_sigset()
    s390: switch to {get,put}_compat_sigset()
    ppc: switch to {get,put}_compat_sigset()
    parisc: switch to {get,put}_compat_sigset()
    get_compat_sigset()
    get rid of {get,put}_compat_itimerspec()
    io_getevents: Use timespec64 to represent timeouts
    ...

    Linus Torvalds
     

10 Nov, 2017

1 commit


02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information it it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

14 Sep, 2017

1 commit

  • GFP_TEMPORARY was introduced by commit e12ba74d8ff3 ("Group short-lived
    and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. It's
    primary motivation was to allow users to tell that an allocation is
    short lived and so the allocator can try to place such allocations close
    together and prevent long term fragmentation. As much as this sounds
    like a reasonable semantic it becomes much less clear when to use the
    highlevel GFP_TEMPORARY allocation flag. How long is temporary? Can the
    context holding that memory sleep? Can it take locks? It seems there is
    no good answer for those questions.

    The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
    __GFP_RECLAIMABLE which in itself is tricky because basically none of
    the existing caller provide a way to reclaim the allocated memory. So
    this is rather misleading and hard to evaluate for any benefits.

    I have checked some random users and none of them has added the flag
    with a specific justification. I suspect most of them just copied from
    other existing users and others just thought it might be a good idea to
    use without any measuring. This suggests that GFP_TEMPORARY just
    motivates for cargo cult usage without any reasoning.

    I believe that our gfp flags are quite complex already and especially
    those with highlevel semantic should be clearly defined to prevent from
    confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and
    replace all existing users to simply use GFP_KERNEL. Please note that
    SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
    so they will be placed properly for memory fragmentation prevention.

    I can see reasons we might want some gfp flag to reflect shorterm
    allocations but I propose starting from a clear semantic definition and
    only then add users with proper justification.

    This was been brought up before LSF this year by Matthew [1] and it
    turned out that GFP_TEMPORARY really doesn't have a clear semantic. It
    seems to be a heuristic without any measured advantage for most (if not
    all) its current users. The follow up discussion has revealed that
    opinions on what might be temporary allocation differ a lot between
    developers. So rather than trying to tweak existing users into a
    semantic which they haven't expected I propose to simply remove the flag
    and start from scratch if we really need a semantic for short term
    allocations.

    [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org

    [akpm@linux-foundation.org: fix typo]
    [akpm@linux-foundation.org: coding-style fixes]
    [sfr@canb.auug.org.au: drm/i915: fix up]
    Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Neil Brown
    Cc: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

02 Mar, 2017

3 commits


15 Jan, 2017

1 commit

  • If the last section of a core file ends with an unmapped or zero page,
    the size of the file does not correspond with the last dump_skip() call.
    gdb complains that the file is truncated and can be confusing to users.

    After all of the vma sections are written, make sure that the file size
    is no smaller than the current file position.

    This problem can be demonstrated with gdb's bigcore testcase on the
    sparc architecture.

    Signed-off-by: Dave Kleikamp
    Cc: Alexander Viro
    Cc: linux-fsdevel@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Al Viro

    Dave Kleikamp
     

25 Dec, 2016

1 commit


12 Nov, 2016

1 commit

  • It could be not possible to freeze coredumping task when it waits for
    'core_state->startup' completion, because threads are frozen in
    get_signal() before they got a chance to complete 'core_state->startup'.

    Inability to freeze a task during suspend will cause suspend to fail.
    Also CRIU uses cgroup freezer during dump operation. So with an
    unfreezable task the CRIU dump will fail because it waits for a
    transition from 'FREEZING' to 'FROZEN' state which will never happen.

    Use freezer_do_not_count() to tell freezer to ignore coredumping task
    while it waits for core_state->startup completion.

    Link: http://lkml.kernel.org/r/1475225434-3753-1-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Pavel Machek
    Acked-by: Oleg Nesterov
    Cc: Alexander Viro
    Cc: Tejun Heo
    Cc: "Rafael J. Wysocki"
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

08 Jun, 2016

1 commit

  • The offset in the core file used to be tracked with ->written field of
    the coredump_params structure. The field was retired in favour of
    file->f_pos.

    However, ->f_pos is not maintained for pipes which leads to breakage.

    Restore explicit tracking of the offset in coredump_params. Introduce
    ->pos field for this purpose since ->written was already reused.

    Fixes: a00839395103 ("get rid of coredump_params->written").

    Reported-by: Zbigniew Jędrzejewski-Szmek
    Signed-off-by: Mateusz Guzik
    Reviewed-by: Omar Sandoval
    Signed-off-by: Al Viro

    Mateusz Guzik
     

24 May, 2016

1 commit

  • coredump_wait waits for mmap_sem for write currently which can prevent
    oom_reaper to reclaim the oom victims address space asynchronously
    because that requires mmap_sem for read. This might happen if the oom
    victim is multi threaded and some thread(s) is holding mmap_sem for read
    (e.g. page fault) and it is stuck in the page allocator while other
    thread(s) reached coredump_wait already.

    This patch simply uses down_write_killable and bails out with EINTR if
    the lock got interrupted by the fatal signal. do_coredump will return
    right away and do_group_exit will take care to zap the whole thread
    group.

    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

13 May, 2016

2 commits

  • Commit 9b56d54380ad ("dump_skip(): dump_seek() replacement taking
    coredump_params") introduced a regression with regard to RLIMIT_CORE.
    Previously, when a core dump was sparse, only the data that was actually
    written out would count against the limit. Now, the sparse ranges are
    also included, which leads to truncated core dumps when the actual disk
    usage is still well below the limit. Restore the old behavior by only
    counting what gets emitted and ignoring what gets skipped.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Al Viro

    Omar Sandoval
     
  • cprm->written is redundant with cprm->file->f_pos, so use that instead.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Al Viro

    Omar Sandoval
     

23 Mar, 2016

1 commit

  • This commit fixes the following security hole affecting systems where
    all of the following conditions are fulfilled:

    - The fs.suid_dumpable sysctl is set to 2.
    - The kernel.core_pattern sysctl's value starts with "/". (Systems
    where kernel.core_pattern starts with "|/" are not affected.)
    - Unprivileged user namespace creation is permitted. (This is
    true on Linux >=3.8, but some distributions disallow it by
    default using a distro patch.)

    Under these conditions, if a program executes under secure exec rules,
    causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
    namespace, changes its root directory and crashes, the coredump will be
    written using fsuid=0 and a path derived from kernel.core_pattern - but
    this path is interpreted relative to the root directory of the process,
    allowing the attacker to control where a coredump will be written with
    root privileges.

    To fix the security issue, always interpret core_pattern for dumps that
    are written under SUID_DUMP_ROOT relative to the root directory of init.

    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Andy Lutomirski
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     

21 Jan, 2016

1 commit

  • Let %h and %e print empty values as "!", "." as "!" and
    ".." as "!.".

    This prevents hostnames and comm values that are empty or consist of one
    or two dots from changing the directory level at which the corefile will
    be stored.

    Consider the case where someone decides to sort coredumps by hostname
    with a core pattern like "/cores/%h/core.%e.%p.%t" or so. In this
    case, hostnames "" and "." would cause the coredump to land directly in
    /cores, which is not what the intent behind the core pattern is, and
    ".." would cause the coredump to land in /.

    Yeah, there probably aren't many people who do that, but I still don't
    want this edgecase to be kind of broken.

    It seems very unlikely that this caused security issues anywhere, so I'm
    not requesting a stable backport.

    [akpm@linux-foundation.org: tweak code comment]
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     

07 Dec, 2015

1 commit

  • struct timeval on 32-bit systems will have its tv_sec
    value overflow in year 2038 and beyond.
    Use a 64 bit value to print time of the coredump in seconds.
    ktime_get_real_seconds is chosen here for efficiency reasons.

    Suggested by: Arnd Bergmann
    Signed-off-by: Tina Ruchandani
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Al Viro

    Arnd Bergmann
     

07 Nov, 2015

2 commits

  • Change zap_threads() paths to use for_each_thread() rather than
    while_each_thread().

    While at it, change zap_threads() to avoid the nested if's to make the
    code more readable and lessen the indentation.

    Signed-off-by: Oleg Nesterov
    Cc: David Rientjes
    Cc: Kyle Walker
    Cc: Michal Hocko
    Cc: Stanislav Kozina
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • task_will_free_mem() is wrong in many ways, and in particular the
    SIGNAL_GROUP_COREDUMP check is not reliable: a task can participate in the
    coredumping without SIGNAL_GROUP_COREDUMP bit set.

    change zap_threads() paths to always set SIGNAL_GROUP_COREDUMP even if
    other CLONE_VM processes can't react to SIGKILL. Fortunately, at least
    oom-kill case if fine; it kills all tasks sharing the same mm, so it
    should also kill the process which actually dumps the core.

    The change in prepare_signal() is not strictly necessary, it just ensures
    that the patch does not bring another subtle behavioural change. But it
    reminds us that this SIGNAL_GROUP_EXIT/COREDUMP case needs more changes.

    Signed-off-by: Oleg Nesterov
    Cc: David Rientjes
    Cc: Kyle Walker
    Acked-by: Michal Hocko
    Cc: Stanislav Kozina
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

11 Sep, 2015

2 commits

  • On a filesystem like vfat, all files are created with the same owner
    and mode independent of who created the file. When a vfat filesystem
    is mounted with root as owner of all files and read access for everyone,
    root's processes left world-readable coredumps on it (but other
    users' processes only left empty corefiles when given write access
    because of the uid mismatch).

    Given that the old behavior was inconsistent and insecure, I don't see
    a problem with changing it. Now, all processes refuse to dump core unless
    the resulting corefile will only be readable by their owner.

    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • It was possible for an attacking user to trick root (or another user) into
    writing his coredumps into an attacker-readable, pre-existing file using
    rename() or link(), causing the disclosure of secret data from the victim
    process' virtual memory. Depending on the configuration, it was also
    possible to trick root into overwriting system files with coredumps. Fix
    that issue by never writing coredumps into existing files.

    Requirements for the attack:
    - The attack only applies if the victim's process has a nonzero
    RLIMIT_CORE and is dumpable.
    - The attacker can trick the victim into coredumping into an
    attacker-writable directory D, either because the core_pattern is
    relative and the victim's cwd is attacker-writable or because an
    absolute core_pattern pointing to a world-writable directory is used.
    - The attacker has one of these:
    A: on a system with protected_hardlinks=0:
    execute access to a folder containing a victim-owned,
    attacker-readable file on the same partition as D, and the
    victim-owned file will be deleted before the main part of the attack
    takes place. (In practice, there are lots of files that fulfill
    this condition, e.g. entries in Debian's /var/lib/dpkg/info/.)
    This does not apply to most Linux systems because most distros set
    protected_hardlinks=1.
    B: on a system with protected_hardlinks=1:
    execute access to a folder containing a victim-owned,
    attacker-readable and attacker-writable file on the same partition
    as D, and the victim-owned file will be deleted before the main part
    of the attack takes place.
    (This seems to be uncommon.)
    C: on any system, independent of protected_hardlinks:
    write access to a non-sticky folder containing a victim-owned,
    attacker-readable file on the same partition as D
    (This seems to be uncommon.)

    The basic idea is that the attacker moves the victim-owned file to where
    he expects the victim process to dump its core. The victim process dumps
    its core into the existing file, and the attacker reads the coredump from
    it.

    If the attacker can't move the file because he does not have write access
    to the containing directory, he can instead link the file to a directory
    he controls, then wait for the original link to the file to be deleted
    (because the kernel checks that the link count of the corefile is 1).

    A less reliable variant that requires D to be non-sticky works with link()
    and does not require deletion of the original link: link() the file into
    D, but then unlink() it directly before the kernel performs the link count
    check.

    On systems with protected_hardlinks=0, this variant allows an attacker to
    not only gain information from coredumps, but also clobber existing,
    victim-writable files with coredumps. (This could theoretically lead to a
    privilege escalation.)

    Signed-off-by: Jann Horn
    Cc: Kees Cook
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     

05 Jul, 2015

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted VFS fixes and related cleanups (IMO the most interesting in
    that part are f_path-related things and Eric's descriptor-related
    stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
    fs-cache series, DAX patches, Jan's file_remove_suid() work"

    [ I'd say this is much more than "fixes and related cleanups". The
    file_table locking rule change by Eric Dumazet is a rather big and
    fundamental update even if the patch isn't huge. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
    9p: cope with bogus responses from server in p9_client_{read,write}
    p9_client_write(): avoid double p9_free_req()
    9p: forgetting to cancel request on interrupted zero-copy RPC
    dax: bdev_direct_access() may sleep
    block: Add support for DAX reads/writes to block devices
    dax: Use copy_from_iter_nocache
    dax: Add block size note to documentation
    fs/file.c: __fget() and dup2() atomicity rules
    fs/file.c: don't acquire files->file_lock in fd_install()
    fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
    vfs: avoid creation of inode number 0 in get_next_ino
    namei: make set_root_rcu() return void
    make simple_positive() public
    ufs: use dir_pages instead of ufs_dir_pages()
    pagemap.h: move dir_pages() over there
    remove the pointless include of lglock.h
    fs: cleanup slight list_entry abuse
    xfs: Correctly lock inode when removing suid and file capabilities
    fs: Call security_ops->inode_killpriv on truncate
    fs: Provide function telling whether file_remove_privs() will do anything
    ...

    Linus Torvalds
     

26 Jun, 2015

2 commits

  • This allows detecting improper format string at build time, like:

    fs/coredump.c:225:5: warning: format '%ld' expects argument of type 'long int', but argument 3 has type 'int' [-Wformat=]
    err = cn_printf(cn, "%ld", cprm->siginfo->si_signo);
    ^

    As si_signo is always an int, the format should be %d here.

    Signed-off-by: Nicolas Iooss
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Iooss
     
  • When adding __printf attribute to cn_printf, gcc reports some issues:

    fs/coredump.c:213:5: warning: format '%d' expects argument of type
    'int', but argument 3 has type 'kuid_t' [-Wformat=]
    err = cn_printf(cn, "%d", cred->uid);
    ^
    fs/coredump.c:217:5: warning: format '%d' expects argument of type
    'int', but argument 3 has type 'kgid_t' [-Wformat=]
    err = cn_printf(cn, "%d", cred->gid);
    ^

    These warnings come from the fact that the value of uid/gid needs to be
    extracted from the kuid_t/kgid_t structure before being used as an
    integer. More precisely, cred->uid and cred->gid need to be converted to
    either user-namespace uid/gid or to init_user_ns uid/gid.

    Use init_user_ns in order not to break existing ABI, and document this in
    Documentation/sysctl/kernel.txt.

    While at it, format uid and gid values with %u instead of %d because
    uid_t/__kernel_uid32_t and gid_t/__kernel_gid32_t are unsigned int.

    Signed-off-by: Nicolas Iooss
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicolas Iooss
     

24 Jun, 2015

1 commit


12 Apr, 2015

1 commit


20 Feb, 2015

1 commit