17 Aug, 2020

1 commit

  • Pull io_uring fixes from Jens Axboe:
    "A few different things in here.

    Seems like syzbot got some more io_uring bits wired up, and we got a
    handful of reports and the associated fixes are in here.

    General fixes too, and a lot of them marked for stable.

    Lastly, a bit of fallout from the async buffered reads, where we now
    more easily trigger short reads. Some applications don't really like
    that, so the io_read() code now handles short reads internally, and
    got a cleanup along the way so that it's now easier to read (and
    documented). We're now passing tests that failed before"

    * tag 'io_uring-5.9-2020-08-15' of git://git.kernel.dk/linux-block:
    io_uring: short circuit -EAGAIN for blocking read attempt
    io_uring: sanitize double poll handling
    io_uring: internally retry short reads
    io_uring: retain iov_iter state over io_read/io_write calls
    task_work: only grab task signal lock when needed
    io_uring: enable lookup of links holding inflight files
    io_uring: fail poll arm on queue proc failure
    io_uring: hold 'ctx' reference around task_work queue + execute
    fs: RWF_NOWAIT should imply IOCB_NOIO
    io_uring: defer file table grabbing request cleanup for locked requests
    io_uring: add missing REQ_F_COMP_LOCKED for nested requests
    io_uring: fix recursive completion locking on oveflow flush
    io_uring: use TWA_SIGNAL for task_work uncondtionally
    io_uring: account locked memory before potential error case
    io_uring: set ctx sq/cq entry count earlier
    io_uring: Fix NULL pointer dereference in loop_rw_iter()
    io_uring: add comments on how the async buffered read retry works
    io_uring: io_async_buf_func() need not test page bit

    Linus Torvalds
     

16 Aug, 2020

2 commits

  • One case was missed in the short IO retry handling: hitting -EAGAIN
    on a blocking read attempt (e.g. from io-wq context). This is a
    problem on sockets that are marked as non-blocking when created, as
    they don't carry any REQ_F_NOWAIT information to help us terminate
    them instead of perpetually retrying.

    Fixes: 227c0c9673d8 ("io_uring: internally retry short reads")
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • There's a bit of confusion on the matching pairs of poll vs double poll,
    depending on if the request is a pure poll (IORING_OP_POLL_ADD) or
    poll driven retry.

    Add io_poll_get_double() that returns the double poll waitqueue, if any,
    and io_poll_get_single() that returns the original poll waitqueue. With
    that, remove the argument to io_poll_remove_double().

    Finally ensure that wait->private is cleared once the double poll handler
    has run, so that remove knows it's already been seen.

    Cc: stable@vger.kernel.org # v5.8
    Reported-by: syzbot+7f617d4a9369028b8a2c@syzkaller.appspotmail.com
    Fixes: 18bceab101ad ("io_uring: allow POLL_ADD with double poll_wait() users")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Aug, 2020

5 commits

  • Pull 9p updates from Dominique Martinet:

    - some code cleanup

    - a couple of static analysis fixes

    - setattr: try to pick a fid associated with the file rather than the
    dentry, which might sometimes matter

    * tag '9p-for-5.9-rc1' of git://github.com/martinetd/linux:
    9p: Remove unneeded cast from memory allocation
    9p: remove unused code in 9p
    net/9p: Fix sparse endian warning in trans_fd.c
    9p: Fix memory leak in v9fs_mount
    9p: retrieve fid from file when file instance exist.

    Linus Torvalds
     
  • Pull cifs fixes from Steve French:
    "Three small cifs/smb3 fixes, one for stable fixing mkdir path with
    the 'idsfromsid' mount option"

    * tag '5.9-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
    SMB3: Fix mkdir when idsfromsid configured on mount
    cifs: Convert to use the fallthrough macro
    cifs: Fix an error pointer dereference in cifs_mount()

    Linus Torvalds
     
  • Pull NFS client updates from Trond Myklebust:
    "Stable fixes:
    - pNFS: Don't return layout segments that are being used for I/O
    - pNFS: Don't move layout segments off the active list when being used for I/O

    Features:
    - NFS: Add support for user xattrs through the NFSv4.2 protocol
    - NFS: Allow applications to speed up readdir+statx() using AT_STATX_DONT_SYNC
    - NFSv4.0 allow nconnect for v4.0

    Bugfixes and cleanups:
    - nfs: ensure correct writeback errors are returned on close()
    - nfs: nfs_file_write() should check for writeback errors
    - nfs: Fix getxattr kernel panic and memory overflow
    - NFS: Fix the pNFS/flexfiles mirrored read failover code
    - SUNRPC: dont update timeout value on connection reset
    - freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for NFS
    - sunrpc: destroy rpc_inode_cachep after unregister_filesystem"

    * tag 'nfs-for-5.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (32 commits)
    NFS: Fix flexfiles read failover
    fs: nfs: delete repeated words in comments
    rpc_pipefs: convert comma to semicolon
    nfs: Fix getxattr kernel panic and memory overflow
    NFS: Don't return layout segments that are in use
    NFS: Don't move layouts to plh_return_segs list while in use
    NFS: Add layout segment info to pnfs read/write/commit tracepoints
    NFS: Add tracepoints for layouterror and layoutstats.
    NFS: Report the stateid + status in trace_nfs4_layoutreturn_on_close()
    SUNRPC dont update timeout value on connection reset
    nfs: nfs_file_write() should check for writeback errors
    nfs: ensure correct writeback errors are returned on close()
    NFSv4.2: xattr cache: get rid of cache discard work queue
    NFS: remove redundant initialization of variable result
    NFSv4.0 allow nconnect for v4.0
    freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for NFS
    sunrpc: destroy rpc_inode_cachep after unregister_filesystem
    NFSv4.2: add client side xattr caching.
    NFSv4.2: hook in the user extended attribute handlers
    NFSv4.2: add the extended attribute proc functions.
    ...

    Linus Torvalds
     
  • Drop duplicated words {the, at} in comments.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Acked-by: Ian Kent
    Link: http://lkml.kernel.org/r/20200811021817.24982-1-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Patch series "Fix S_ISDIR execve() errno".

    Fix an errno change for execve() of directories, noticed by Marc Zyngier.
    Along with the fix, include a regression test to avoid seeing this return
    in the future.

    This patch (of 2):

    The return code for attempting to execute a directory has always been
    EACCES. Adjust the S_ISDIR exec test to reflect the old errno instead of
    the general EISDIR for other kinds of "open" attempts on directories.

    Fixes: 633fb6ac3980 ("exec: move S_ISREG() check earlier")
    Reported-by: Marc Zyngier
    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Tested-by: Greg Kroah-Hartman
    Reviewed-by: Greg Kroah-Hartman
    Link: http://lkml.kernel.org/r/20200813231723.2725102-2-keescook@chromium.org
    Link: https://lore.kernel.org/lkml/20200813151305.6191993b@why
    Signed-off-by: Linus Torvalds

    Kees Cook
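
    The errno behavior described above can be observed from userspace. The
    following is a minimal sketch, not part of the patch or its regression
    test; the helper name exec_dir_errno() is hypothetical. It refuses
    anything that is not a directory, so the execve() call is expected to
    fail and return rather than replace the calling process.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to execve() a directory and report the errno.
 * Returns -1 if `path` cannot be stat()ed or is not a directory, so that
 * we never accidentally exec a real program.
 */
int exec_dir_errno(const char *path)
{
    struct stat st;
    char *const argv[] = { (char *)path, NULL };
    char *const envp[] = { NULL };

    if (stat(path, &st) != 0 || !S_ISDIR(st.st_mode))
        return -1;

    /* execve() only returns on failure, with errno set. */
    execve(path, argv, envp);
    return errno;
}
```

    Per the commit message, exec_dir_errno("/") should report EACCES on a
    kernel carrying this fix; a kernel with 633fb6ac3980 but without the fix
    would be expected to report EISDIR instead.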
     

14 Aug, 2020

6 commits

  • mkdir uses a compounded create operation which was not setting
    the security descriptor on create of a directory. Fix so
    mkdir now sets the mode and owner info properly when idsfromsid
    and modefromsid are configured on the mount.

    Signed-off-by: Steve French
    CC: Stable # v5.8
    Reviewed-by: Paulo Alcantara (SUSE)
    Reviewed-by: Pavel Shilovsky

    Steve French
     
  • We've had a few application cases of not handling short reads properly,
    and it is understandable as short reads aren't really expected if the
    application isn't doing non-blocking IO.

    Now that we retain the iov_iter over retries, we can implement internal
    retry pretty trivially. This ensures that we don't return a short read,
    even for buffered reads on page cache conflicts.

    Clean up the deep nesting and hard to read nature of io_read() as well,
    it's much more straightforward now to read and understand. Added a
    few comments explaining the logic as well.

    Signed-off-by: Jens Axboe

    Jens Axboe
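
    For context, this is the kind of retry loop an application would
    otherwise need around plain read(2) to cope with short reads. It is a
    userspace sketch only and does not show the io_uring change itself; the
    helper name read_full() is hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: keep reading until `len` bytes have arrived,
 * end-of-file is reached, or a real error occurs.
 * Returns the number of bytes read, or -1 on error (errno is set).
 */
ssize_t read_full(int fd, void *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = read(fd, (char *)buf + done, len - done);

        if (n > 0) {
            done += (size_t)n;   /* short read: loop for the rest */
            continue;
        }
        if (n == 0)
            break;               /* end of file */
        if (errno == EINTR)
            continue;            /* interrupted by a signal, retry */
        return -1;               /* includes EAGAIN on non-blocking fds */
    }
    return (ssize_t)done;
}
```

    Note that the sketch gives up on EAGAIN rather than spinning, which is
    one reasonable choice for a non-blocking descriptor.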
     
  • Instead of maintaining (and setting/remembering) iov_iter size and
    segment counts, just put the iov_iter in the async part of the IO
    structure.

    This is mostly a preparation patch for doing appropriate internal retries
    for short reads, but it also cleans up the state handling nicely and
    simplifies it quite a bit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Pull more btrfs updates from David Sterba:
    "One minor update, the rest are fixes that have arrived a bit late for
    the first batch. There are also some recent fixes for bugs that were
    discovered during the merge window and pop up during testing.

    User visible change:

    - show correct subvolume path in /proc/mounts for bind mounts

    Fixes:

    - fix compression messages when remounting with different level or
    compression algorithm

    - tree-log: fix some memory leaks on error handling paths

    - restore I_VERSION on remount

    - fix return values and error code mixups

    - fix umount crash with quotas enabled when removing sysfs files

    - fix trim range on a shrunk device"

    * tag 'for-5.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
    btrfs: trim: fix underflow in trim length to prevent access beyond device boundary
    btrfs: fix return value mixup in btrfs_get_extent
    btrfs: sysfs: fix NULL pointer dereference at btrfs_sysfs_del_qgroups()
    btrfs: check correct variable after allocation in btrfs_backref_iter_alloc
    btrfs: make sure SB_I_VERSION doesn't get unset by remount
    btrfs: fix memory leaks after failure to lookup checksums during inode logging
    btrfs: don't show full path of bind mounts in subvol=
    btrfs: fix messages after changing compression level by remount
    btrfs: only search for left_info if there is no right_info in try_merge_free_space
    btrfs: inode: fix NULL pointer dereference if inode doesn't need compression

    Linus Torvalds
     
  • Pull xfs fixes from Darrick Wong:
    "Two small fixes that have come in during the past week:

    - Fix duplicated words in comments

    - Fix an ubsan complaint about null pointer arithmetic"

    * tag 'xfs-5.9-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    xfs: Fix UBSAN null-ptr-deref in xfs_sysfs_init
    xfs: delete duplicated words + other fixes

    Linus Torvalds
     
  • Pull exfat updates from Namjae Jeon:

    - don't clear MediaFailure and VolumeDirty bit in volume flags if these
    were already set before mounting

    - write multiple dirty buffers at once in sync mode

    - remove unneeded EXFAT_SB_DIRTY bit set

    * tag 'exfat-for-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
    exfat: retain 'VolumeFlags' properly
    exfat: optimize exfat_zeroed_cluster()
    exfat: add error check when updating dir-entries
    exfat: write multiple sectors at once
    exfat: remove EXFAT_SB_DIRTY flag

    Linus Torvalds
     

13 Aug, 2020

26 commits

  • When a process exits, we cancel whatever requests it has pending that
    are referencing the file table. However, if a link is holding a
    reference, then we cannot find it by simply looking at the inflight
    list.

    Enable checking of the poll and timeout list to find the link, and
    cancel it appropriately.

    Cc: stable@vger.kernel.org
    Reported-by: Josef
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Pull ceph updates from Ilya Dryomov:
    "Xiubo has completed his work on filesystem client metrics, they are
    sent to all available MDSes once per second now.

    Other than that, we have a lot of fixes and cleanups all around the
    filesystem, including a tweak to cut down on MDS request resends in
    multi-MDS setups from Yanhu and fixups for SELinux symlink labeling
    and MClientSession message decoding from Jeff"

    * tag 'ceph-for-5.9-rc1' of git://github.com/ceph/ceph-client: (22 commits)
    ceph: handle zero-length feature mask in session messages
    ceph: use frag's MDS in either mode
    ceph: move sb->wb_pagevec_pool to be a global mempool
    ceph: set sec_context xattr on symlink creation
    ceph: remove redundant initialization of variable mds
    ceph: fix use-after-free for fsc->mdsc
    ceph: remove unused variables in ceph_mdsmap_decode()
    ceph: delete repeated words in fs/ceph/
    ceph: send client provided metric flags in client metadata
    ceph: periodically send perf metrics to MDSes
    ceph: check the sesion state and return false in case it is closed
    libceph: replace HTTP links with HTTPS ones
    ceph: remove unnecessary cast in kfree()
    libceph: just have osd_req_op_init() return a pointer
    ceph: do not access the kiocb after aio requests
    ceph: clean up and optimize ceph_check_delayed_caps()
    ceph: fix potential mdsc use-after-free crash
    ceph: switch to WARN_ON_ONCE in encode_supported_features()
    ceph: add global total_caps to count the mdsc's total caps number
    ceph: add check_session_state() helper and make it global
    ...

    Linus Torvalds
     
  • Merge more updates from Andrew Morton:

    - most of the rest of MM (memcg, hugetlb, vmscan, proc, compaction,
    mempolicy, oom-kill, hugetlbfs, migration, thp, cma, util,
    memory-hotplug, cleanups, uaccess, migration, gup, pagemap),

    - various other subsystems (alpha, misc, sparse, bitmap, lib, bitops,
    checkpatch, autofs, minix, nilfs, ufs, fat, signals, kmod, coredump,
    exec, kdump, rapidio, panic, kcov, kgdb, ipc).

    * emailed patches from Andrew Morton: (164 commits)
    mm/gup: remove task_struct pointer for all gup code
    mm: clean up the last pieces of page fault accountings
    mm/xtensa: use general page fault accounting
    mm/x86: use general page fault accounting
    mm/sparc64: use general page fault accounting
    mm/sparc32: use general page fault accounting
    mm/sh: use general page fault accounting
    mm/s390: use general page fault accounting
    mm/riscv: use general page fault accounting
    mm/powerpc: use general page fault accounting
    mm/parisc: use general page fault accounting
    mm/openrisc: use general page fault accounting
    mm/nios2: use general page fault accounting
    mm/nds32: use general page fault accounting
    mm/mips: use general page fault accounting
    mm/microblaze: use general page fault accounting
    mm/m68k: use general page fault accounting
    mm/ia64: use general page fault accounting
    mm/hexagon: use general page fault accounting
    mm/csky: use general page fault accounting
    ...

    Linus Torvalds
     
  • After the cleanup of page fault accounting, gup does not need to pass
    task_struct around any more. Remove that parameter in the whole gup
    stack.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: John Hubbard
    Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • The path_noexec() check, like the regular file check, was happening too
    late, letting LSMs see impossible execve()s. Check it earlier as well in
    may_open() and collect the redundant fs/exec.c path_noexec() test under
    the same robustness comment as the S_ISREG() check.

    My notes on the call path, and related arguments, checks, etc:

    do_open_execat()
        struct open_flags open_exec_flags = {
            .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
            .acc_mode = MAY_EXEC,
            ...
        do_filp_open(dfd, filename, open_flags)
            path_openat(nameidata, open_flags, flags)
                file = alloc_empty_file(open_flags, current_cred());
                do_open(nameidata, file, open_flags)
                    may_open(path, acc_mode, open_flag)
                        /* new location of MAY_EXEC vs path_noexec() test */
                        inode_permission(inode, MAY_OPEN | acc_mode)
                            security_inode_permission(inode, acc_mode)
                    vfs_open(path, file)
                        do_dentry_open(file, path->dentry->d_inode, open)
                            security_file_open(f)
                            open()
        /* old location of path_noexec() test */

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Aleksa Sarai
    Cc: Christian Brauner
    Cc: Dmitry Vyukov
    Cc: Eric Biggers
    Cc: Tetsuo Handa
    Link: http://lkml.kernel.org/r/20200605160013.3954297-4-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The execve(2)/uselib(2) syscalls have always rejected non-regular files.
    Recently, it was noticed that a deadlock was introduced when trying to
    execute pipes, as the S_ISREG() test was happening too late. This was
    fixed in commit 73601ea5b7b1 ("fs/open.c: allow opening only regular files
    during execve()"), but it was added after inode_permission() had already
    run, which meant LSMs could see bogus attempts to execute non-regular
    files.

    Move the test into the other inode type checks (which already look for
    other pathological conditions[1]). Since there is no need to use
    FMODE_EXEC while we still have access to "acc_mode", also switch the test
    to MAY_EXEC.

    Also include a comment with the redundant S_ISREG() checks at the end of
    execve(2)/uselib(2) to note that they are present to avoid any mistakes.

    My notes on the call path, and related arguments, checks, etc:

    do_open_execat()
        struct open_flags open_exec_flags = {
            .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
            .acc_mode = MAY_EXEC,
            ...
        do_filp_open(dfd, filename, open_flags)
            path_openat(nameidata, open_flags, flags)
                file = alloc_empty_file(open_flags, current_cred());
                do_open(nameidata, file, open_flags)
                    may_open(path, acc_mode, open_flag)
                        /* new location of MAY_EXEC vs S_ISREG() test */
                        inode_permission(inode, MAY_OPEN | acc_mode)
                            security_inode_permission(inode, acc_mode)
                    vfs_open(path, file)
                        do_dentry_open(file, path->dentry->d_inode, open)
                            /* old location of FMODE_EXEC vs S_ISREG() test */
                            security_file_open(f)
                            open()

    [1] https://lore.kernel.org/lkml/202006041910.9EF0C602@keescook/

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Alexander Viro
    Cc: Christian Brauner
    Cc: Dmitry Vyukov
    Cc: Eric Biggers
    Cc: Tetsuo Handa
    Link: http://lkml.kernel.org/r/20200605160013.3954297-3-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Patch series "Relocate execve() sanity checks", v2.

    While looking at the code paths for the proposed O_MAYEXEC flag, I saw
    some things that looked like they should be fixed up.

    exec: Change uselib(2) IS_SREG() failure to EACCES
    This just regularizes the return code on uselib(2).

    exec: Move S_ISREG() check earlier
    This moves the S_ISREG() check even earlier than it was already.

    exec: Move path_noexec() check earlier
    This adds the path_noexec() check to the same place as the
    S_ISREG() check.

    This patch (of 3):

    Change uselib(2)'s S_ISREG() error return to EACCES instead of EINVAL so
    the behavior matches execve(2), and the seemingly documented value. The
    "not a regular file" failure mode of execve(2) is explicitly
    documented[1], but it is not mentioned in uselib(2)[2] which does,
    however, say that open(2) and mmap(2) errors may apply. The documentation
    for open(2) does not include a "not a regular file" error[3], but mmap(2)
    does[4], and it is EACCES.

    [1] http://man7.org/linux/man-pages/man2/execve.2.html#ERRORS
    [2] http://man7.org/linux/man-pages/man2/uselib.2.html#ERRORS
    [3] http://man7.org/linux/man-pages/man2/open.2.html#ERRORS
    [4] http://man7.org/linux/man-pages/man2/mmap.2.html#ERRORS

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Acked-by: Christian Brauner
    Cc: Aleksa Sarai
    Cc: Alexander Viro
    Cc: Dmitry Vyukov
    Cc: Eric Biggers
    Cc: Tetsuo Handa
    Link: http://lkml.kernel.org/r/20200605160013.3954297-1-keescook@chromium.org
    Link: http://lkml.kernel.org/r/20200605160013.3954297-2-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The documentation says "%e" is the "executable filename", but in
    practice it can be changed by things like prctl(2) PR_SET_NAME. People
    who use "%e" in core_pattern are surprised when they find they get the
    thread name instead of the executable filename.

    This is either a documentation bug or a code bug. Since "%e" has behaved
    this way for a long time, "fixing" the code could bring another surprise
    for users.

    So we just "fix" the documentation. In addition, for users who really
    need the "executable filename" in core_pattern, we introduce a new "%f"
    for the real executable filename. We already have "%E" for the executable
    path in the kernel, so just reuse most of its code for the newly added
    "%f" format.

    Signed-off-by: Lepton Wu
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200701031432.2978761-1-ytht.net@gmail.com
    Signed-off-by: Linus Torvalds

    Lepton Wu
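
    To illustrate why "%e" can differ from the executable filename, the
    sketch below renames the calling thread with PR_SET_NAME and reads the
    name back with PR_GET_NAME. It is a userspace illustration only and does
    not touch core_pattern; the helper name set_and_get_comm() is
    hypothetical.

```c
#include <string.h>
#include <sys/prctl.h>

/*
 * Hypothetical helper: set the calling thread's name and read it back.
 * The kernel keeps at most 16 bytes including the terminating NUL, so
 * longer names are silently truncated.
 * Returns 0 on success, -1 on failure.
 */
int set_and_get_comm(const char *name, char out[16])
{
    if (prctl(PR_SET_NAME, (unsigned long)name, 0, 0, 0) != 0)
        return -1;

    memset(out, 0, 16);
    if (prctl(PR_GET_NAME, (unsigned long)out, 0, 0, 0) != 0)
        return -1;
    return 0;
}
```

    After such a call the thread name no longer matches the binary on disk,
    which is the situation the commit message describes for "%e".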
     
  • The kernel signalfd4() syscall returns different error codes depending
    on whether it is called in compat or native mode. This behaviour makes
    correct emulation in qemu and testing programs like LTP more complicated.

    Fix the code to always return, in both modes, EFAULT for inaccessible
    user memory and EINVAL when called with an invalid signal mask size.

    Signed-off-by: Helge Deller
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Laurent Vivier
    Link: http://lkml.kernel.org/r/20200530100707.GA10159@ls3530.fritz.box
    Signed-off-by: Linus Torvalds

    Helge Deller
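
    The two error cases can be probed from userspace with the raw syscall.
    This is a sketch under stated assumptions, not the LTP test: the helper
    name signalfd4_errno() is hypothetical, and KERNEL_SIGSET_SIZE assumes
    the kernel's sigset_t is 8 bytes, as on x86-64.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: the kernel-side sigset_t is 8 bytes (true on x86-64). */
#define KERNEL_SIGSET_SIZE 8

/*
 * Hypothetical helper: invoke signalfd4() directly and return the errno,
 * or 0 if the call unexpectedly succeeds (the new fd is closed again).
 */
int signalfd4_errno(const void *mask, size_t sizemask)
{
    long fd = syscall(SYS_signalfd4, (long)-1, mask, sizemask, (long)0);

    if (fd >= 0) {
        close((int)fd);
        return 0;
    }
    return errno;
}
```

    With the fix applied, a wrong mask size is expected to yield EINVAL and
    an inaccessible mask pointer with a correct size is expected to yield
    EFAULT.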
     
  • If data clusters == 0, fat_ra_init() calls the ->ent_blocknr() for the
    cluster beyond ->max_clusters.

    This checks the limit before initialization to suppress the warning.

    Reported-by: syzbot+756199124937b31a9b7e@syzkaller.appspotmail.com
    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/87mu462sv4.fsf@mail.parknet.co.jp
    Signed-off-by: Linus Torvalds

    OGAWA Hirofumi
     
  • Rationale:
    Reduces attack surface on kernel devs opening the links for MITM
    as HTTPS traffic is much harder to manipulate.

    Deterministic algorithm:
    For each file:
      If not .svg:
        For each line:
          If doesn't contain `xmlns`:
            For each link, `http://[^# ]*(?:\w|/)`:
              If neither `gnu\.org/license`, nor `mozilla\.org/MPL`:
                If both the HTTP and HTTPS versions
                return 200 OK and serve the same content:
                  Replace HTTP with HTTPS.

    Signed-off-by: Alexander A. Klimov
    Signed-off-by: Andrew Morton
    Acked-by: OGAWA Hirofumi
    Link: http://lkml.kernel.org/r/20200708200409.22293-1-grandmaster@al2klimov.de
    Signed-off-by: Linus Torvalds

    Alexander A. Klimov
     
  • There is no need to hold write_lock in fat_ioctl_get_attributes.
    Taking write_lock may hurt the concurrency of fat_ioctl_get_attributes.

    Signed-off-by: Yubo Feng
    Signed-off-by: Andrew Morton
    Acked-by: OGAWA Hirofumi
    Link: http://lkml.kernel.org/r/1593308053-12702-1-git-send-email-fengyubo3@huawei.com
    Signed-off-by: Linus Torvalds

    Yubo Feng
     
  • The 64 bit ino is being compared to the product of two u32 values,
    however, the multiplication is being performed using a 32 bit multiply so
    there is a potential of an overflow. To be fully safe, cast uspi->s_ncg
    to a u64 to ensure a 64 bit multiplication occurs to avoid any chance of
    overflow.

    Fixes: f3e2a520f5fb ("ufs: NFS support")
    Signed-off-by: Colin Ian King
    Signed-off-by: Andrew Morton
    Cc: Evgeniy Dushistov
    Cc: Alexey Dobriyan
    Link: http://lkml.kernel.org/r/20200715170355.1081713-1-colin.king@canonical.com
    Addresses-Coverity: ("Unintentional integer overflow")
    Signed-off-by: Linus Torvalds

    Colin Ian King
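
    The overflow described above is easy to reproduce in isolation. This is
    a standalone sketch of the arithmetic, not the ufs code; the function
    names are hypothetical.

```c
#include <stdint.h>

/*
 * The multiplication is done in 32 bits and only then widened, so the
 * result wraps modulo 2^32 before it is converted to 64 bits.
 */
uint64_t product_narrow(uint32_t a, uint32_t b)
{
    return a * b;
}

/*
 * Casting one operand first forces a 64 bit multiplication, which cannot
 * overflow for any pair of 32 bit inputs.
 */
uint64_t product_wide(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}
```

    For a = b = 0x10000 the narrow version yields 0, while the wide version
    yields 0x100000000, so a 64 bit value compared against the narrow
    product can pass or fail a bounds check incorrectly.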
     
  • Add macros for nilfs_<level>(sb, fmt, ...) and convert the uses of
    'nilfs_msg(sb, KERN_<LEVEL>, ...)' to 'nilfs_<level>(sb, ...)' so nilfs2
    uses a logging style more like the typical kernel logging style.

    Miscellanea:

    o Realign arguments for these uses

    Signed-off-by: Joe Perches
    Signed-off-by: Ryusuke Konishi
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/1595860111-3920-4-git-send-email-konishi.ryusuke@gmail.com
    Signed-off-by: Linus Torvalds

    Joe Perches
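
    The general pattern of per-level logging macros that fold the level into
    the format string can be sketched in userspace as follows. Everything
    here is hypothetical (demo_msg(), demo_err(), demo_warn(), the "<N>"
    prefixes and the output layout); it is not the nilfs2 code. The macros
    rely on the GNU `##__VA_ARGS__` extension.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical level prefixes, concatenated onto the format string. */
#define DEMO_ERR  "<3>"
#define DEMO_WARN "<4>"

/*
 * Hypothetical logger: peel an optional "<N>" level prefix off the format
 * string, then write "<N>demo (dev): message" into `out`.
 * Returns the message length, or -1 on error or truncated prefix.
 */
int demo_msg(char *out, size_t n, const char *dev, const char *fmt, ...)
{
    const char *p = fmt;
    char level = '6';            /* default when no prefix is present */
    va_list ap;
    int used, rest;

    if (p[0] == '<' && p[1] != '\0' && p[2] == '>') {
        level = p[1];
        p += 3;
    }

    used = snprintf(out, n, "<%c>demo (%s): ", level, dev);
    if (used < 0 || (size_t)used >= n)
        return -1;

    va_start(ap, fmt);
    rest = vsnprintf(out + used, n - (size_t)used, p, ap);
    va_end(ap);
    return rest < 0 ? -1 : used + rest;
}

/* Per-level wrappers: the level is no longer a separate argument. */
#define demo_err(out, n, dev, fmt, ...) \
    demo_msg(out, n, dev, DEMO_ERR fmt, ##__VA_ARGS__)
#define demo_warn(out, n, dev, fmt, ...) \
    demo_msg(out, n, dev, DEMO_WARN fmt, ##__VA_ARGS__)
```

    A call such as demo_err(buf, sizeof buf, "sda1", "bad block %d", 7)
    expands to a single demo_msg() call whose format string already carries
    the level, which is the idea behind dropping the separate level argument.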
     
  • Reduce object size a bit by removing the KERN_<LEVEL> as a separate
    argument and adding it to the format string.

    Reduce overall object size by about 0.5% (x86-64 defconfig w/ nilfs2)

    old:
    $ size -t fs/nilfs2/built-in.a | tail -1
    191738 8676 44 200458 30f0a (TOTALS)

    new:
    $ size -t fs/nilfs2/built-in.a | tail -1
    190971 8676 44 199691 30c0b (TOTALS)

    Signed-off-by: Joe Perches
    Signed-off-by: Ryusuke Konishi
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/1595860111-3920-3-git-send-email-konishi.ryusuke@gmail.com
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Patch series "nilfs2 updates".

    This patch (of 3):

    unlock_new_inode() is only meant to be called after a new inode has
    already been inserted into the hash table. But nilfs_new_inode() can call
    it even before it has inserted the inode, triggering the WARNING in
    unlock_new_inode(). Fix this by only calling unlock_new_inode() if the
    inode has the I_NEW flag set, indicating that it's in the table.

    Signed-off-by: Eric Biggers
    Signed-off-by: Ryusuke Konishi
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/1595860111-3920-1-git-send-email-konishi.ryusuke@gmail.com
    Link: http://lkml.kernel.org/r/1595860111-3920-2-git-send-email-konishi.ryusuke@gmail.com
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • When truncating a file to a size within the last allowed logical block,
    block_to_path() is called with the *next* block. This exceeds the limit,
    causing the "block %ld too big" error message to be printed.

    This case isn't actually an error; there are just no more blocks past that
    point. So, remove this error message.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Biggers
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Qiujun Huang
    Link: http://lkml.kernel.org/r/20200628060846.682158-7-ebiggers@kernel.org
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • The minix filesystem reads its maximum file size from its on-disk
    superblock. This value isn't necessarily a multiple of the block size.
    When it's not, the V1 block mapping code doesn't allow mapping the last
    possible block. Commit 6ed6a722f9ab ("minixfs: fix block limit check")
    fixed this in the V2 mapping code. Fix it in the V1 mapping code too.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Biggers
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Qiujun Huang
    Link: http://lkml.kernel.org/r/20200628060846.682158-6-ebiggers@kernel.org
    Signed-off-by: Linus Torvalds

    Eric Biggers
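
    The effect of a maximum file size that is not a multiple of the block
    size can be shown with two small limit checks. This is a sketch of the
    arithmetic only, with hypothetical function names; it is not the minix
    mapping code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Limit check based on floor division: when max_size is not a multiple of
 * block_size, the trailing partial block is rejected.
 */
bool block_ok_floor(uint64_t block, uint64_t max_size, uint32_t block_size)
{
    return block < max_size / block_size;
}

/*
 * Limit check based on the block's starting byte offset: a block is
 * mappable as long as it starts before max_size, so the trailing partial
 * block is accepted.
 */
bool block_ok_bytes(uint64_t block, uint64_t max_size, uint32_t block_size)
{
    return block * block_size < max_size;
}
```

    With max_size = 2500 and block_size = 1024, block 2 starts at byte 2048
    and still holds valid data: block_ok_floor() rejects it, while
    block_ok_bytes() accepts it and rejects block 3.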
     
  • The minix filesystem leaves super_block::s_maxbytes at MAX_NON_LFS rather
    than setting it to the actual filesystem-specific limit. This is broken
    because it means userspace doesn't see the standard behavior like getting
    EFBIG and SIGXFSZ when exceeding the maximum file size.

    Fix this by setting s_maxbytes correctly.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Biggers
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Qiujun Huang
    Link: http://lkml.kernel.org/r/20200628060846.682158-5-ebiggers@kernel.org
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • If the minix filesystem tries to map a very large logical block number to
    its on-disk location, block_to_path() can return offsets that are too
    large, causing out-of-bounds memory accesses when accessing indirect index
    blocks. This should be prevented by the check against the maximum file
    size, but this doesn't work because the maximum file size is read directly
    from the on-disk superblock and isn't validated itself.

    Fix this by validating the maximum file size at mount time.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Reported-by: syzbot+c7d9ec7a1a7272dd71b3@syzkaller.appspotmail.com
    Reported-by: syzbot+3b7b03a0c28948054fb5@syzkaller.appspotmail.com
    Reported-by: syzbot+6e056ee473568865f3e6@syzkaller.appspotmail.com
    Signed-off-by: Eric Biggers
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Qiujun Huang
    Cc:
    Link: http://lkml.kernel.org/r/20200628060846.682158-4-ebiggers@kernel.org
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • If an inode has no links, we need to mark it bad rather than allowing it
    to be accessed. This avoids WARNINGs in inc_nlink() and drop_nlink() when
    doing directory operations on a fuzzed filesystem.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Reported-by: syzbot+a9ac3de1b5de5fb10efc@syzkaller.appspotmail.com
    Reported-by: syzbot+df958cf5688a96ad3287@syzkaller.appspotmail.com
    Signed-off-by: Eric Biggers
    Signed-off-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Qiujun Huang
    Cc:
    Link: http://lkml.kernel.org/r/20200628060846.682158-3-ebiggers@kernel.org
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • Patch series "fs/minix: fix syzbot bugs and set s_maxbytes".

    This series fixes all syzbot bugs in the minix filesystem:

    KASAN: null-ptr-deref Write in get_block
    KASAN: use-after-free Write in get_block
    KASAN: use-after-free Read in get_block
    WARNING in inc_nlink
    KMSAN: uninit-value in get_block
    WARNING in drop_nlink

    It also fixes the minix filesystem to set s_maxbytes correctly, so that
    userspace sees the correct behavior when exceeding the max file size.

    This patch (of 6):

    sb_getblk() can fail, so check its return value.

    This fixes a NULL pointer dereference.

    Originally from Qiujun Huang.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Reported-by: syzbot+4a88b2b9dc280f47baf4@syzkaller.appspotmail.com
    Signed-off-by: Eric Biggers
    Signed-off-by: Andrew Morton
    Cc: Qiujun Huang
    Cc: Alexander Viro
    Cc:
    Link: http://lkml.kernel.org/r/20200628060846.682158-1-ebiggers@kernel.org
    Link: http://lkml.kernel.org/r/20200628060846.682158-2-ebiggers@kernel.org
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • Both exec and exit want to ensure that the uaccess routines actually do
    access user pointers. Use the newly added force_uaccess_begin helper
    instead of an open coded set_fs for that to prepare for kernel builds
    where set_fs() does not exist.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Nick Hu
    Cc: Greentime Hu
    Cc: Vincent Chen
    Cc: Paul Walmsley
    Cc: Palmer Dabbelt
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/20200710135706.537715-7-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • syzbot found issues with having hugetlbfs on a union/overlay as reported
    in [1]. Due to the limitations (no write) and special functionality of
    hugetlbfs, it does not work well in filesystem stacking. There are no
    known use cases for hugetlbfs stacking. Rather than making modifications
    to get hugetlbfs working in such environments, simply prevent stacking.

    [1] https://lore.kernel.org/linux-mm/000000000000b4684e05a2968ca6@google.com/

    Reported-by: syzbot+d6ec23007e951dadf3de@syzkaller.appspotmail.com
    Suggested-by: Amir Goldstein
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Acked-by: Miklos Szeredi
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Colin Walters
    Link: http://lkml.kernel.org/r/80f869aa-810d-ef6c-8888-b46cee135907@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Recently we found an issue in our production environment: when memcg
    oom is triggered, the oom killer doesn't choose the process with the
    largest resident memory but chooses the first scanned process. Note that
    all processes in this memcg have the same oom_score_adj, so the oom
    killer should choose the process with the largest resident memory.

    Below is part of the oom info, which is enough to analyze this issue.
    [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
    [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
    [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
    [...]
    [7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
    [7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause
    [7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas
    [7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron
    [7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord
    [7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd
    [7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python
    [7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent
    [7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat
    [7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent
    [7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3
    [7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client
    [7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client
    [7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner
    [7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su
    [7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2
    [7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python
    [7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p
    [7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2
    [7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster
    [7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout
    [7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim
    [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
    [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
    [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    We can see that the first scanned process, 5740 (pause), was killed, but
    its rss is only one page. That is because, when we calculate the oom
    badness in oom_badness(), we always ignore negative points and convert
    all of them to 1. Since all the processes in the targeted memcg have the
    same oom_score_adj of -998, the points of these processes are all
    negative. As a result, the first scanned process is killed.

    The oom_score_adj (-998) in this memcg is set by kubelet, because it is
    a Guaranteed pod, which has higher priority to avoid being killed by
    system oom.

    To fix this issue, we should make the calculation of oom points more
    accurate. We can achieve this by converting chosen_point from 'unsigned
    long' to 'long'.
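
    The effect can be sketched in userspace with three rows from the log
    above. This is a simplified model, not the kernel's oom_badness(): it
    assumes 4 kB pages, takes totalpages to be the 16777216kB limit
    expressed in pages (4194304), and keeps the first task on ties. All
    names in the block are hypothetical.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096L   /* assumed page size */

struct task_sample {
    const char *name;
    long rss_pages;
    long pgtables_bytes;
    long swapents;
    long oom_score_adj;
};

/* Signed score: memory footprint in pages plus the (negative) adjustment. */
static long badness_signed(const struct task_sample *t, long totalpages)
{
    long points = t->rss_pages + t->swapents +
                  t->pgtables_bytes / SKETCH_PAGE_SIZE;
    long adj = t->oom_score_adj * (totalpages / 1000);

    return points + adj;
}

/* Old behaviour as described above: every non-positive score becomes 1. */
static long badness_clamped(const struct task_sample *t, long totalpages)
{
    long points = badness_signed(t, totalpages);

    return points > 0 ? points : 1;
}

/* Pick the task with the highest score; the first task wins ties. */
static const struct task_sample *
pick_victim(const struct task_sample *tasks, size_t n, long totalpages,
            long (*badness)(const struct task_sample *, long))
{
    const struct task_sample *chosen = NULL;
    long chosen_points = LONG_MIN;

    for (size_t i = 0; i < n; i++) {
        long points = badness(&tasks[i], totalpages);

        if (points > chosen_points) {
            chosen = &tasks[i];
            chosen_points = points;
        }
    }
    return chosen;
}

/* rss, pgtables_bytes, swapents and oom_score_adj copied from the log. */
static const struct task_sample sample_tasks[] = {
    { "pause",          1,       32768,    0, -998 },
    { "odin-log-agent", 64239,   774144,   0, -998 },
    { "data_sim",       3967322, 37101568, 0, -998 },
};
```

    With badness_clamped() every task scores 1, so pick_victim() returns the
    first entry (pause). With badness_signed() all scores are negative but
    distinct, and data_sim, which has by far the largest rss, is selected.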

    [cai@lca.pw: reported an issue in the previous version]
    [mhocko@suse.com: fixed the issue reported by Cai]
    [mhocko@suse.com: add the comment in proc_oom_score()]
    [laoar.shao@gmail.com: v3]
    Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com

    Signed-off-by: Yafang Shao
    Signed-off-by: Andrew Morton
    Tested-by: Naresh Kamboju
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • The keys in smaps output are padded to a fixed width with spaces. The
    only exception is THPeligible, which uses tabs (only since commit
    c06306696f83 ("mm: thp: fix false negative of shmem vma's THP eligibility")).

    Unify the output formatting to save time debugging some naïve parsers.
    (Part of the unification is also aligning FilePmdMapped with others.)
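
    For parsers that cannot rely on the padding, a whitespace-tolerant
    approach avoids the problem altogether. The sketch below is one possible
    way to read a "Key: value" line with sscanf(), where a space in the
    format string matches any run of spaces or tabs. parse_smaps_line() is a
    hypothetical helper, and it is only meant for key/value lines: mapping
    header lines (address ranges) must be filtered out by the caller.

```c
#include <stdio.h>

/*
 * Parse one "Key:   value" line as found in smaps output.
 * The key (without the colon) is stored in 'key', the number in 'value'.
 * Returns 1 if both were found, 0 otherwise (e.g. for "VmFlags: rd ex").
 */
static int parse_smaps_line(const char *line, char key[32], long *value)
{
    /* "%31[^:]" reads up to the colon; the blank after ':' skips
     * any mix of spaces and tabs before the number. */
    return sscanf(line, "%31[^:]: %ld", key, value) == 2;
}
```

    A parser that instead slices each line at a fixed column works for the
    space-padded keys but misreads a tab-separated THPeligible line, which is
    the kind of breakage the unified formatting is meant to prevent.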

    Signed-off-by: Michal Koutný
    Signed-off-by: Andrew Morton
    Acked-by: Yang Shi
    Cc: Alexey Dobriyan
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200728083207.17531-1-mkoutny@suse.com
    Signed-off-by: Linus Torvalds

    Michal Koutný