09 Mar, 2019

1 commit

  • Pull io_uring IO interface from Jens Axboe:
    "Second attempt at adding the io_uring interface.

    Since the first one, we've added basic unit testing of the three
    system calls, that resides in liburing like the other unit tests that
    we have so far. It'll take a while to get full coverage of it, but
    we're working towards it. I've also added two basic test programs to
    tools/io_uring. One uses the raw interface and has support for all the
    various features that io_uring supports outside of standard IO, like
    fixed files, fixed IO buffers, and polled IO. The other uses the
    liburing API, and is a simplified version of cp(1).

    This adds support for a new IO interface, io_uring.

    io_uring allows an application to communicate with the kernel through
    two rings, the submission queue (SQ) and completion queue (CQ) ring.
    This allows for very efficient handling of IOs, see the v5 posting for
    some basic numbers:

    https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/

    Outside of just efficiency, the interface is also flexible and
    extendable, and allows for future use cases like the upcoming NVMe
    key-value store API, networked IO, and so on. It also supports async
    buffered IO, something that we've always failed to support in the
    kernel.

    Outside of basic IO features, it supports async polled IO as well.
    This particular feature has already been tested at Facebook months ago
    for flash storage boxes, with 25-33% improvements. It makes polled IO
    actually useful for real world use cases, where even basic flash sees
    a nice win in terms of efficiency, latency, and performance. These
    boxes were IOPS bound before, now they are not.

    This series adds three new system calls. One for setting up an
    io_uring instance (io_uring_setup(2)), one for submitting/completing
    IO (io_uring_enter(2)), and one for aux functions like registering
    file sets, buffers, etc (io_uring_register(2)). Through the help of
    Arnd, I've coordinated the syscall numbers so merge on that front
    should be painless.

    Jon did a writeup of the interface a while back, which (except for
    minor details that have been tweaked) is still accurate. Find that
    here:

    https://lwn.net/Articles/776703/

    Huge thanks to Al Viro for helping get the reference cycle code
    correct, and to Jann Horn for his extensive reviews focused on both
    security and bugs in general.

    There's a userspace library that provides basic functionality for
    applications that don't need or want to care about how to fiddle with
    the rings directly. It has helpers to allow applications to easily set
    up an io_uring instance, and submit/complete IO through it without
    knowing about the intricacies of the rings. It also includes man pages
    (thanks to Jeff Moyer), and will continue to grow helper
    functions and features as time progresses. Find it here:

    git://git.kernel.dk/liburing

    Fio has full support for the raw interface, both in the form of an IO
    engine (io_uring), but also with a small test application (t/io_uring)
    that can exercise and benchmark the interface"
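
    The two-ring idea in the pull message (one queue for submissions, one
    for completions) can be modelled in userspace as a single-producer,
    single-consumer ring. The sketch below is a simplified model only: it
    is not the io_uring ABI, and the names (struct ring, ring_push,
    ring_pop, RING_ENTRIES) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u /* hypothetical size; must be a power of two */

static_assert((RING_ENTRIES & (RING_ENTRIES - 1)) == 0,
              "RING_ENTRIES must be a power of two");

/* One ring: the producer only writes tail, the consumer only writes head. */
struct ring {
    _Atomic uint32_t head;          /* next slot to consume */
    _Atomic uint32_t tail;          /* next slot to fill */
    uint64_t slots[RING_ENTRIES];
};

/* Producer side: returns false when the ring is full. */
static bool ring_push(struct ring *r, uint64_t v)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_ENTRIES)
        return false;
    r->slots[tail & (RING_ENTRIES - 1)] = v;
    /* Publish the slot contents before the new tail becomes visible. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool ring_pop(struct ring *r, uint64_t *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return false;
    *out = r->slots[head & (RING_ENTRIES - 1)];
    /* Release the slot back to the producer. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

    In this model an application would own the producer side of one ring
    (submissions) and the consumer side of another (completions), so
    neither side needs a lock to exchange entries.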

    * tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block:
    io_uring: add a few test tools
    io_uring: allow workqueue item to handle multiple buffered requests
    io_uring: add support for IORING_OP_POLL
    io_uring: add io_kiocb ref count
    io_uring: add submission polling
    io_uring: add file set registration
    net: split out functions related to registering inflight socket files
    io_uring: add support for pre-mapped user IO buffers
    block: implement bio helper to add iter bvec pages to bio
    io_uring: batch io_kiocb allocation
    io_uring: use fget/fput_many() for file references
    fs: add fget_many() and fput_many()
    io_uring: support for IO polling
    io_uring: add fsync support
    Add io_uring IO interface

    Linus Torvalds
     

06 Mar, 2019

1 commit

  • (Taken from https://bugzilla.kernel.org/show_bug.cgi?id=200647)

    'get_unused_fd_flags' in a kthread causes a kernel crash. It works fine on
    4.1, but crashes after 64 fds have been allocated. It also crashes on
    ubuntu1404/1604/1804, centos7.5, and the crash messages are almost the
    same.

    The crash message on centos7.5 is shown below:

    start fd 61
    start fd 62
    start fd 63
    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: __wake_up_common+0x2e/0x90
    PGD 0
    Oops: 0000 [#1] SMP
    Modules linked in: test(OE) xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter devlink sunrpc kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg ppdev pcspkr virtio_balloon parport_pc parport i2c_piix4 joydev ip_tables xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi virtio_scsi virtio_console virtio_net cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common crc32c_intel drm ata_piix serio_raw libata virtio_pci virtio_ring i2c_core
    virtio floppy dm_mirror dm_region_hash dm_log dm_mod
    CPU: 2 PID: 1820 Comm: test_fd Kdump: loaded Tainted: G OE ------------ 3.10.0-862.3.3.el7.x86_64 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
    task: ffff8e92b9431fa0 ti: ffff8e94247a0000 task.ti: ffff8e94247a0000
    RIP: 0010:__wake_up_common+0x2e/0x90
    RSP: 0018:ffff8e94247a2d18 EFLAGS: 00010086
    RAX: 0000000000000000 RBX: ffffffff9d09daa0 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffff9d09daa0
    RBP: ffff8e94247a2d50 R08: 0000000000000000 R09: ffff8e92b95dfda8
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9d09daa8
    R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003
    FS: 0000000000000000(0000) GS:ffff8e9434e80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 000000017c686000 CR4: 00000000000207e0
    Call Trace:
    __wake_up+0x39/0x50
    expand_files+0x131/0x250
    __alloc_fd+0x47/0x170
    get_unused_fd_flags+0x30/0x40
    test_fd+0x12a/0x1c0 [test]
    kthread+0xd1/0xe0
    ret_from_fork_nospec_begin+0x21/0x21
    Code: 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 49 89 fc 49 83 c4 08 53 48 83 ec 10 48 8b 47 08 89 55 cc 4c 89 45 d0 8b 08 49 39 c4 48 8d 78 e8 4c 8d 69 e8 75 08 eb 3b 4c 89 ef
    RIP __wake_up_common+0x2e/0x90
    RSP
    CR2: 0000000000000000

    This issue has existed since CentOS 7.5 (3.10.0-862); CentOS 7.4
    (3.10.0-693.21.1) is OK. Root cause: the item 'resize_wait' is not
    initialized before being used.

    Reported-by: Richard Zhang
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shuriyc Chu
     

28 Feb, 2019

1 commit

  • Some use cases repeatedly get and put references to the same file, but
    the only exposed interface does these one at a time. As each of
    these entails an atomic inc or dec on a shared structure, that cost can
    add up.

    Add fget_many(), which works just like fget(), except it takes an
    argument for how many references to get on the file. Ditto fput_many(),
    which can drop an arbitrary number of references to a file.
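
    The batching idea can be sketched in userspace with C11 atomics. This
    is an illustration under stated assumptions, not the kernel code:
    struct obj, obj_get_many() and obj_put_many() are hypothetical names,
    and the sketch leaves out the descriptor lookup that fget() performs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted object (not struct file). */
struct obj {
    atomic_long count;
};

/* Take `refs` references with one atomic add instead of `refs` increments. */
static void obj_get_many(struct obj *o, long refs)
{
    assert(refs > 0);
    atomic_fetch_add_explicit(&o->count, refs, memory_order_relaxed);
}

/*
 * Drop `refs` references with one atomic subtract.
 * Returns true when the last reference went away and the caller
 * would be responsible for freeing the object.
 */
static bool obj_put_many(struct obj *o, long refs)
{
    assert(refs > 0);
    long old = atomic_fetch_sub_explicit(&o->count, refs,
                                         memory_order_acq_rel);
    assert(old >= refs); /* dropping more than we hold is a bug */
    return old == refs;
}
```

    A caller that knows it will need eight references pays for one atomic
    operation on the shared counter rather than eight.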

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

29 Dec, 2018

1 commit

  • Pull char/misc driver updates from Greg KH:
    "Here is the big set of char and misc driver patches for 4.21-rc1.

    Lots of different types of driver things in here, as this tree seems
    to be the "collection of various driver subsystems not big enough to
    have their own git tree" lately.

    Anyway, some highlights of the changes in here:

    - binderfs: is it a rule that all driver subsystems will eventually
    grow to have their own filesystem? Binder now has one to handle the
    use of it in containerized systems.

    This was discussed at the Plumbers conference a few months ago and
    knocked into mergeable shape very fast by Christian Brauner. Who
    also has signed up to be another binder maintainer, showing a
    distinct lack of good judgement :)

    - binder updates and fixes

    - mei driver updates

    - fpga driver updates and additions

    - thunderbolt driver updates

    - soundwire driver updates

    - extcon driver updates

    - nvmem driver updates

    - hyper-v driver updates

    - coresight driver updates

    - pvpanic driver additions and reworking for more device support

    - lp driver updates. Yes really, it's _finally_ moved to the proper
    parallel port driver model, something I never thought I would see
    happen. Good stuff.

    - other tiny driver updates and fixes.

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'char-misc-4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (116 commits)
    MAINTAINERS: add another Android binder maintainer
    intel_th: msu: Fix an off-by-one in attribute store
    stm class: Add a reference to the SyS-T document
    stm class: Fix a module refcount leak in policy creation error path
    char: lp: use new parport device model
    char: lp: properly count the lp devices
    char: lp: use first unused lp number while registering
    char: lp: detach the device when parallel port is removed
    char: lp: introduce list to save port number
    bus: qcom: remove duplicated include from qcom-ebi2.c
    VMCI: Use memdup_user() rather than duplicating its implementation
    char/rtc: Use of_node_name_eq for node name comparisons
    misc: mic: fix a DMA pool free failure
    ptp: fix an IS_ERR() vs NULL check
    genwqe: Fix size check
    binder: implement binderfs
    binder: fix use-after-free due to ksys_close() during fdget()
    bus: fsl-mc: remove duplicated include files
    bus: fsl-mc: explicitly define the fsl_mc_command endianness
    misc: ti-st: make array read_ver_cmd static, shrinks object size
    ...

    Linus Torvalds
     

19 Dec, 2018

1 commit

  • 44d8047f1d8 ("binder: use standard functions to allocate fds")
    exposed a pre-existing issue in the binder driver.

    fdget() is used in ksys_ioctl() as a performance optimization.
    One of the rules associated with fdget() is that ksys_close() must
    not be called between the fdget() and the fdput(). There is a case
    where this requirement is not met in the binder driver which results
    in the reference count dropping to 0 when the device is still in
    use. This can result in use-after-free or other issues.

    If userspace has passed a file-descriptor for the binder driver using
    a BINDER_TYPE_FDA object, then ksys_close() is called on it when
    handling a binder_ioctl(BC_FREE_BUFFER) command. This violates
    the assumptions for using fdget().

    The problem is fixed by deferring the close using task_work_add(). A
    new variant of __close_fd() was created that returns a struct file
    with a reference. The fput() is deferred instead of using ksys_close().

    Fixes: 44d8047f1d87a ("binder: use standard functions to allocate fds")
    Suggested-by: Al Viro
    Signed-off-by: Todd Kjos
    Cc: stable
    Signed-off-by: Greg Kroah-Hartman

    Todd Kjos
     

28 Nov, 2018

1 commit


03 Apr, 2018

2 commits

  • Using the ksys_close() wrapper allows us to get rid of in-kernel calls
    to the sys_close() syscall. The ksys_ prefix denotes that this function
    is meant as a drop-in replacement for the syscall. In particular, it
    uses the same calling convention as sys_close(), with one subtle
    difference:

    The few places which checked the return value did not care about the return
    value re-writing in sys_close(), so simply use a wrapper around
    __close_fd().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using ksys_dup() and ksys_dup3() as helper functions allows us to
    avoid the in-kernel calls to the sys_dup() and sys_dup3() syscalls.
    The ksys_ prefix denotes that these functions are meant as a drop-in
    replacement for the syscalls. In particular, they use the same
    calling convention as sys_dup{,3}().

    In the near future, the fs-external callers of ksys_dup{,3}() should be
    converted to call do_dup2() directly. Then, ksys_dup{,3}() can be moved
    within sys_dup{,3}() again.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Alexander Viro
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

01 Feb, 2018

1 commit

  • Pull misc vfs updates from Al Viro:
    "All kinds of misc stuff, without any unifying topic, from various
    people.

    Neil's d_anon patch, several bugfixes, introduction of kvmalloc
    analogue of kmemdup_user(), extending bitfield.h to deal with
    fixed-endians, assorted cleanups all over the place..."

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits)
    alpha: osf_sys.c: use timespec64 where appropriate
    alpha: osf_sys.c: fix put_tv32 regression
    jffs2: Fix use-after-free bug in jffs2_iget()'s error handling path
    dcache: delete unused d_hash_mask
    dcache: subtract d_hash_shift from 32 in advance
    fs/buffer.c: fold init_buffer() into init_page_buffers()
    fs: fold __inode_permission() into inode_permission()
    fs: add RWF_APPEND
    sctp: use vmemdup_user() rather than badly open-coding memdup_user()
    snd_ctl_elem_init_enum_names(): switch to vmemdup_user()
    replace_user_tlv(): switch to vmemdup_user()
    new primitive: vmemdup_user()
    memdup_user(): switch to GFP_USER
    eventfd: fold eventfd_ctx_get() into eventfd_ctx_fileget()
    eventfd: fold eventfd_ctx_read() into eventfd_read()
    eventfd: convert to use anon_inode_getfd()
    nfs4file: get rid of pointless include of btrfs.h
    uvc_v4l2: clean copyin/copyout up
    vme_user: don't use __copy_..._user()
    usx2y: don't bother with memdup_user() for 16-byte structure
    ...

    Linus Torvalds
     

05 Dec, 2017

2 commits


18 Nov, 2017

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted stuff, really no common topic here"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: grab the lock instead of blocking in __fd_install during resizing
    vfs: stop clearing close on exec when closing a fd
    include/linux/fs.h: fix comment about struct address_space
    fs: make fiemap work from compat_ioctl
    coda: fix 'kernel memory exposure attempt' in fsync
    pstore: remove unneeded unlikely()
    vfs: remove unneeded unlikely()
    stubs for mount_bdev() and kill_block_super() in !CONFIG_BLOCK case
    make vfs_ustat() static
    do_handle_open() should be static
    elf_fdpic: fix unused variable warning
    fold destroy_super() into __put_super()
    new helper: destroy_unused_super()
    fix address space warnings in ipc/
    acct.h: get rid of detritus

    Linus Torvalds
     

06 Nov, 2017

2 commits


02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

07 Jul, 2017

1 commit


09 May, 2017

1 commit

  • __vmalloc* allows users to provide gfp flags for the underlying
    allocation. This API is quite popular

    $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
    77

    The only problem is that many people are not aware that they really want
    to give __GFP_HIGHMEM along with other flags because there is really no
    reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
    which are mapped to the kernel vmalloc space. About half of users don't
    use this flag, though. This signals that we make the API unnecessarily
    too complex.

    This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
    be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM
    are simplified and drop the flag.

    Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Matthew Wilcox
    Cc: Al Viro
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Cristopher Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

02 Mar, 2017

1 commit


28 Sep, 2016

1 commit

  • Propagate unsignedness for grand total of 149 bytes:

    $ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
    add/remove: 0/0 grow/shrink: 0/10 up/down: 0/-149 (-149)
    function old new delta
    set_close_on_exec 99 98 -1
    put_files_struct 201 200 -1
    get_close_on_exec 59 58 -1
    do_prlimit 498 497 -1
    do_execveat_common.isra 1662 1661 -1
    __close_fd 178 173 -5
    do_dup2 219 204 -15
    seq_show 685 660 -25
    __alloc_fd 384 357 -27
    dup_fd 718 646 -72

    It mostly comes from converting "unsigned int" to "long" for bit operations.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Al Viro

    Alexey Dobriyan
     

03 May, 2016

1 commit


15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

07 Dec, 2015

1 commit


06 Nov, 2015

1 commit


01 Nov, 2015

2 commits

  • We clear the close-on-exec flag when opening and closing files, and the
    bit was almost always already clear before. Avoid dirtying the
    cacheline if the clearing isn't necessary. That avoids unnecessary
    cacheline dirtying and bouncing in multi-socket environments.

    Eric Dumazet has a file descriptor benchmark that goes 4% faster from
    this on his two-socket machine. It's probably partly superlinear
    improvement due to getting slightly less spinlock contention on the
    file_lock spinlock due to less work in the critical section.
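
    The "test before clear" pattern described above can be sketched as
    follows. This is a minimal userspace illustration, assuming a plain
    unsigned long bitmap word; clear_bit_if_set() is a hypothetical helper
    name, not a kernel function.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Clear `bit` in `*word` only if it is currently set.
 * When the bit is already clear the function performs no store, so the
 * cacheline holding `*word` is only read and does not become dirty.
 * Returns true if a store was actually performed.
 */
static bool clear_bit_if_set(unsigned long *word, unsigned int bit)
{
    assert(bit < sizeof(unsigned long) * CHAR_BIT);
    unsigned long mask = 1UL << bit;

    if (!(*word & mask))
        return false;   /* common case: nothing to do, no write */
    *word &= ~mask;
    return true;
}
```

    The extra load is cheap compared with a store that forces other CPUs
    to give up their copy of the cacheline.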

    Tested-by: Eric Dumazet
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Al Viro points out that:
    > > * [Linux-specific aside] our __alloc_fd() can degrade quite badly
    > > with some use patterns. The cacheline pingpong in the bitmap is probably
    > > inevitable, unless we accept considerably heavier memory footprint,
    > > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
    > > to trigger - close(3);open(...); will have the next open() after that
    > > scanning the entire in-use bitmap.

    And Eric Dumazet has a somewhat realistic multithreaded microbenchmark
    that opens and closes a lot of sockets with minimal work per socket.

    This patch largely fixes it. We keep a 2nd-level bitmap of the open
    file bitmaps, showing which words are already full. So then we can
    traverse that second-level bitmap to efficiently skip already allocated
    file descriptors.

    On his benchmark, this improves performance by up to an order of
    magnitude, by avoiding the excessive open file bitmap scanning.
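
    A two-level bitmap of this kind can be sketched in userspace as below.
    It is an illustration only, assuming a tiny fixed-size table; struct
    fdmap, NWORDS and the fdmap_*() helpers are hypothetical names and do
    not mirror the kernel's fdtable layout.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NWORDS 4 /* hypothetical table size: NWORDS * BITS_PER_WORD slots */

struct fdmap {
    unsigned long open_bits[NWORDS]; /* bit set = descriptor in use */
    unsigned long full_bits;         /* bit w set = open_bits[w] is all ones */
};

/* Mark descriptor `fd` as used and update the second-level summary. */
static void fdmap_set(struct fdmap *m, size_t fd)
{
    size_t w = fd / BITS_PER_WORD;

    assert(w < NWORDS);
    m->open_bits[w] |= 1UL << (fd % BITS_PER_WORD);
    if (m->open_bits[w] == ~0UL)
        m->full_bits |= 1UL << w;
}

/* Mark descriptor `fd` as free; its word can no longer be full. */
static void fdmap_clear(struct fdmap *m, size_t fd)
{
    size_t w = fd / BITS_PER_WORD;

    assert(w < NWORDS);
    m->open_bits[w] &= ~(1UL << (fd % BITS_PER_WORD));
    m->full_bits &= ~(1UL << w);
}

/* Return the lowest free descriptor, or -1 if the table is full. */
static long fdmap_find_free(const struct fdmap *m)
{
    for (size_t w = 0; w < NWORDS; w++) {
        if (m->full_bits & (1UL << w))
            continue; /* one test skips BITS_PER_WORD descriptors */
        for (size_t b = 0; b < BITS_PER_WORD; b++)
            if (!(m->open_bits[w] & (1UL << b)))
                return (long)(w * BITS_PER_WORD + b);
    }
    return -1;
}
```

    With many low descriptors in use, the search touches one summary bit
    per full word instead of scanning every bit of the first-level bitmap.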

    Tested-and-acked-by: Eric Dumazet
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Jul, 2015

2 commits

  • __fget() does lockless fetch of pointer from the descriptor
    table, attempts to grab a reference and treats "it was already
    zero" as "it's already gone from the table, we just hadn't
    seen the store, let's fail". Unfortunately, that breaks the
    atomicity of dup2() - __fget() might see the old pointer,
    notice that it's been already dropped and treat that as
    "it's closed". What we should be getting is either the
    old file or new one, depending whether we come before or after
    dup2().

    Dmitry had the following test failing sometimes:

    #include <errno.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int fd;

    /* Reader thread: races read() on fd against dup2() in main(). */
    void *Thread(void *x) {
        char buf;
        int n = read(fd, &buf, 1);
        if (n != 1)
            exit(printf("read failed: n=%d errno=%d\n", n, errno));
        return 0;
    }

    int main()
    {
        fd = open("/dev/urandom", O_RDONLY);
        int fd2 = open("/dev/urandom", O_RDONLY);
        if (fd == -1 || fd2 == -1)
            exit(printf("open failed\n"));
        pthread_t th;
        pthread_create(&th, 0, Thread, 0);
        /* Replace fd while the reader may be looking it up. */
        if (dup2(fd2, fd) == -1)
            exit(printf("dup2 failed\n"));
        pthread_join(th, 0);
        if (close(fd) == -1)
            exit(printf("close failed\n"));
        if (close(fd2) == -1)
            exit(printf("close failed\n"));
        printf("DONE\n");
        return 0;
    }

    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Signed-off-by: Al Viro

    Eric Dumazet
     
  • Mateusz Guzik reported:

    Currently obtaining a new file descriptor results in locking fdtable
    twice - once in order to reserve a slot and second time to fill it.

    Holding the spinlock in __fd_install() is needed in case a resize is
    done, or to prevent a resize.

    Mateusz provided an RFC patch and a micro benchmark:
    http://people.redhat.com/~mguzik/pipebench.c

    A resize is an unlikely operation in a process lifetime,
    as table size is at least doubled at every resize.

    We can use RCU instead of the spinlock.

    __fd_install() must wait if a resize is in progress.

    The resize must block new __fd_install() callers from starting,
    and wait until ongoing installs have finished (synchronize_sched()).

    A resize should be attempted by a single thread so as not to waste resources.

    rcu_sched variant is used, as __fd_install() and expand_fdtable() run
    from process context.

    It gives us a ~30% speedup using pipebench on a dual Intel(R) Xeon(R)
    CPU E5-2696 v2 @ 2.50GHz

    Signed-off-by: Eric Dumazet
    Reported-by: Mateusz Guzik
    Acked-by: Mateusz Guzik
    Tested-by: Mateusz Guzik
    Signed-off-by: Al Viro

    Eric Dumazet
     

17 Apr, 2015

1 commit

  • This patch removes mm->mmap_sem from mm->exe_file read side.
    Also it kills dup_mm_exe_file() and moves exe_file duplication into
    dup_mmap() where both mmap_sems are locked.

    [akpm@linux-foundation.org: fix comment typo]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Davidlohr Bueso
    Cc: Al Viro
    Cc: Oleg Nesterov
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

11 Dec, 2014

1 commit

  • This patch replaces calls to get_unused_fd() with an equivalent call to
    get_unused_fd_flags(0) to preserve the current behavior for existing code.

    In a further patch, get_unused_fd() will be removed so that new code
    starts using get_unused_fd_flags(), with the hope that O_CLOEXEC could be
    used, either by default or chosen by userspace.

    Signed-off-by: Yann Droneaud
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yann Droneaud
     

13 Oct, 2014

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle were:

    - changes related to No-CBs CPUs and NO_HZ_FULL

    - RCU-tasks implementation

    - torture-test updates

    - miscellaneous fixes

    - locktorture updates

    - RCU documentation updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
    workqueue: Use cond_resched_rcu_qs macro
    workqueue: Add quiescent state between work items
    locktorture: Cleanup header usage
    locktorture: Cannot hold read and write lock
    locktorture: Fix __acquire annotation for spinlock irq
    locktorture: Support rwlocks
    rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
    locktorture: Document boot/module parameters
    rcutorture: Rename rcutorture_runnable parameter
    locktorture: Add test scenario for rwsem_lock
    locktorture: Add test scenario for mutex_lock
    locktorture: Make torture scripting account for new _runnable name
    locktorture: Introduce torture context
    locktorture: Support rwsems
    locktorture: Add infrastructure for torturing read locks
    torture: Address race in module cleanup
    locktorture: Make statistics generic
    locktorture: Teach about lock debugging
    locktorture: Support mutexes
    locktorture: Add documentation
    ...

    Linus Torvalds
     

09 Oct, 2014

1 commit


08 Sep, 2014

1 commit

  • RCU-tasks requires the occasional voluntary context switch
    from CPU-bound in-kernel tasks. In some cases, this requires
    instrumenting cond_resched(). However, there is some reluctance
    to countenance unconditionally instrumenting cond_resched() (see
    http://lwn.net/Articles/603252/), so this commit creates a separate
    cond_resched_rcu_qs() that may be used in place of cond_resched() in
    locations prone to long-duration in-kernel looping.

    This commit currently instruments only RCU-tasks. Future possibilities
    include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
    IPI usage.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

07 May, 2014

1 commit


13 Apr, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "The first vfs pile, with deep apologies for being very late in this
    window.

    Assorted cleanups and fixes, plus a large preparatory part of iov_iter
    work. There's a lot more of that, but it'll probably go into the next
    merge window - it *does* shape up nicely, removes a lot of
    boilerplate, gets rid of locking inconsistencies between aio_write and
    splice_write and I hope to get Kent's direct-io rewrite merged into
    the same queue, but some of the stuff after this point is having
    (mostly trivial) conflicts with the things already merged into
    mainline and with some I want more testing.

    This one passes LTP and xfstests without regressions, in addition to
    usual beating. BTW, readahead02 in ltp syscalls testsuite has started
    giving failures since "mm/readahead.c: fix readahead failure for
    memoryless NUMA nodes and limit readahead pages" - might be a false
    positive, might be a real regression..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    missing bits of "splice: fix racy pipe->buffers uses"
    cifs: fix the race in cifs_writev()
    ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
    kill generic_file_buffered_write()
    ocfs2_file_aio_write(): switch to generic_perform_write()
    ceph_aio_write(): switch to generic_perform_write()
    xfs_file_buffered_aio_write(): switch to generic_perform_write()
    export generic_perform_write(), start getting rid of generic_file_buffer_write()
    generic_file_direct_write(): get rid of ppos argument
    btrfs_file_aio_write(): get rid of ppos
    kill the 5th argument of generic_file_buffered_write()
    kill the 4th argument of __generic_file_aio_write()
    lustre: don't open-code kernel_recvmsg()
    ocfs2: don't open-code kernel_recvmsg()
    drbd: don't open-code kernel_recvmsg()
    constify blk_rq_map_user_iov() and friends
    lustre: switch to kernel_sendmsg()
    ocfs2: don't open-code kernel_sendmsg()
    take iov_iter stuff to mm/iov_iter.c
    process_vm_access: tidy up a bit
    ...

    Linus Torvalds
     

02 Apr, 2014

1 commit


01 Apr, 2014

1 commit

  • Pull RCU updates from Ingo Molnar:
    "Main changes:

    - Torture-test changes, including refactoring of rcutorture and
    introduction of a vestigial locktorture.

    - Real-time latency fixes.

    - Documentation updates.

    - Miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits)
    rcu: Provide grace-period piggybacking API
    rcu: Ensure kernel/rcu/rcu.h can be sourced/used stand-alone
    rcu: Fix sparse warning for rcu_expedited from kernel/ksysfs.c
    notifier: Substitute rcu_access_pointer() for rcu_dereference_raw()
    Documentation/memory-barriers.txt: Clarify release/acquire ordering
    rcutorture: Save kvm.sh output to log
    rcutorture: Add a lock_busted to test the test
    rcutorture: Place kvm-test-1-run.sh output into res directory
    rcutorture: Rename TREE_RCU-Kconfig.txt
    locktorture: Add kvm-recheck.sh plug-in for locktorture
    rcutorture: Gracefully handle NULL cleanup hooks
    locktorture: Add vestigial locktorture configuration
    rcutorture: Introduce "rcu" directory level underneath configs
    rcutorture: Rename kvm-test-1-rcu.sh
    rcutorture: Remove RCU dependencies from ver_functions.sh API
    rcutorture: Create CFcommon file for common Kconfig parameters
    rcutorture: Create config files for scripted test-the-test testing
    rcutorture: Add an rcu_busted to test the test
    locktorture: Add a lock-torture kernel module
    rcutorture: Abstract kvm-recheck.sh
    ...

    Linus Torvalds
     

23 Mar, 2014

1 commit

  • Commit bd2a31d522344 ("get rid of fget_light()") introduced the
    __fdget_pos() function, which returns the resulting file pointer and
    fdput flags combined in an 'unsigned long'. However, it also changed the
    behavior to return files with FMODE_PATH set, which shouldn't happen
    because read(), write(), lseek(), etc. aren't allowed on such files.
    This commit restores the old behavior.

    This regression actually had no effect on read() and write() since
    FMODE_READ and FMODE_WRITE are not set on file descriptors opened with
    O_PATH, but it did cause lseek() on a file descriptor opened with O_PATH
    to fail with ESPIPE rather than EBADF.

    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro

    Eric Biggers
     

10 Mar, 2014

1 commit

  • instead of returning the flags by reference, we can just have the
    low-level primitive return those in lower bits of unsigned long,
    with struct file * derived from the rest.
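
    The packing described above can be sketched in userspace C. This is
    only an illustration, not the kernel's code: struct demo_file, the
    DEMO_FLAG_* values, and the demo_* helpers are hypothetical names, and
    the sketch assumes that unsigned long is at least as wide as a pointer
    and that the pointed-to object is aligned to at least 4 bytes, so the
    two low bits of a valid pointer are always zero.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: unsigned long can hold a pointer (true on Linux targets). */
_Static_assert(sizeof(unsigned long) >= sizeof(void *),
               "unsigned long must be able to hold a pointer");

/* Hypothetical stand-in for struct file. The 4-byte alignment guarantees
 * that the two low bits of any valid pointer to it are zero. */
struct demo_file {
    _Alignas(4) int mode;
};

/* Illustrative flag values; not copied from the kernel headers. */
#define DEMO_FLAG_FPUT        1UL
#define DEMO_FLAG_POS_UNLOCK  2UL
#define DEMO_FLAG_MASK        3UL

/* Combine a pointer and up to two flag bits into one unsigned long. */
static unsigned long demo_pack(struct demo_file *f, unsigned long flags)
{
    uintptr_t bits = (uintptr_t)f;

    assert((bits & DEMO_FLAG_MASK) == 0);   /* low bits must be free */
    assert((flags & ~DEMO_FLAG_MASK) == 0); /* only two flag bits fit */
    return (unsigned long)bits | flags;
}

/* Recover the pointer by clearing the flag bits. */
static struct demo_file *demo_file_of(unsigned long v)
{
    return (struct demo_file *)(uintptr_t)(v & ~DEMO_FLAG_MASK);
}

/* Recover the flags by keeping only the low bits. */
static unsigned long demo_flags_of(unsigned long v)
{
    return v & DEMO_FLAG_MASK;
}
```

    A caller would pack once at lookup time and split the value again
    where the file and the flags are needed, which avoids passing the
    flags back through a pointer argument.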

    Signed-off-by: Al Viro

    Al Viro
     

18 Feb, 2014

1 commit

  • (Trivial patch.)

    If the code is looking at the RCU-protected pointer itself, but not
    dereferencing it, the rcu_dereference() functions can be downgraded to
    rcu_access_pointer(). This commit makes this downgrade in __alloc_fd(),
    which simply compares the RCU-protected pointer against NULL with no
    dereferencing.

    Signed-off-by: Paul E. McKenney
    Cc: Alexander Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

11 Feb, 2014

1 commit

  • Recently, due to a spike in connections per second, memcached on 3
    separate boxes triggered the OOM killer from accept. At the time the
    OOM killer was triggered there was 4GB out of 36GB free in zone 1. The
    problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
    hold a bitmap, and there was sufficient fragmentation that the largest
    page available was 8KiB.

    I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
    but I do agree that order 3 allocations are very likely to succeed.

    There are always pathologies where order > 0 allocations can fail when
    there are copious amounts of free memory available. Using the pigeon
    hole principle it is easy to show that it requires 1 page more than 50%
    of the pages being free to guarantee an order 1 (8KiB) allocation will
    succeed, 1 page more than 75% of the pages being free to guarantee an
    order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of
    the pages being free to guarantee an order 3 (32KiB) allocation will succeed.
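
    The pigeonhole bound above can be checked with a small userspace
    sketch. It is an illustration under stated assumptions, not kernel
    code: the demo_* names are hypothetical, blocks are taken to be
    naturally aligned runs of 2^order pages, and the total page count is
    assumed to be a multiple of the block size. The worst case leaves
    exactly one used page in every block, so one more free page than that
    forces at least one block to be entirely free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Smallest number of free pages that guarantees at least one free,
 * naturally aligned block of 2^order pages.
 * Assumes total_pages is a multiple of 2^order. */
static size_t demo_guaranteed_free_pages(size_t total_pages, unsigned order)
{
    size_t block = (size_t)1 << order;
    size_t blocks = total_pages / block;

    /* Worst case: one used page per block leaves blocks * (block - 1)
     * pages free with no whole block available. One more free page
     * must complete some block. */
    return blocks * (block - 1) + 1;
}

/* Returns true if used[] contains an aligned block of 2^order pages in
 * which every page is free. */
static bool demo_has_free_block(const bool *used, size_t total_pages,
                                unsigned order)
{
    size_t block = (size_t)1 << order;

    for (size_t start = 0; start + block <= total_pages; start += block) {
        bool all_free = true;

        for (size_t i = 0; i < block; i++) {
            if (used[start + i])
                all_free = false;
        }
        if (all_free)
            return true;
    }
    return false;
}
```

    With 64 pages, the function gives 33, 49 and 57 pages for orders 1, 2
    and 3, which is one page more than 50%, 75% and 87.5% of the total.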

    A server churning memory with a lot of small requests and replies, like
    memcached, is a common case that, if anything, will skew the odds
    against large pages being available.

    Therefore let's not give external applications a practical way to kill
    Linux server applications, and specify __GFP_NORETRY to the kmalloc in
    alloc_fdmem. Unless I am misreading the code, by the time the code
    reaches should_alloc_retry in __alloc_pages_slowpath (where
    __GFP_NORETRY becomes significant), we have already tried everything
    reasonable to allocate a page and the only thing left to do is wait. So
    not waiting and falling back to vmalloc immediately seems like the
    reasonable thing to do even if there wasn't a chance of triggering the
    OOM killer.

    Signed-off-by: "Eric W. Biederman"
    Cc: Eric Dumazet
    Acked-by: David Rientjes
    Cc: Cong Wang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman