28 Sep, 2016

1 commit

  • CURRENT_TIME macro is not appropriate for filesystems as it
    doesn't use the right granularity for filesystem timestamps.
    Use current_time() instead.

    CURRENT_TIME is also not y2038 safe.

    This is also in preparation for the patch that transitions
    vfs timestamps to use 64 bit time and hence make them
    y2038 safe. As part of the effort current_time() will be
    extended to do range checks. Hence, it is necessary for all
    file system timestamps to use current_time(). Also,
    current_time() will be transitioned along with vfs to be
    y2038 safe.

    Note that whenever a single call to current_time() is used
    to change timestamps in different inodes, it is because they
    share the same time granularity.

    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    Acked-by: Felipe Balbi
    Acked-by: Steven Whitehouse
    Acked-by: Ryusuke Konishi
    Acked-by: David Sterba
    Signed-off-by: Al Viro

    Deepa Dinamani
     

24 Jun, 2016

3 commits

  • Introduce a function may_open_dev that tests MNT_NODEV and a new
    superblock flab SB_I_NODEV. Use this new function in all of the
    places where MNT_NODEV was previously tested.

    Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those
    filesystems should never support device nodes, and a simple superblock
    flags makes that very hard to get wrong. With SB_I_NODEV set if any
    device nodes somehow manage to show up on on a filesystem those
    device nodes will be unopenable.

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Set SB_I_NOEXEC on mqueuefs to ensure small implementation mistakes
    do not result in executable on mqueuefs by accident.

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Today what is normally called data (the mount options) is not passed
    to fill_super through mount_ns.

    Pass the mount options and the namespace separately to mount_ns so
    that filesystems such as proc that have mount options, can use
    mount_ns.

    Pass the user namespace to mount_ns so that the standard permission
    check that verifies the mounter has permissions over the namespace can
    be performed in mount_ns instead of in each filesystems .mount method.
    Thus removing the duplication between mqueuefs and proc in terms of
    permission checks. The extra permission check does not currently
    affect the rpc_pipefs filesystem and the nfsd filesystem as those
    filesystems do not currently allow unprivileged mounts. Without
    unpvileged mounts it is guaranteed that the caller has already passed
    capable(CAP_SYS_ADMIN) which guarantees extra permission check will
    pass.

    Update rpc_pipefs and the nfsd filesystem to ensure that the network
    namespace reference is always taken in fill_super and always put in kill_sb
    so that the logic is simpler and so that errors originating inside of
    fill_super do not cause a network namespace leak.

    Acked-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
    ago with promise that one day it will be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE assumed to be equal to
    PAGE_SIZE. And it's constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> ;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is revert of changes to
    PAGE_CAHCE_ALIGN definition: we are going to drop it later.

    There are few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation also
    will be addressed with the separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
    most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

07 Aug, 2015

1 commit

  • A while back, the message queue implementation in the kernel was
    improved to use btrees to speed up retrieval of messages, in commit
    d6629859b36d ("ipc/mqueue: improve performance of send/recv").

    That patch introducing the improved kernel handling of message queues
    (using btrees) has, as a by-product, changed the meaning of the QSIZE
    field in the pseudo-file created for the queue. Before, this field
    reflected the size of the user-data in the queue. Since, it also takes
    kernel data structures into account. For example, if 13 bytes of user
    data are in the queue, on my machine the file reports a size of 61
    bytes.

    There was some discussion on this topic before (for example
    https://lkml.org/lkml/2014/10/1/115). Commenting on a th lkml, Michael
    Kerrisk gave the following background
    (https://lkml.org/lkml/2015/6/16/74):

    The pseudofiles in the mqueue filesystem (usually mounted at
    /dev/mqueue) expose fields with metadata describing a message
    queue. One of these fields, QSIZE, as originally implemented,
    showed the total number of bytes of user data in all messages in
    the message queue, and this feature was documented from the
    beginning in the mq_overview(7) page. In 3.5, some other (useful)
    work happened to break the user-space API in a couple of places,
    including the value exposed via QSIZE, which now includes a measure
    of kernel overhead bytes for the queue, a figure that renders QSIZE
    useless for its original purpose, since there's no way to deduce
    the number of overhead bytes consumed by the implementation.
    (The other user-space breakage was subsequently fixed.)

    This patch removes the accounting of kernel data structures in the
    queue. Reporting the size of these data-structures in the QSIZE field
    was a breaking change (see Michael's comment above). Without the QSIZE
    field reporting the total size of user-data in the queue, there is no
    way to deduce this number.

    It should be noted that the resource limit RLIMIT_MSGQUEUE is counted
    against the worst-case size of the queue (in both the old and the new
    implementation). Therefore, the kernel overhead accounting in QSIZE is
    not necessary to help the user understand the limitations RLIMIT imposes
    on the processes.

    Signed-off-by: Marcus Gelderie
    Acked-by: Doug Ledford
    Acked-by: Michael Kerrisk
    Acked-by: Davidlohr Bueso
    Cc: David Howells
    Cc: Alexander Viro
    Cc: John Duffy
    Cc: Arto Bendiken
    Cc: Manfred Spraul
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marcus Gelderie
     

08 May, 2015

1 commit

  • This patch moves the wakeup_process() invocation so it is not done under
    the info->lock by making use of a lockless wake_q. With this change, the
    waiter is woken up once it is STATE_READY and it does not need to loop
    on SMP if it is still in STATE_PENDING. In the timeout case we still need
    to grab the info->lock to verify the state.

    This change should also avoid the introduction of preempt_disable() in -rt
    which avoids a busy-loop which pools for the STATE_PENDING -> STATE_READY
    change if the waiter has a higher priority compared to the waker.

    Additionally, this patch micro-optimizes wq_sleep by using the cheaper
    cousin of set_current_state(TASK_INTERRUPTABLE) as we will block no
    matter what, thus get rid of the implied barrier.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: George Spelvin
    Acked-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Chris Mason
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Manfred Spraul
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Steven Rostedt
    Cc: dave@stgolabs.net
    Link: http://lkml.kernel.org/r/1430748166.1940.17.camel@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

16 Apr, 2015

1 commit


20 Nov, 2014

1 commit

  • ... for situations when we don't have any candidate in pathnames - basically,
    in descriptor-based syscalls.

    [Folded the build fix for !CONFIG_AUDITSYSCALL configs from Chen Gang]

    Signed-off-by: Al Viro

    Al Viro
     

08 Apr, 2014

1 commit


26 Feb, 2014

1 commit

  • Commit 93e6f119c0ce ("ipc/mqueue: cleanup definition names and
    locations") added global hardcoded limits to the amount of message
    queues that can be created. While these limits are per-namespace,
    reality is that it ends up breaking userspace applications.
    Historically users have, at least in theory, been able to create up to
    INT_MAX queues, and limiting it to just 1024 is way too low and dramatic
    for some workloads and use cases. For instance, Madars reports:

    "This update imposes bad limits on our multi-process application. As
    our app uses approaches that each process opens its own set of queues
    (usually something about 3-5 queues per process). In some scenarios
    we might run up to 3000 processes or more (which of-course for linux
    is not a problem). Thus we might need up to 9000 queues or more. All
    processes run under one user."

    Other affected users can be found in launchpad bug #1155695:
    https://bugs.launchpad.net/ubuntu/+source/manpages/+bug/1155695

    Instead of increasing this limit, revert it entirely and fallback to the
    original way of dealing queue limits -- where once a user's resource
    limit is reached, and all memory is used, new queues cannot be created.

    Signed-off-by: Davidlohr Bueso
    Reported-by: Madars Vitolins
    Acked-by: Doug Ledford
    Cc: Manfred Spraul
    Cc: [3.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

28 Jan, 2014

2 commits

  • Deal with checkpatch messages:
    WARNING: braces {} are not necessary for single statement blocks

    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Cc: Rik van Riel
    Acked-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The ipc code does not adhere the typical linux coding style.
    This patch fixes lots of simple whitespace errors.

    - mostly autogenerated by
    scripts/checkpatch.pl -f --fix \
    --types=pointer_location,spacing,space_before_tab
    - one manual fixup (keep structure members tab-aligned)
    - removal of additional space_before_tab that were not found by --fix

    Tested with some of my msg and sem test apps.

    Andrew: Could you include it in -mm and move it towards Linus' tree?

    Signed-off-by: Manfred Spraul
    Suggested-by: Li Bin
    Cc: Joe Perches
    Acked-by: Rafael Aquini
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

09 Nov, 2013

1 commit

  • We need to break delegations on any operation that changes the set of
    links pointing to an inode. Start with unlink.

    Such operations also hold the i_mutex on a parent directory. Breaking a
    delegation may require waiting for a timeout (by default 90 seconds) in
    the case of a unresponsive NFS client. To avoid blocking all directory
    operations, we therefore drop locks before waiting for the delegation.
    The logic then looks like:

    acquire locks
    ...
    test for delegation; if found:
    take reference on inode
    release locks
    wait for delegation break
    drop reference on inode
    retry

    It is possible this could never terminate. (Even if we take precautions
    to prevent another delegation being acquired on the same inode, we could
    get a different inode on each retry.) But this seems very unlikely.

    The initial test for a delegation happens after the lock on the target
    inode is acquired, but the directory inode may have been acquired
    further up the call stack. We therefore add a "struct inode **"
    argument to any intervening functions, which we use to pass the inode
    back up to the caller in the case it needs a delegation synchronously
    broken.

    Cc: David Howells
    Cc: Tyler Hicks
    Cc: Dustin Kirkland
    Acked-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Al Viro

    J. Bruce Fields
     

10 Jul, 2013

1 commit

  • The old audit PATH records for mq_open looked like this:

    type=PATH msg=audit(1366282323.982:869): item=1 name=(null) inode=6777
    dev=00:0c mode=041777 ouid=0 ogid=0 rdev=00:00
    obj=system_u:object_r:tmpfs_t:s15:c0.c1023
    type=PATH msg=audit(1366282323.982:869): item=0 name="test_mq" inode=26732
    dev=00:0c mode=0100700 ouid=0 ogid=0 rdev=00:00
    obj=staff_u:object_r:user_tmpfs_t:s15:c0.c1023

    ...with the audit related changes that went into 3.7, they now look like this:

    type=PATH msg=audit(1366282236.776:3606): item=2 name=(null) inode=66655
    dev=00:0c mode=0100700 ouid=0 ogid=0 rdev=00:00
    obj=staff_u:object_r:user_tmpfs_t:s15:c0.c1023
    type=PATH msg=audit(1366282236.776:3606): item=1 name=(null) inode=6926
    dev=00:0c mode=041777 ouid=0 ogid=0 rdev=00:00
    obj=system_u:object_r:tmpfs_t:s15:c0.c1023
    type=PATH msg=audit(1366282236.776:3606): item=0 name="test_mq"

    Both of these look wrong to me. As Steve Grubb pointed out:

    "What we need is 1 PATH record that identifies the MQ. The other PATH
    records probably should not be there."

    Fix it to record the mq root as a parent, and flag it such that it
    should be hidden from view when the names are logged, since the root of
    the mq filesystem isn't terribly interesting. With this change, we get
    a single PATH record that looks more like this:

    type=PATH msg=audit(1368021604.836:484): item=0 name="test_mq" inode=16914
    dev=00:0c mode=0100644 ouid=0 ogid=0 rdev=00:00
    obj=unconfined_u:object_r:user_tmpfs_t:s0

    In order to do this, a new audit_inode_parent_hidden() function is
    added. If we do it this way, then we avoid having the existing callers
    of audit_inode needing to do any sort of flag conversion if auditing is
    inactive.

    Signed-off-by: Jeff Layton
    Reported-by: Jiri Jaburek
    Cc: Steve Grubb
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Layton
     

29 Mar, 2013

1 commit

  • Pull userns fixes from Eric W Biederman:
    "The bulk of the changes are fixing the worst consequences of the user
    namespace design oversight in not considering what happens when one
    namespace starts off as a clone of another namespace, as happens with
    the mount namespace.

    The rest of the changes are just plain bug fixes.

    Many thanks to Andy Lutomirski for pointing out many of these issues."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    userns: Restrict when proc and sysfs can be mounted
    ipc: Restrict mounting the mqueue filesystem
    vfs: Carefully propogate mounts across user namespaces
    vfs: Add a mount flag to lock read only bind mounts
    userns: Don't allow creation if the user is chrooted
    yama: Better permission check for ptraceme
    pid: Handle the exit of a multi-threaded init.
    scm: Require CAP_SYS_ADMIN over the current pidns to spoof pids.

    Linus Torvalds
     

27 Mar, 2013

1 commit

  • Only allow mounting the mqueue filesystem if the caller has CAP_SYS_ADMIN
    rights over the ipc namespace. The principle here is if you create
    or have capabilities over it you can mount it, otherwise you get to live
    with what other people have mounted.

    This information is not particularly sensitive and mqueue essentially
    only reports which posix messages queues exist. Still when creating a
    restricted environment for an application to live any extra
    information may be of use to someone with sufficient creativity. The
    historical if imperfect way this information has been restricted has
    been not to allow mounts and restricting this to ipc namespace
    creators maintains the spirit of the historical restriction.

    Cc: stable@vger.kernel.org
    Acked-by: Serge Hallyn
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

23 Mar, 2013

1 commit

  • mnt_drop_write() must be called only if mnt_want_write() succeeded,
    otherwise the mnt_writers counter will diverge.

    mnt_writers counters are used to check if remounting FS as read-only is
    OK, so after an extra mnt_drop_write() call, it would be impossible to
    remount mqueue FS as read-only. Besides, on umount a warning would be
    printed like this one:

    =====================================
    [ BUG: bad unlock balance detected! ]
    3.9.0-rc3 #5 Not tainted
    -------------------------------------
    a.out/12486 is trying to release lock (sb_writers) at:
    mnt_drop_write+0x1f/0x30
    but there are no more locks to release!

    Signed-off-by: Vladimir Davydov
    Cc: Doug Ledford
    Cc: KOSAKI Motohiro
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

23 Feb, 2013

1 commit


28 Jan, 2013

1 commit

  • This patch allow the unprivileged user to mount mqueuefs in
    user ns.

    If two userns share the same ipcns,the files in mqueue fs
    should be seen in both these two userns.

    If the userns has its own ipcns,it has its own mqueue fs too.
    ipcns has already done this job well.

    Signed-off-by: Gao feng
    Signed-off-by: Eric W. Biederman

    Gao feng
     

13 Oct, 2012

2 commits

  • Keep a pointer to the audit_names "slot" in struct filename.

    Have all of the audit_inode callers pass a struct filename ponter to
    audit_inode instead of a string pointer. If the aname field is already
    populated, then we can skip walking the list altogether and just use it
    directly.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     
  • getname() is intended to copy pathname strings from userspace into a
    kernel buffer. The result is just a string in kernel space. It would
    however be quite helpful to be able to attach some ancillary info to
    the string.

    For instance, we could attach some audit-related info to reduce the
    amount of audit-related processing needed. When auditing is enabled,
    we could also call getname() on the string more than once and not
    need to recopy it from userspace.

    This patchset converts the getname()/putname() interfaces to return
    a struct instead of a string. For now, the struct just tracks the
    string in kernel space and the original userland pointer for it.

    Later, we'll add other information to the struct as it becomes
    convenient.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     

12 Oct, 2012

1 commit

  • Currently, this gets set mostly by happenstance when we call into
    audit_inode_child. While that might be a little more efficient, it seems
    wrong. If the syscall ends up failing before audit_inode_child ever gets
    called, then you'll have an audit_names record that shows the full path
    but has the parent inode info attached.

    Fix this by passing in a parent flag when we call audit_inode that gets
    set to the value of LOOKUP_PARENT. We can then fix up the pathname for
    the audit entry correctly from the get-go.

    While we're at it, clean up the no-op macro for audit_inode in the
    !CONFIG_AUDITSYSCALL case.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     

09 Oct, 2012

1 commit

  • Commit d6629859b36d ("ipc/mqueue: improve performance of send/recv") and
    ce2d52cc ("ipc/mqueue: add rbtree node caching support") introduced an
    rbtree of message priorities, and usage of rb_init_node() to initialize
    the corresponding nodes. As it turns out, rb_init_node() is unnecessary
    here, as the nodes are fully initialized on insertion by rb_link_node()
    and the code doesn't access nodes that aren't inserted on the rbtree.

    Removing the rb_init_node() calls as I removed that function during
    rbtree API cleanups (the only other use of it was in a place that
    similarly didn't require it).

    Signed-off-by: Michel Lespinasse
    Acked-by: Doug Ledford
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

27 Sep, 2012

2 commits


19 Aug, 2012

1 commit


23 Jul, 2012

1 commit


14 Jul, 2012

2 commits


01 Jun, 2012

7 commits

  • When I wrote the first patch that added the rbtree support for message
    queue insertion, it sped up the case where the queue was very full
    drastically from the original code. It, however, slowed down the case
    where the queue was empty (not drastically though).

    This patch caches the last freed rbtree node struct so we can quickly
    reuse it when we get a new message. This is the common path for any queue
    that very frequently goes from 0 to 1 then back to 0 messages in queue.

    Andrew Morton didn't like that we were doing a GFP_ATOMIC allocation in
    msg_insert, so this patch attempts to speculatively allocate a new node
    struct outside of the spin lock when we know we need it, but will still
    fall back to a GFP_ATOMIC allocation if it has to.

    Once I added the caching, the necessary various ret = ; spin_unlock
    gyrations in mq_timedsend were getting pretty ugly, so this also slightly
    refactors that function to streamline the flow of the code and the
    function exit.

    Finally, while working on getting performance back I made sure that all of
    the node structs were always fully initialized when they were first used,
    rendering the use of kzalloc unnecessary and a waste of CPU cycles.

    The net result of all of this is:

    1) We will avoid a GFP_ATOMIC allocation when possible, but fall back
    on it when necessary.

    2) We will speculatively allocate a node struct using GFP_KERNEL if our
    cache is empty (and save the struct to our cache if it's still empty
    after we have obtained the spin lock).

    3) The performance of the common queue empty case has significantly
    improved and is now much more in line with the older performance for
    this case.

    The performance changes are:

    Old mqueue new mqueue new mqueue + caching
    queue empty
    send/recv 305/288ns 349/318ns 310/322ns

    I don't think we'll ever be able to get the recv performance back, but
    that's because the old recv performance was a direct result and
    consequence of the old methods abysmal send performance. The recv path
    simply must do more so that the send path does not incur such a penalty
    under higher queue depths.

    As it turns out, the new caching code also sped up the various queue full
    cases relative to my last patch. That could be because of the difference
    between the syscall path in 3.3.4-rc5 and 3.3.4-rc6, or because of the
    change in code flow in the mq_timedsend routine. Regardless, I'll take
    it. It wasn't huge, and I *would* say it was within the margin for error,
    but after many repeated runs what I'm seeing is that the old numbers trend
    slightly higher (about 10 to 20ns depending on which test is the one
    running).

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Doug Ledford
    Cc: Frederic Weisbecker
    Cc: Manfred Spraul
    Cc: Stephen Rothwell
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Ledford
     
  • We already check the mq attr struct if it's passed in, but now that the
    admin can set system wide defaults separate from maximums, it's actually
    possible to set the defaults to something that would overflow. So, if
    there is no attr struct passed in to the open call, check the default
    values.

    While we are at it, simplify mq_attr_ok() by making it return 0 or an
    error condition, so that way if we add more tests to it later, we have the
    option of what error should be returned instead of the calling location
    having to pick a possibly inaccurate error code.

    [akpm@linux-foundation.org: s/ENOMEM/EOVERFLOW/]
    Signed-off-by: Doug Ledford
    Cc: Stephen Rothwell
    Cc: Manfred Spraul
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Ledford
     
  • While working on the other parts of the mqueue stuff, I noticed that the
    calculation for overflow in mq_attr_ok didn't actually match reality (this
    is especially true since my last patch which changed how we account memory
    slightly).

    In particular, we used to test for overflow using:
    msgs * msgsize + msgs * sizeof(struct msg_msg *)

    That was never really correct because each message we allocate via
    load_msg() is actually a struct msg_msg followed by the data for the
    message (and if struct msg_msg + data exceeds PAGE_SIZE we end up
    allocating struct msg_msgseg structs too, but accounting for them would
    get really tedious, so let's ignore those...they're only a pointer in size
    anyway). This patch updates the calculation to be more accurate in
    regards to maximum possible memory consumption by the mqueue.

    [akpm@linux-foundation.org: add a local to simplify overflow-checking expression]
    Signed-off-by: Doug Ledford
    Cc: Stephen Rothwell
    Cc: Manfred Spraul
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Ledford
     
  • The existing implementation of the POSIX message queue send and recv
    functions is, well, abysmal. Even worse than abysmal. I submitted a
    patch to increase the maximum POSIX message queue limit to 65536 due to
    customer needs, however, upon looking over the send/recv implementation, I
    realized that my customer needs help with that too even if they don't know
    it. The basic problem is that, given the fairly typical use case scenario
    for a large queue of queueing lots of messages all at the same priority (I
    verified with my customer that this is indeed what their app does), the
    msg_insert routine is basically a frikkin' bubble sort. I mean, whoa,
    that's *so* middle school.

    OK, OK, to not slam the original author too much, I'm sure they didn't
    envision a queue depth of 50,000+ messages. No one would think that
    moving elements in an array, one at a time, and dereferencing each pointer
    in that array to check priority of the message being pointed too, again
    one at a time, for 50,000+ times would be good. So let's assume that, as
    is typical, the users have found a way to break our code simply by using
    it in a way we didn't envision. Fair enough.

    "So, just how broken is it?", you ask. I wondered the same thing, so I
    wrote an app to let me know. It's my next patch. It gave me some
    interesting results. Here's what it tested:

    Interference with other apps - In continuous mode, the app just sits there
    and hits a message queue forever, while you go do something productive on
    another terminal using other CPUs. You then measure how long it takes you
    to do that something productive. Then you restart the app in fake
    continuous mode, and it sits in a tight loop on a CPU while you repeat
    your tests. The whole point of this is to keep one CPU tied up (so it
    can't be used in your other work) but in one case tied up hitting the
    mqueue code so we can see the effect of walking that 65,528 element array
    one pointer at a time on the global CPU cache. If it's bad, then it will
    slow down your app on the other CPUs just by polluting cache mercilessly.
    In the fake case, it will be in a tight loop, but not polluting cache.
    Testing the mqueue subsystem directly - Here we just run a number of tests
    to see how the mqueue subsystem performs under different conditions. A
    couple conditions are known to be worst case for the old system, and some
    routines, so this tests all of them.

    So, on to the results already:

    Subsystem/Test Old New

    Time to compile linux
    kernel (make -j12 on a
    6 core CPU)
    Running mqueue test user 49m10.744s user 45m26.294s
    sys 5m51.924s sys 4m59.894s
    total 55m02.668s total 50m26.188s

    Running fake test user 45m32.686s user 45m18.552s
    sys 5m12.465s sys 4m56.468s
    total 50m45.151s total 50m15.020s

    % slowdown from mqueue
    cache thrashing ~8% ~.5%

    Avg time to send/recv (in nanoseconds per message)
    when queue empty 305/288 349/318
    when queue full (65528 messages)
    constant priority 526589/823 362/314
    increasing priority 403105/916 495/445
    decreasing priority 73420/594 482/409
    random priority 280147/920 546/436

    Time to fill/drain queue (65528 messages, in seconds)
    constant priority 17.37/.12 .13/.12
    increasing priority 4.14/.14 .21/.18
    decreasing priority 12.93/.13 .21/.18
    random priority 8.88/.16 .22/.17

    So, I think the results speak for themselves. It's possible this
    implementation could be improved by cacheing at least one priority level
    in the node tree (that would bring the queue empty performance more in
    line with the old implementation), but this works and is *so* much better
    than what we had, especially for the common case of a single priority in
    use, that further refinements can be in follow on patches.

    [akpm@linux-foundation.org: fix typo in comment, remove stray semicolon]
    [levinsasha928@gmail.com: use correct gfp flags in msg_insert]
    Signed-off-by: Doug Ledford
    Cc: Stephen Rothwell
    Cc: Manfred Spraul
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Ledford
     
  • Commit b231cca4381e ("message queues: increase range limits") changed
    mqueue default value when attr parameter is specified NULL from hard
    coded value to fs.mqueue.{msg,msgsize}_max sysctl value.

    This made large side effect. When user need to use two mqueue
    applications 1) using !NULL attr parameter and it require big message
    size and 2) using NULL attr parameter and only need small size message,
    app (1) require to raise fs.mqueue.msgsize_max and app (2) consume large
    memory size even though it doesn't need.

    Doug Ledford propsed to switch back it to static hard coded value.
    However it also has a compatibility problem. Some applications might
    started depend on the default value is tunable.

    The solution is to separate default value from maximum value.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Doug Ledford
    Acked-by: Doug Ledford
    Acked-by: Joe Korty
    Cc: Amerigo Wang
    Acked-by: Serge E. Hallyn
    Cc: Jiri Slaby
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • KMALLOC_MAX_SIZE is not a good threshold. It is extremely high and
    problematic. Unfortunately, some silly drivers depend on this and we
    can't change it. But any new code needn't use such extreme ugly high
    order allocations. It brings us awful fragmentation issues and system
    slowdown.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Doug Ledford
    Acked-by: Joe Korty
    Cc: Amerigo Wang
    Cc: Serge E. Hallyn
    Cc: Jiri Slaby
    Cc: Joe Korty
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Commit b231cca4381e ("message queues: increase range limits") changed the
    maximum size of a message in a message queue from INT_MAX to 8192*128.
    Unfortunately, we had customers that relied on a size much larger than
    8192*128 on their production systems. After reviewing POSIX, we found
    that it is silent on the maximum message size. We did find a couple other
    areas in which it was not silent. Fix up the mqueue maximums so that the
    customer's system can continue to work, and document both the POSIX and
    real world requirements in ipc_namespace.h so that we don't have this
    issue crop back up.

    Also, commit 9cf18e1dd74cd0 ("ipc: HARD_MSGMAX should be higher not lower
    on 64bit") fiddled with HARD_MSGMAX without realizing that the number was
    intentionally in place to limit the msg queue depth to one that was small
    enough to kmalloc an array of pointers (hence why we divided 128k by
    sizeof(long)). If we wish to meet POSIX requirements, we have no choice
    but to change our allocation to a vmalloc instead (at least for the large
    queue size case). With that, it's possible to increase our allowed
    maximum to the POSIX requirements (or more if we choose).

    [sfr@canb.auug.org.au: using vmalloc requires including vmalloc.h]
    Signed-off-by: Doug Ledford
    Cc: Serge E. Hallyn
    Cc: Amerigo Wang
    Cc: Joe Korty
    Cc: Jiri Slaby
    Acked-by: KOSAKI Motohiro
    Cc: Manfred Spraul
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Ledford