09 Sep, 2017

1 commit

  • The refcount_t type and its corresponding API should be used instead of
    atomic_t when the variable is used as a reference counter. This helps
    avoid accidental reference-counter overflows that might lead to
    use-after-free situations.

    Link: http://lkml.kernel.org/r/1499417992-3238-2-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Cc: Peter Zijlstra
    Cc: Greg Kroah-Hartman
    Cc: "Eric W. Biederman"
    Cc: Ingo Molnar
    Cc: Alexey Dobriyan
    Cc: Serge Hallyn
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Elena Reshetova
     

28 Oct, 2016

1 commit

  • When kmem accounting switched from accounting by default to accounting
    only when flagged by __GFP_ACCOUNT, IPC mqueue and messages were left out.

    The production use case at hand is that mqueues should be customizable
    via sysctls in Docker containers in a Kubernetes cluster. This can only
    be safely allowed to the users of the cluster (without the risk that
    they can cause resource shortage on a node, influencing other users'
    containers) if all resources they control are bounded, i.e. accounted
    for.

    Link: http://lkml.kernel.org/r/1476806075-1210-1-git-send-email-arozansk@redhat.com
    Signed-off-by: Aristeu Rozanski
    Reported-by: Stefan Schimanski
    Acked-by: Davidlohr Bueso
    Cc: Alexey Dobriyan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Stefan Schimanski
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aristeu Rozanski
     

03 Aug, 2016

1 commit


07 Nov, 2015

1 commit

  • d0edd8528362 ("ipc: convert invalid scenarios to use WARN_ON") relaxed the
    nil dst parameter check, which was originally a full BUG_ON. However, this
    check seems quite unnecessary when its only purpose is
    checkpoint/restore (MSG_COPY flag):

    o The copy variable is initially set to nil, apparently as a way of
    ensuring that prepare_copy has been called beforehand, which is in fact
    done unconditionally at the beginning of do_msgrcv.

    o There is no concurrency with 'copy' (stack allocated in do_msgrcv).

    Furthermore, any errors in 'copy' (and thus prepare_copy/copy_msg) should
    always be handled by the IS_ERR() family. Therefore remove this check
    altogether, as it can never occur with the current users.

    Signed-off-by: Davidlohr Bueso
    Cc: Stanislav Kinsbursky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

11 Sep, 2015

1 commit

  • Considering Linus' past rants about the (ab)use of BUG in the kernel, I
    took a look at how we deal with such calls in ipc. Given that any errors
    or corruption in ipc code are most likely contained within the set of
    processes participating in the broken mechanisms, there aren't really many
    strong fatal system failure scenarios that would require a BUG call.
    Also, if something is seriously wrong, ipc might not be the place for such
    a BUG either.

    1. For example, recently, a customer hit one of these BUG_ONs in shm
    after failing shm_lock(). A busted ID imho does not merit a BUG_ON,
    and WARN would have been better.

    2. MSG_COPY functionality of posix msgrcv(2) for checkpoint/restore.
    I don't see how we can hit this anyway -- at least it should be IS_ERR.
    The 'copy' arg from do_msgrcv is always set by calling prepare_copy()
    first and foremost. We could also probably drop this check altogether.
    Either way, it does not merit a BUG_ON.

    3. No ->fault() callback for the fs getting the corresponding page --
    seems selfish to make the system unusable.

    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

05 Dec, 2014

2 commits


13 Nov, 2013

1 commit

  • On 64 bit systems the test for negative message sizes is bogus as the
    size, which may be positive when evaluated as a long, will get truncated
    to an int when passed to load_msg(). So a long might very well contain a
    positive value but when truncated to an int it would become negative.

    That, in combination with a small negative value of msg_ctlmax (which will
    be promoted to an unsigned type for the comparison against msgsz, making
    it a big positive value and therefore letting it pass the check), will lead
    to two problems: 1/ The kmalloc() call in alloc_msg() will allocate a
    buffer that is too small, as the addition of alen is effectively a
    subtraction. 2/ The copy_from_user() call in load_msg() will first
    overflow the buffer with userland data and then, when the userland access
    generates an access violation, the fixup handler copy_user_handle_tail()
    will try to fill the remainder with zeros -- roughly 4GB. That almost
    instantly results in a system crash or reset.

    ,-[ Reproducer (needs to be run as root) ]--
    | #include <sys/stat.h>
    | #include <sys/msg.h>
    | #include <unistd.h>
    | #include <fcntl.h>
    |
    | int main(void) {
    |     long msg = 1;
    |     int fd;
    |
    |     fd = open("/proc/sys/kernel/msgmax", O_WRONLY);
    |     write(fd, "-1", 2);
    |     close(fd);
    |
    |     msgsnd(0, &msg, 0xfffffff0, IPC_NOWAIT);
    |
    |     return 0;
    | }
    '---

    Fix the issue by preventing msgsz from getting truncated by consistently
    using size_t for the message length. This way the size checks in
    do_msgsnd() could still be passed with a negative value for msg_ctlmax but
    we would fail on the buffer allocation in that case and error out.

    Also change the type of m_ts from int to size_t to avoid similar nastiness
    in other code paths -- it is used in similar constructs, i.e. signed vs.
    unsigned checks. It should never become negative under normal
    circumstances, though.

    Setting msg_ctlmax to a negative value is an odd configuration and should
    be prevented. As that might break existing userland, it will be handled
    in a separate commit so it could easily be reverted and reworked without
    reintroducing the above described bug.

    Hardening mechanisms for user copy operations would have caught that bug
    early -- e.g. checking slab object sizes on user copy operations as the
    usercopy feature of the PaX patch does. Or, for that matter, detecting the
    long vs. int sign change due to truncation, as the size overflow plugin
    of the very same patch does.

    [akpm@linux-foundation.org: fix i386 min() warnings]
    Signed-off-by: Mathias Krause
    Cc: Pax Team
    Cc: Davidlohr Bueso
    Cc: Brad Spengler
    Cc: Manfred Spraul
    Cc: [ v2.3.27+ -- yes, that old ;) ]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     

02 May, 2013

2 commits

  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     
  • Split the proc namespace stuff out into linux/proc_ns.h.

    Signed-off-by: David Howells
    cc: netdev@vger.kernel.org
    cc: Serge E. Hallyn
    cc: Eric W. Biederman
    Signed-off-by: Al Viro

    David Howells
     

01 May, 2013

5 commits


09 Mar, 2013

1 commit


05 Jan, 2013

2 commits

  • Remove the redundant and confusing fill_copy(). Also check the copy_msg()
    result for errors. In that case the function has to return rather than
    break, because the code that follows interprets any error as EAGAIN.

    Also define copy_msg() for the case when CONFIG_CHECKPOINT_RESTORE is
    disabled.

    Signed-off-by: Stanislav Kinsbursky
    Cc: "Eric W. Biederman"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsbursky
     
  • This patch is required for checkpoint/restore in userspace.

    c/r requires some way to get all pending IPC messages without deleting
    them from the queue (checkpoint can fail, in which case the tasks will be
    resumed, so the queue has to remain valid).

    To achieve this, a new operation flag MSG_COPY for the sys_msgrcv() system
    call was introduced. If this flag is specified, then mtype is interpreted
    as the number of the message to copy.

    If MSG_COPY is set, the kernel will allocate a dummy message of the passed
    size, and then use the new copy_msg() helper function to copy the desired
    message (instead of unlinking it from the queue).

    Notes:

    1) Return -ENOSYS if MSG_COPY is specified, but
    CONFIG_CHECKPOINT_RESTORE is not set.

    Signed-off-by: Stanislav Kinsbursky
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Al Viro
    Cc: KOSAKI Motohiro
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsbursky
     

20 Nov, 2012

1 commit

  • Assign a unique proc inode to each namespace, and use that
    inode number to ensure we only allocate at most one proc
    inode for every namespace in proc.

    A single proc inode per namespace allows userspace to test
    to see if two processes are in the same namespace.

    This has been a long requested feature and only blocked because
    a naive implementation would put the id in a global space and
    would ultimately require having a namespace for the names of
    namespaces, making migration and certain virtualization tricks
    impossible.

    We still don't have per superblock inode numbers for proc, which
    appears necessary for application unaware checkpoint/restart and
    migrations (if the application is using namespace file descriptors)
    but that is now allowed by the design if it becomes important.

    I have preallocated the ipc and uts initial proc inode numbers so
    their structures can be statically initialized.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

14 Feb, 2012

1 commit


09 Dec, 2011

1 commit


24 Mar, 2011

1 commit

  • Changelog:
    Feb 15: Don't set new ipc->user_ns if we didn't create a new
    ipc_ns.
    Feb 23: Move extern declaration to ipc_namespace.h, and group
    fwd declarations at top.

    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Acked-by: Daniel Lezcano
    Acked-by: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

07 Apr, 2009

2 commits

  • Implement multiple mounts of the mqueue file system, and link it to usage
    of CLONE_NEWIPC.

    Each ipc ns has a corresponding mqueuefs superblock. When a user does
    clone(CLONE_NEWIPC) or unshare(CLONE_NEWIPC), the unshare will cause an
    internal mount of a new mqueuefs sb linked to the new ipc ns.

    When a user does 'mount -t mqueue mqueue /dev/mqueue', he mounts the
    mqueuefs superblock.

    Posix message queues can be worked with both through the mq_* system calls
    (see mq_overview(7)), and through the VFS through the mqueue mount. Any
    usage of mq_open() and friends will work with the acting task's ipc
    namespace. Any actions through the VFS will work with the mqueuefs in
    which the file was created. So if a user doesn't remount mqueuefs after
    unshare(CLONE_NEWIPC), mq_open("/ab") will not be reflected in "ls
    /dev/mqueue".

    If task a mounts mqueue for ipc_ns:1, then clones task b with a new ipcns,
    ipcns:2, and then task a is the last task in ipc_ns:1 to exit, then (1)
    ipc_ns:1 will be freed, (2) its superblock will live on until task b
    umounts the corresponding mqueuefs, and vfs actions will continue to
    succeed, but (3) sb->s_fs_info will be NULL for the sb corresponding to
    the deceased ipc_ns:1.

    To make this happen, we must protect the ipc reference count when

    a) a task exits and drops its ipcns->count, since it might be dropping
    it to 0 and freeing the ipcns

    b) a task accesses the ipcns through its mqueuefs interface, since it
    bumps the ipcns refcount and might race with the last task in the ipcns
    exiting.

    So the kref is changed to an atomic_t so we can use
    atomic_dec_and_lock(&ns->count,mq_lock), and every access to the ipcns
    through ns = mqueuefs_sb->s_fs_info is protected by the same lock.

    Signed-off-by: Cedric Le Goater
    Signed-off-by: Serge E. Hallyn
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • Move mqueue vfsmount plus a few tunables into the ipc_namespace struct.
    The CONFIG_IPC_NS boolean and the ipc_namespace struct will serve both the
    posix message queue namespaces and the SYSV ipc namespaces.

    The sysctl code will be fixed separately in patch 3. After just this
    patch, making a change to posix mqueue tunables always changes the values
    in the initial ipc namespace.

    Signed-off-by: Cedric Le Goater
    Signed-off-by: Serge E. Hallyn
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

14 Dec, 2006

1 commit

  • Run this:

    #!/bin/sh
    for f in $(grep -Erl "\([^\)]*\) *k[cmz]alloc" *) ; do
    echo "De-casting $f..."
    perl -pi -e "s/ ?= ?\([^\)]*\) *(k[cmz]alloc) *\(/ = \1\(/" $f
    done

    And then go through and reinstate those cases where code is casting pointers
    to non-pointers.

    And then drop a few hunks which conflicted with outstanding work.

    Cc: Russell King , Ian Molton
    Cc: Mikael Starvik
    Cc: Yoshinori Sato
    Cc: Roman Zippel
    Cc: Geert Uytterhoeven
    Cc: Ralf Baechle
    Cc: Paul Mackerras
    Cc: Kyle McMartin
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Greg KH
    Cc: Jens Axboe
    Cc: Paul Fulghum
    Cc: Alan Cox
    Cc: Karsten Keil
    Cc: Mauro Carvalho Chehab
    Cc: Jeff Garzik
    Cc: James Bottomley
    Cc: Ian Kent
    Cc: Steven French
    Cc: David Woodhouse
    Cc: Neil Brown
    Cc: Jaroslav Kysela
    Cc: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     

04 Oct, 2006

1 commit


17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds