15 Jan, 2016

1 commit

  • Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

    - threadinfo
    - task_struct
    - task_delay_info
    - pid
    - cred
    - mm_struct
    - vm_area_struct and vm_region (nommu)
    - anon_vma and anon_vma_chain
    - signal_struct
    - sighand_struct
    - fs_struct
    - files_struct
    - fdtable and fdtable->full_fds_bits
    - dentry and external_name
    - inode for all filesystems. This is the most tedious part, because
      most filesystems override the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to the "account everything" approach and
    keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

07 Nov, 2015

1 commit

  • d0edd8528362 ("ipc: convert invalid scenarios to use WARN_ON") relaxed the
    nil dst parameter check, originally being a full BUG_ON. However, this
    check seems quite unnecessary when the only purpose is for
    checkpoint/restore (MSG_COPY flag):

    o The copy variable is set initially to nil, apparently as a way of
    ensuring that prepare_copy is previously called. Which is in fact done,
    unconditionally at the beginning of do_msgrcv.

    o There is no concurrency with 'copy' (stack allocated in do_msgrcv).

    Furthermore, any errors in 'copy' (and thus prepare_copy/copy_msg) should
    always be handled by the IS_ERR() family. Therefore remove this check
    altogether as it can never occur with the current users.

    Signed-off-by: Davidlohr Bueso
    Cc: Stanislav Kinsbursky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
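
The argument above rests on the kernel's error-pointer convention: a function that returns a pointer encodes a negative errno in the pointer value itself, so callers test the result with IS_ERR() instead of comparing it against NULL. The following is a minimal userspace sketch of that convention, not kernel code. The lower-case helpers are re-implemented here for illustration, and `prepare_copy_model()` and `use_copy()` are hypothetical stand-ins rather than the functions in ipc/msg.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace re-implementation of the error-pointer idea (illustration only). */
#define MAX_ERRNO 4095

static inline void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

static inline long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int is_err(const void *ptr)
{
    /* The top MAX_ERRNO addresses are reserved for encoded errors. */
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for a helper that returns a buffer or an error. */
void *prepare_copy_model(size_t bufsz)
{
    void *copy;

    if (bufsz == 0)
        return err_ptr(-EINVAL);
    copy = malloc(bufsz);
    if (!copy)
        return err_ptr(-ENOMEM);
    return copy;
}

/* Caller: one is_err() test covers every failure, so no NULL check is needed. */
long use_copy(size_t bufsz)
{
    void *copy = prepare_copy_model(bufsz);

    if (is_err(copy))
        return ptr_err(copy);
    assert(copy != NULL);   /* holds by construction in this model */
    free(copy);
    return 0;
}
```

In this model `use_copy(16)` returns 0 and `use_copy(0)` returns `-EINVAL`; the helper never hands back NULL, which is why a separate NULL check adds nothing.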
     

01 Oct, 2015

1 commit

  • As reported by Dmitry Vyukov, we really shouldn't do ipc_addid() before
    having initialized the IPC object state. Yes, we initialize the IPC
    object in a locked state, but with all the lockless RCU lookup work,
    that IPC object lock no longer means that the state cannot be seen.

    We already did this for the IPC semaphore code (see commit e8577d1f0329:
    "ipc/sem.c: fully initialize sem_array before making it visible") but we
    clearly forgot about msg and shm.

    Reported-by: Dmitry Vyukov
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
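
The ordering rule described above (initialize first, make visible last) can be sketched in userspace with C11 atomics. This is only an analogy under stated assumptions: `struct ipc_obj`, `published`, `create_obj()` and `read_nitems()` are hypothetical names, and a single atomic pointer stands in for the ID table that lockless readers consult.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical object; the fields are illustrative, not the kernel's. */
struct ipc_obj {
    int key;
    int nitems;
};

/* Stand-in for the ID table that lockless readers consult. */
static _Atomic(struct ipc_obj *) published;

/* Writer: fill in every field first, make the object visible last. */
void create_obj(struct ipc_obj *obj, int key, int nitems)
{
    assert(obj != NULL);
    obj->key = key;
    obj->nitems = nitems;
    /* Release store: the initialization above is ordered before the publish. */
    atomic_store_explicit(&published, obj, memory_order_release);
}

/* Lockless reader: returns nitems, or -1 if nothing has been published. */
int read_nitems(void)
{
    struct ipc_obj *obj =
        atomic_load_explicit(&published, memory_order_acquire);

    if (obj == NULL)
        return -1;
    return obj->nitems;
}
```

If the publish step came before the field assignments, a reader could observe the pointer and still read uninitialized fields, which is the class of bug the commit addresses.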
     

11 Sep, 2015

1 commit

  • Considering Linus' past rants about the (ab)use of BUG in the kernel, I
    took a look at how we deal with such calls in ipc. Given that any errors
    or corruption in ipc code are most likely contained within the set of
    processes participating in the broken mechanisms, there aren't really many
    strong fatal system failure scenarios that would require a BUG call.
    Also, if something is seriously wrong, ipc might not be the place for such
    a BUG either.

    1. For example, recently, a customer hit one of these BUG_ONs in shm
    after failing shm_lock(). A busted ID imho does not merit a BUG_ON,
    and WARN would have been better.

    2. MSG_COPY functionality of posix msgrcv(2) for checkpoint/restore.
    I don't see how we can hit this anyway -- at least it should be IS_ERR.
    The 'copy' arg from do_msgrcv is always set by calling prepare_copy()
    first and foremost. We could also probably drop this check altogether.
    Either way, it does not merit a BUG_ON.

    3. No ->fault() callback for the fs getting the corresponding page --
    seems selfish to make the system unusable.

    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

15 Aug, 2015

3 commits

  • sem_lock() did not properly pair memory barriers:

    !spin_is_locked() and spin_unlock_wait() are both only control barriers.
    The code needs an acquire barrier, otherwise the cpu might perform read
    operations before the lock test.

    As no suitable primitive exists and since it seems no one wants another
    primitive, the code creates a local primitive within ipc/sem.c.

    With regards to -stable:

    The change of sem_wait_array() is a bugfix, the change to sem_lock() is a
    nop (just a preprocessor redefinition to improve the readability). The
    bugfix is necessary for all kernels that use sem_wait_array() (i.e.:
    starting from 3.10).

    Signed-off-by: Manfred Spraul
    Reported-by: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: "Paul E. McKenney"
    Cc: Kirill Tkhai
    Cc: Ingo Molnar
    Cc: Josh Poimboeuf
    Cc: Davidlohr Bueso
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
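
The need for an acquire barrier after a bare lock test can be illustrated with C11 atomics in userspace. This is a sketch of the ordering idea only, not the primitive added to ipc/sem.c: `global_locked`, `complex_count`, `holder_unlock()` and `fast_path_allowed()` are hypothetical names, and an atomic flag stands in for the spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: a lock word and data that the lock holder updates. */
static atomic_bool global_locked;
static atomic_int complex_count;

/* Holder side: change the data, then drop the lock with release ordering. */
void holder_unlock(int new_count)
{
    assert(new_count >= 0);
    atomic_store_explicit(&complex_count, new_count, memory_order_relaxed);
    atomic_store_explicit(&global_locked, false, memory_order_release);
}

/*
 * Observer side: test the lock without taking it. The relaxed load alone
 * does not order the later data read, so an acquire fence is placed between
 * the lock test and the read of complex_count.
 */
bool fast_path_allowed(void)
{
    if (atomic_load_explicit(&global_locked, memory_order_relaxed))
        return false;   /* still held: the caller must use the slow path */
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&complex_count, memory_order_relaxed) == 0;
}
```

Without the fence, the read of `complex_count` could be satisfied before the lock test, so the observer might act on a stale value even though it saw the lock as free.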
     
  • After we acquire the sma->sem_perm lock in exit_sem(), we are protected
    against a racing IPC_RMID operation. Also at that point, we are the last
    user of sem_undo_list. Therefore it isn't required that we acquire or use
    ulp->lock.

    Signed-off-by: Herton R. Krzesinski
    Acked-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Rafael Aquini
    CC: Aristeu Rozanski
    Cc: David Jeffery
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Herton R. Krzesinski
     
  • The current semaphore code allows a potential use after free: in
    exit_sem we may free the task's sem_undo_list while there is still
    another task looping through the same semaphore set and cleaning the
    sem_undo list at freeary function (the task called IPC_RMID for the same
    semaphore set).

    For example, with a test program [1] running which keeps forking a lot
    of processes (which then do a semop call with SEM_UNDO flag), and with
    the parent right after removing the semaphore set with IPC_RMID, and a
    kernel built with CONFIG_SLAB, CONFIG_SLAB_DEBUG and
    CONFIG_DEBUG_SPINLOCK, you can easily see something like the following
    in the kernel log:

    Slab corruption (Not tainted): kmalloc-64 start=ffff88003b45c1c0, len=64
    000: 6b 6b 6b 6b 6b 6b 6b 6b 00 6b 6b 6b 6b 6b 6b 6b kkkkkkkk.kkkkkkk
    010: ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff ....kkkk........
    Prev obj: start=ffff88003b45c180, len=64
    000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ
    010: ff ff ff ff ff ff ff ff c0 fb 01 37 00 88 ff ff ...........7....
    Next obj: start=ffff88003b45c200, len=64
    000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ
    010: ff ff ff ff ff ff ff ff 68 29 a7 3c 00 88 ff ff ........h).
    ...
    8b 84 24 88 03 00 00 49 8d 8c 24 60 05 00 00 8b 53 04 48 89
    RIP [] spin_dump+0x53/0xc0
    RSP
    ---[ end trace 783ebb76612867a0 ]---
    NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [test:18053]
    Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
    CPU: 3 PID: 18053 Comm: test Tainted: G D 4.2.0-rc5+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
    RIP: native_read_tsc+0x0/0x20
    Call Trace:
    ? delay_tsc+0x40/0x70
    __delay+0xf/0x20
    do_raw_spin_lock+0x96/0x140
    _raw_spin_lock+0xe/0x10
    sem_lock_and_putref+0x11/0x70
    SYSC_semtimedop+0x7bf/0x960
    ? handle_mm_fault+0xbf6/0x1880
    ? dequeue_task_fair+0x79/0x4a0
    ? __do_page_fault+0x19a/0x430
    ? kfree_debugcheck+0x16/0x40
    ? __do_page_fault+0x19a/0x430
    ? __audit_syscall_entry+0xa8/0x100
    ? do_audit_syscall_entry+0x66/0x70
    ? syscall_trace_enter_phase1+0x139/0x160
    SyS_semtimedop+0xe/0x10
    SyS_semop+0x10/0x20
    entry_SYSCALL_64_fastpath+0x12/0x71
    Code: 47 10 83 e8 01 85 c0 89 47 10 75 08 65 48 89 3d 1f 74 ff 7e c9 c3 0f 1f 44 00 00 55 48 89 e5 e8 87 17 04 00 66 90 c9 c3 0f 1f 00 48 89 e5 0f 31 89 c1 48 89 d0 48 c1 e0 20 89 c9 48 09 c8 c9
    Kernel panic - not syncing: softlockup: hung tasks

    I wasn't able to trigger any badness on a recent kernel without the
    proper config debugs enabled, however I have softlockup reports on some
    kernel versions, in the semaphore code, which are similar to the above (the
    scenario is seen on some servers running IBM DB2 which uses semaphore
    syscalls).

    The patch here fixes the race against freeary, by acquiring or waiting
    on the sem_undo_list lock as necessary (exit_sem can race with freeary,
    while freeary sets un->semid to -1 and removes the same sem_undo from
    list_proc or when it removes the last sem_undo).

    After the patch I'm unable to reproduce the problem using the test case
    [1].

    [1] Test case used below:

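    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <errno.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>
    #include <sys/wait.h>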

    #define NSEM 1
    #define NSET 5

    int sid[NSET];

    void thread()
    {
        struct sembuf op;
        int s;
        uid_t pid = getuid();

        s = rand() % NSET;
        op.sem_num = pid % NSEM;
        op.sem_op = 1;
        op.sem_flg = SEM_UNDO;

        semop(sid[s], &op, 1);
        exit(EXIT_SUCCESS);
    }

    void create_set()
    {
        int i, j;
        pid_t p;
        union {
            int val;
            struct semid_ds *buf;
            unsigned short int *array;
            struct seminfo *__buf;
        } un;

        /* Create and initialize semaphore set */
        for (i = 0; i < NSET; i++) {
            sid[i] = semget(IPC_PRIVATE, NSEM, 0644 | IPC_CREAT);
            if (sid[i] < 0) {
                perror("semget");
                exit(EXIT_FAILURE);
            }
        }
        un.val = 0;
        for (i = 0; i < NSET; i++) {
            for (j = 0; j < NSEM; j++) {
                if (semctl(sid[i], j, SETVAL, un) < 0)
                    perror("semctl");
            }
        }

        /* Launch threads that operate on semaphore set */
        for (i = 0; i < NSEM * NSET * NSET; i++) {
            p = fork();
            if (p < 0)
                perror("fork");
            if (p == 0)
                thread();
        }

        /* Free semaphore set */
        for (i = 0; i < NSET; i++) {
            if (semctl(sid[i], NSEM, IPC_RMID))
                perror("IPC_RMID");
        }

        /* Wait for forked processes to exit */
        while (wait(NULL)) {
            if (errno == ECHILD)
                break;
        };
    }

    int main(int argc, char **argv)
    {
        pid_t p;

        srand(time(NULL));

        while (1) {
            p = fork();
            if (p < 0) {
                perror("fork");
                exit(EXIT_FAILURE);
            }
            if (p == 0) {
                create_set();
                goto end;
            }

            /* Wait for forked processes to exit */
            while (wait(NULL)) {
                if (errno == ECHILD)
                    break;
            };
        }
    end:
        return 0;
    }

    [akpm@linux-foundation.org: use normal comment layout]
    Signed-off-by: Herton R. Krzesinski
    Acked-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Rafael Aquini
    CC: Aristeu Rozanski
    Cc: David Jeffery
    Cc:
    Signed-off-by: Andrew Morton

    Signed-off-by: Linus Torvalds

    Herton R. Krzesinski
     

07 Aug, 2015

2 commits

  • The shm implementation internally uses shmem or hugetlbfs inodes for shm
    segments. As these inodes are never directly exposed to userspace and
    only accessed through the shm operations which are already hooked by
    security modules, mark the inodes with the S_PRIVATE flag so that inode
    security initialization and permission checking is skipped.

    This was motivated by the following lockdep warning:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G W
    -------------------------------------------------------
    httpd/1597 is trying to acquire lock:
    (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
    but task is already holding lock:
    (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #3 (&mm->mmap_sem){++++++}:
    lock_acquire+0xc7/0x270
    __might_fault+0x7a/0xa0
    filldir+0x9e/0x130
    xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
    xfs_readdir+0x1b4/0x330 [xfs]
    xfs_file_readdir+0x2b/0x30 [xfs]
    iterate_dir+0x97/0x130
    SyS_getdents+0x91/0x120
    entry_SYSCALL_64_fastpath+0x12/0x76
    -> #2 (&xfs_dir_ilock_class){++++.+}:
    lock_acquire+0xc7/0x270
    down_read_nested+0x57/0xa0
    xfs_ilock+0x167/0x350 [xfs]
    xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
    xfs_attr_get+0xbd/0x190 [xfs]
    xfs_xattr_get+0x3d/0x70 [xfs]
    generic_getxattr+0x4f/0x70
    inode_doinit_with_dentry+0x162/0x670
    sb_finish_set_opts+0xd9/0x230
    selinux_set_mnt_opts+0x35c/0x660
    superblock_doinit+0x77/0xf0
    delayed_superblock_init+0x10/0x20
    iterate_supers+0xb3/0x110
    selinux_complete_init+0x2f/0x40
    security_load_policy+0x103/0x600
    sel_write_load+0xc1/0x750
    __vfs_write+0x37/0x100
    vfs_write+0xa9/0x1a0
    SyS_write+0x58/0xd0
    entry_SYSCALL_64_fastpath+0x12/0x76
    ...

    Signed-off-by: Stephen Smalley
    Reported-by: Morten Stevens
    Acked-by: Hugh Dickins
    Acked-by: Paul Moore
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Prarit Bhargava
    Cc: Eric Paris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Smalley
     
  • A while back, the message queue implementation in the kernel was
    improved to use btrees to speed up retrieval of messages, in commit
    d6629859b36d ("ipc/mqueue: improve performance of send/recv").

    That patch introducing the improved kernel handling of message queues
    (using btrees) has, as a by-product, changed the meaning of the QSIZE
    field in the pseudo-file created for the queue. Before, this field
    reflected the size of the user-data in the queue. Since then, it also takes
    kernel data structures into account. For example, if 13 bytes of user
    data are in the queue, on my machine the file reports a size of 61
    bytes.

    There was some discussion on this topic before (for example
    https://lkml.org/lkml/2014/10/1/115). Commenting on a thread on lkml, Michael
    Kerrisk gave the following background
    (https://lkml.org/lkml/2015/6/16/74):

    The pseudofiles in the mqueue filesystem (usually mounted at
    /dev/mqueue) expose fields with metadata describing a message
    queue. One of these fields, QSIZE, as originally implemented,
    showed the total number of bytes of user data in all messages in
    the message queue, and this feature was documented from the
    beginning in the mq_overview(7) page. In 3.5, some other (useful)
    work happened to break the user-space API in a couple of places,
    including the value exposed via QSIZE, which now includes a measure
    of kernel overhead bytes for the queue, a figure that renders QSIZE
    useless for its original purpose, since there's no way to deduce
    the number of overhead bytes consumed by the implementation.
    (The other user-space breakage was subsequently fixed.)

    This patch removes the accounting of kernel data structures in the
    queue. Reporting the size of these data-structures in the QSIZE field
    was a breaking change (see Michael's comment above). Without the QSIZE
    field reporting the total size of user-data in the queue, there is no
    way to deduce this number.

    It should be noted that the resource limit RLIMIT_MSGQUEUE is counted
    against the worst-case size of the queue (in both the old and the new
    implementation). Therefore, the kernel overhead accounting in QSIZE is
    not necessary to help the user understand the limitations RLIMIT imposes
    on the processes.

    Signed-off-by: Marcus Gelderie
    Acked-by: Doug Ledford
    Acked-by: Michael Kerrisk
    Acked-by: Davidlohr Bueso
    Cc: David Howells
    Cc: Alexander Viro
    Cc: John Duffy
    Cc: Arto Bendiken
    Cc: Manfred Spraul
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marcus Gelderie
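
For readers who want to check the QSIZE value from userspace, the sketch below parses it out of a queue's pseudofile. It assumes the line layout documented in mq_overview(7), which begins with `QSIZE:<bytes>` and continues with the NOTIFY, SIGNO and NOTIFY_PID fields. `parse_qsize()` and `read_qsize()` are hypothetical helper names, and the path passed in (for example a file under /dev/mqueue) is the caller's choice.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse the QSIZE field from one line of a queue's pseudofile.
 * Returns 0 on success and stores the value in *qsize, -1 on parse failure.
 */
int parse_qsize(const char *line, unsigned long *qsize)
{
    assert(line != NULL && qsize != NULL);
    if (sscanf(line, "QSIZE:%lu", qsize) != 1)
        return -1;
    return 0;
}

/* Read the first line of the pseudofile at path and extract QSIZE. */
int read_qsize(const char *path, unsigned long *qsize)
{
    char line[256];
    FILE *f = fopen(path, "r");
    int ret = -1;

    if (f == NULL)
        return -1;
    if (fgets(line, sizeof(line), f) != NULL)
        ret = parse_qsize(line, qsize);
    fclose(f);
    return ret;
}
```

With the accounting change described above, a queue holding 13 bytes of user data would be expected to yield 13 from `read_qsize()` rather than a larger figure that includes kernel overhead.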
     

01 Jul, 2015

6 commits

  • In ipc_obtain_object_check we return -EIDRM when a bogus sequence number
    is detected via ipc_checkid, while the ipc manpages state the following
    return codes for such errors:

    EIDRM points to a removed identifier.
    EINVAL Invalid value, or unaligned, etc.

    EIDRM should only be returned upon a RMID call (->deleted check), and thus
    return EINVAL for wrong seq. This difference in semantics has also caused
    real bugs, ie: https://bugzilla.redhat.com/show_bug.cgi?id=246509

    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The ipc_lock helper is used by all forms of sysv ipc to acquire the ipc
    object's spinlock. Upon error (bogus identifier), we always return
    -EINVAL, whether the problem be in the idr path or because we raced with a
    task performing RMID. For the latter, however, all ipc related manpages
    state that:

    EIDRM points to a removed identifier.

    And return:

    EINVAL Invalid value, or unaligned, etc.

    EIDRM should therefore only be returned once the ipc resource is deleted.
    For all types of ipc this is done immediately upon a RMID command. However,
    shared memory behaves slightly differently as it can merely mark a segment
    for deletion, and delay the actual freeing until there are no more active
    consumers. Per shmctl(IPC_RMID) manpage:

    ""
    Mark the segment to be destroyed. The segment will only actually
    be destroyed after the last process detaches it (i.e., when the
    shm_nattch member of the associated structure shmid_ds is zero).
    ""

    Unlike ipc_lock, paths that behave "correctly", at least per the manpage,
    involve controlling the ipc resource via *ctl(), doing the exact same
    validity check as ipc_lock right after acquiring the spinlock:

    if (!ipc_valid_object()) {
        err = -EIDRM;
        goto out_unlock;
    }

    Thus make ipc_lock consistent with the rest of ipc code and return -EIDRM
    in ipc_lock when !ipc_valid_object().

    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
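
The error-code split discussed in this commit and the previous one can be summarised with a small userspace model: an identifier that does not resolve, or whose sequence number is wrong, gives EINVAL, while an identifier whose object has been removed gives EIDRM. Everything here is a simplification for illustration. `struct obj_model`, `table`, `install_model()` and `lookup_model()` are hypothetical, and the ID encoding (`seq * NSLOTS + slot`) is an assumption of the sketch, not the kernel's layout.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NSLOTS 4

/* Hypothetical, simplified object: only the fields the checks need. */
struct obj_model {
    bool in_use;    /* slot holds an object */
    bool deleted;   /* object was removed (RMID) */
    int seq;        /* sequence number encoded in the user-visible ID */
};

static struct obj_model table[NSLOTS];

/* Place a live object with the given sequence number into a slot. */
void install_model(int slot, int seq)
{
    assert(slot >= 0 && slot < NSLOTS);
    table[slot] = (struct obj_model){ .in_use = true, .deleted = false, .seq = seq };
}

/* IDs in this model are seq * NSLOTS + slot. */
int lookup_model(int id)
{
    struct obj_model *obj;

    if (id < 0)
        return -EINVAL;
    obj = &table[id % NSLOTS];
    if (!obj->in_use)
        return -EINVAL;     /* no such identifier */
    if (obj->seq != id / NSLOTS)
        return -EINVAL;     /* stale or bogus sequence number */
    if (obj->deleted)
        return -EIDRM;      /* identifier existed but was removed */
    return 0;
}
```

The point of the model is only the mapping from failure cause to errno; the real code performs these checks at different points around taking the object's lock.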
     
  • ... to ipc_obtain_object_idr, which is more meaningful and makes the code
    slightly easier to follow.

    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • We currently use a full barrier on the sender side to avoid receiver
    tasks disappearing on us while still performing on the sender side wakeup.
    We lack however, the proper CPU-CPU interactions pairing on the receiver
    side which busy-waits for the message. Similarly, we do not need a full
    smp_mb, and can relax the semantics for the writer and reader sides of the
    message. This is safe as we are only ordering loads and stores to r_msg.
    And in both smp_wmb and smp_rmb, there are no stores after the calls
    _anyway_.

    This obviously applies for pipelined_send and expunge_all, for EIDRM when
    destroying a queue.

    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Upon every shm_lock call, we BUG_ON if an error was returned, indicating
    racing either in idr or in shm_destroy. Move this logic into the locking.

    [akpm@linux-foundation.org: simplify code]
    Signed-off-by: Davidlohr Bueso
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Use kvfree() instead of open-coding it.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

08 May, 2015

1 commit

  • This patch moves the wake_up_process() invocation so it is not done under
    the info->lock by making use of a lockless wake_q. With this change, the
    waiter is woken up once it is STATE_READY and it does not need to loop
    on SMP if it is still in STATE_PENDING. In the timeout case we still need
    to grab the info->lock to verify the state.

    This change should also avoid the introduction of preempt_disable() in -rt
    which avoids a busy-loop which polls for the STATE_PENDING -> STATE_READY
    change if the waiter has a higher priority compared to the waker.

    Additionally, this patch micro-optimizes wq_sleep by using the cheaper
    cousin of set_current_state(TASK_INTERRUPTIBLE) as we will block no
    matter what, thus get rid of the implied barrier.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: George Spelvin
    Acked-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Chris Mason
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Manfred Spraul
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Steven Rostedt
    Cc: dave@stgolabs.net
    Link: http://lkml.kernel.org/r/1430748166.1940.17.camel@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
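
The deferred-wakeup pattern described above (queue the wakeup while holding the lock, perform it after unlocking) can be modelled in a few lines. This is a single-threaded illustration of the ordering only, not the kernel's wake_q API: `struct waiter_model`, `struct wake_list`, `info_lock_held` and `send_model()` are hypothetical names, and a boolean stands in for info->lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAKE 8

/* Hypothetical waiter; woken_under_lock records a violation of the rule. */
struct waiter_model {
    bool ready;
    bool woken;
    bool woken_under_lock;
};

/* Local list of waiters to wake once the lock has been dropped. */
struct wake_list {
    struct waiter_model *w[MAX_WAKE];
    size_t n;
};

static bool info_lock_held;     /* stand-in for info->lock */

static void wake_list_add(struct wake_list *q, struct waiter_model *w)
{
    assert(q->n < MAX_WAKE);
    q->w[q->n++] = w;
}

static void wake_all(struct wake_list *q)
{
    size_t i;

    for (i = 0; i < q->n; i++) {
        q->w[i]->woken = true;
        q->w[i]->woken_under_lock = info_lock_held;
    }
    q->n = 0;
}

/* Sender: mark the waiter ready under the lock, wake it after unlocking. */
void send_model(struct waiter_model *receiver)
{
    struct wake_list q = { .n = 0 };

    info_lock_held = true;          /* lock */
    receiver->ready = true;         /* hand over the message */
    wake_list_add(&q, receiver);    /* only queue the wakeup here */
    info_lock_held = false;         /* unlock */

    wake_all(&q);                   /* wakeups happen outside the lock */
}
```

Because the waiter is already marked ready before it is woken, it has no reason to spin waiting for the state change once it runs.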
     

27 Apr, 2015

1 commit

  • Pull fourth vfs update from Al Viro:
    "d_inode() annotations from David Howells (sat in for-next since before
    the beginning of merge window) + four assorted fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    RCU pathwalk breakage when running into a symlink overmounting something
    fix I_DIO_WAKEUP definition
    direct-io: only inc/dec inode->i_dio_count for file systems
    fs/9p: fix readdir()
    VFS: assorted d_backing_inode() annotations
    VFS: fs/inode.c helpers: d_inode() annotations
    VFS: fs/cachefiles: d_backing_inode() annotations
    VFS: fs library helpers: d_inode() annotations
    VFS: assorted weird filesystems: d_inode() annotations
    VFS: normal filesystems (and lustre): d_inode() annotations
    VFS: security/: d_inode() annotations
    VFS: security/: d_backing_inode() annotations
    VFS: net/: d_inode() annotations
    VFS: net/unix: d_backing_inode() annotations
    VFS: kernel/: d_inode() annotations
    VFS: audit: d_backing_inode() annotations
    VFS: Fix up some ->d_inode accesses in the chelsio driver
    VFS: Cachefiles should perform fs modifications on the top layer only
    VFS: AF_UNIX sockets should call mknod on the top layer only

    Linus Torvalds
     

16 Apr, 2015

2 commits


18 Feb, 2015

1 commit

  • Call __set_current_state() instead of assigning the new state directly.
    These interfaces also aid CONFIG_DEBUG_ATOMIC_SLEEP environments, keeping
    track of who changed the state.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

17 Dec, 2014

1 commit

  • Pull vfs pile #2 from Al Viro:
    "Next pile (and there'll be one or two more).

    The large piece in this one is getting rid of /proc/*/ns/* weirdness;
    among other things, it allows to (finally) make nameidata completely
    opaque outside of fs/namei.c, making for easier further cleanups in
    there"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coda_venus_readdir(): use file_inode()
    fs/namei.c: fold link_path_walk() call into path_init()
    path_init(): don't bother with LOOKUP_PARENT in argument
    fs/namei.c: new helper (path_cleanup())
    path_init(): store the "base" pointer to file in nameidata itself
    make default ->i_fop have ->open() fail with ENXIO
    make nameidata completely opaque outside of fs/namei.c
    kill proc_ns completely
    take the targets of /proc/*/ns/* symlinks to separate fs
    bury struct proc_ns in fs/proc
    copy address of proc_ns_ops into ns_common
    new helpers: ns_alloc_inum/ns_free_inum
    make proc_ns_operations work with struct ns_common * instead of void *
    switch the rest of proc_ns_operations to working with &...->ns
    netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns
    make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns
    common object embedded into various struct ....ns

    Linus Torvalds
     

14 Dec, 2014

4 commits

  • Andrew Morton noted

    http://lkml.kernel.org/r/20141104142027.a7a0d010772d84560b445f59@linux-foundation.org

    that the shmdt uses inode->i_size outside of i_mutex being held.
    There is one more case in shm.c in shm_destroy(). This converts
    both users over to use i_size_read().

    Signed-off-by: Dave Hansen
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • This is a highly-contrived scenario. But, a single shmdt() call can be
    induced into unmapping memory from multiple shm segments. Example code
    is here:

    http://www.sr71.net/~dave/intel/shmfun.c

    The fix is pretty simple: Record the 'struct file' for the first VMA we
    encounter and then stick to it. Decline to unmap anything not from the
    same file and thus the same segment.

    I found this by inspection and the odds of anyone hitting this in practice
    are pretty darn small.

    Lightly tested, but it's a pretty small patch.

    Signed-off-by: Dave Hansen
    Cc: Manfred Spraul
    Reviewed-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
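
The fix described above (record the file of the first mapping, then refuse to unmap anything backed by a different file) can be sketched as a userspace model. The types and names are hypothetical: `struct vma_model` is not the kernel's vm_area_struct, `detach_model()` is not shmdt(), and the array is assumed to be sorted by start address with no overlaps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified mapping record (not the kernel's vm_area_struct). */
struct vma_model {
    unsigned long start;
    unsigned long end;
    const void *file;   /* identifies the backing segment */
    int mapped;
};

/*
 * Unmap the mappings in [addr, addr + size) that belong to the same file as
 * the first mapping found at addr. Returns the number of mappings unmapped.
 * Assumes vmas[] is sorted by start address.
 */
int detach_model(struct vma_model *vmas, size_t n,
                 unsigned long addr, unsigned long size)
{
    const void *file = NULL;
    int unmapped = 0;
    size_t i;

    assert(vmas != NULL);
    for (i = 0; i < n; i++) {
        struct vma_model *v = &vmas[i];

        if (!v->mapped)
            continue;
        if (file == NULL) {
            /* The first mapping must start exactly at addr. */
            if (v->start != addr)
                continue;
            file = v->file;     /* record the file and stick to it */
        } else if (v->file != file || v->end > addr + size) {
            continue;           /* different segment or out of range */
        }
        v->mapped = 0;
        unmapped++;
    }
    return unmapped;
}
```

A mapping of another segment that happens to lie inside the address range is left untouched, which is the behaviour the patch is after.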
     
  • SysV can be abused to allocate locked kernel memory. For most systems, a
    small limit doesn't make sense, see the discussion with regards to SHMMAX.

    Therefore: increase MSGMNI to the maximum supported.

    And: If we ignore the risk of locking too much memory, then an automatic
    scaling of MSGMNI doesn't make sense. Therefore the logic can be removed.

    The code preserves auto_msgmni to avoid breaking any user space applications
    that expect that the value exists.

    Notes:
    1) If an administrator must limit the memory allocations, then he can set
    MSGMNI as necessary.

    Or he can disable sysv entirely (as e.g. done by Android).

    2) MSGMAX and MSGMNB are intentionally not increased, as these values are used
    to control latency vs. throughput:
    If MSGMNB is large, then msgsnd() just returns and more messages can be queued
    before a task switch to a task that calls msgrcv() is forced.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • When I fixed bugs in the sem_lock() logic, I was more conservative than
    necessary. Therefore it is safe to replace the smp_mb() with smp_rmb().
    And: With smp_rmb(), semop() syscalls are up to 10% faster.

    The race we must protect against is:

    sem->lock is free
    sma->complex_count = 0
    sma->sem_perm.lock held by thread B

    thread A:

    A: spin_lock(&sem->lock)

    B: sma->complex_count++; (now 1)
    B: spin_unlock(&sma->sem_perm.lock);

    A: spin_is_locked(&sma->sem_perm.lock);
    A: XXXXX memory barrier
    A: if (sma->complex_count == 0)

    Thread A must read the increased complex_count value, i.e. the read must
    not be reordered with the read of sem_perm.lock done by spin_is_locked().

    Since it's about ordering of reads, smp_rmb() is sufficient.

    [akpm@linux-foundation.org: update sem_lock() comment, from Davidlohr]
    Signed-off-by: Manfred Spraul
    Reviewed-by: Davidlohr Bueso
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

11 Dec, 2014

2 commits

  • Al Viro
     
  • Pull VFS changes from Al Viro:
    "First pile out of several (there _definitely_ will be more). Stuff in
    this one:

    - unification of d_splice_alias()/d_materialize_unique()

    - iov_iter rewrite

    - killing a bunch of ->f_path.dentry users (and f_dentry macro).

    Getting that completed will make life much simpler for
    unionmount/overlayfs, since then we'll be able to limit the places
    sensitive to file _dentry_ to reasonably few. Which allows to have
    file_inode(file) pointing to inode in a covered layer, with dentry
    pointing to (negative) dentry in union one.

    Still not complete, but much closer now.

    - crapectomy in lustre (dead code removal, mostly)

    - "let's make seq_printf return nothing" preparations

    - assorted cleanups and fixes

    There _definitely_ will be more piles"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    copy_from_iter_nocache()
    new helper: iov_iter_kvec()
    csum_and_copy_..._iter()
    iov_iter.c: handle ITER_KVEC directly
    iov_iter.c: convert copy_to_iter() to iterate_and_advance
    iov_iter.c: convert copy_from_iter() to iterate_and_advance
    iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
    iov_iter.c: convert iov_iter_zero() to iterate_and_advance
    iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
    iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
    iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
    iov_iter.c: iterate_and_advance
    iov_iter.c: macros for iterating over iov_iter
    kill f_dentry macro
    dcache: fix kmemcheck warning in switch_names
    new helper: audit_file()
    nfsd_vfs_write(): use file_inode()
    ncpfs: use file_inode()
    kill f_dentry uses
    lockd: get rid of ->f_path.dentry->d_sb
    ...

    Linus Torvalds
     

05 Dec, 2014

5 commits


04 Dec, 2014

1 commit

  • ipc_addid() makes a new ipc identifier visible to everyone. New objects
    start as locked, so that the caller can complete the initialization
    after the call. Within struct sem_array, at least sma->sem_base and
    sma->sem_nsems are accessed without any locks, therefore this approach
    doesn't work.

    Thus: Move the ipc_addid() to the end of the initialization.

    Signed-off-by: Manfred Spraul
    Reported-by: Rik van Riel
    Acked-by: Rik van Riel
    Acked-by: Davidlohr Bueso
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     

20 Nov, 2014

1 commit

  • ... for situations when we don't have any candidate in pathnames - basically,
    in descriptor-based syscalls.

    [Folded the build fix for !CONFIG_AUDITSYSCALL configs from Chen Gang]

    Signed-off-by: Al Viro

    Al Viro
     

14 Oct, 2014

4 commits

  • Resolve some shadow warnings produced in W=2 builds by changing the name
    of some parameters and local variables. Change instances of "s64"
    because that clashes with the well-known typedef. Also change a local
    variable with the name "up" because that clashes with the name of the
    "up" function for semaphores. These are hazards so eliminate the
    hazards by renaming them.

    Signed-off-by: Mark Rustad
    Signed-off-by: Jeff Kirsher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Rustad
     
  • Using __seq_open_private() removes boilerplate code from
    sysvipc_proc_open().

    The resultant code is shorter and easier to follow.

    However, please note that __seq_open_private() calls kzalloc() rather than
    kmalloc() which may affect timing due to the memory initialisation
    overhead.

    Signed-off-by: Rob Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Jones
     
  • do_shmat() is the only user of ->start_stack (proc just reports its
    value), and this check looks ugly and wrong.

    The reason for this check is not clear at all, and it wrongly assumes that
    the stack can only grow down.

    But the main problem is that in general mm->start_stack has nothing to do
    with stack_vma->vm_start. Not only the application can switch to another
    stack and even unmap this area, setup_arg_pages() expands the stack
    without updating mm->start_stack during exec(). This means that in the
    likely case "addr > start_stack - size - PAGE_SIZE * 5" is simply
    impossible after find_vma_intersection() == F, or the stack can't grow
    anyway because of RLIMIT_STACK.

    Many thanks to Hugh for his explanations.

    Signed-off-by: Oleg Nesterov
    Acked-by: Hugh Dickins
    Cc: Cyrill Gorcunov
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • proc_dointvec_minmax() returns zero if a new value has been set. So we
    don't need to check all characters have been handled.

    Below you can find two examples. In the first, the new value has not been
    handled properly.

    $ strace ./a.out
    open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
    write(3, "0\n\0", 3) = 2
    close(3) = 0
    exit_group(0)
    $ cat /sys/kernel/debug/tracing/trace

    $ strace ./a.out
    open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
    write(3, "0\n", 2) = 2
    close(3) = 0

    $ cat /sys/kernel/debug/tracing/trace
    a.out-697 [000] .... 3280.998235: unregister_ipcns_notifier
    Cc: Mathias Krause
    Cc: Manfred Spraul
    Cc: Joe Perches
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Vagin
     

08 Oct, 2014

1 commit

  • Pull "trivial tree" updates from Jiri Kosina:
    "Usual pile from trivial tree everyone is so eagerly waiting for"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    Remove MN10300_PROC_MN2WS0038
    mei: fix comments
    treewide: Fix typos in Kconfig
    kprobes: update jprobe_example.c for do_fork() change
    Documentation: change "&" to "and" in Documentation/applying-patches.txt
    Documentation: remove obsolete pcmcia-cs from Changes
    Documentation: update links in Changes
    Documentation: Docbook: Fix generated DocBook/kernel-api.xml
    score: Remove GENERIC_HAS_IOMAP
    gpio: fix 'CONFIG_GPIO_IRQCHIP' comments
    tty: doc: Fix grammar in serial/tty
    dma-debug: modify check_for_stack output
    treewide: fix errors in printk
    genirq: fix reference in devm_request_threaded_irq comment
    treewide: fix synchronize_rcu() in comments
    checkstack.pl: port to AArch64
    doc: queue-sysfs: minor fixes
    init/do_mounts: better syntax description
    MIPS: fix comment spelling
    powerpc/simpleboot: fix comment
    ...

    Linus Torvalds
     

09 Sep, 2014

1 commit