12 Oct, 2016

25 commits

  • This is a patch that provides behavior that is more consistent, and
    probably less surprising to users. I consider the change optional, and
    welcome opinions about whether it should be applied.

    By default, pipes are created with a capacity of 64 kiB. However,
    /proc/sys/fs/pipe-max-size may be set smaller than this value. In this
    scenario, an unprivileged user could thus create a pipe whose initial
    capacity exceeds the limit. Therefore, it seems logical to cap the
    initial pipe capacity according to the value of pipe-max-size.

    The test program shown earlier in this patch series can be used to
    demonstrate the effect of the change brought about with this patch:

    # cat /proc/sys/fs/pipe-max-size
    1048576
    # sudo -u mtk ./test_F_SETPIPE_SZ 1
    Initial pipe capacity: 65536
    # echo 10000 > /proc/sys/fs/pipe-max-size
    # cat /proc/sys/fs/pipe-max-size
    16384
    # sudo -u mtk ./test_F_SETPIPE_SZ 1
    Initial pipe capacity: 16384
    # ./test_F_SETPIPE_SZ 1
    Initial pipe capacity: 65536

    The last two executions of 'test_F_SETPIPE_SZ' show that pipe-max-size
    caps the initial allocation for a new pipe for unprivileged users, but
    not for privileged users.
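
    The rule described above can be sketched in user space as follows.
    This is only an illustration under stated assumptions: the helper
    initial_pipe_capacity() and the DEFAULT_PIPE_BYTES constant are
    hypothetical names, and the "privileged" flag stands in for whatever
    capability check the kernel actually performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEFAULT_PIPE_BYTES 65536u   /* the 64 kiB default mentioned above */

/*
 * Hypothetical helper: pick the initial capacity (in bytes) of a new pipe.
 * Unprivileged users are capped at pipe_max_size; privileged users are not.
 */
static size_t initial_pipe_capacity(size_t pipe_max_size, bool privileged)
{
    assert(pipe_max_size > 0);

    if (!privileged && DEFAULT_PIPE_BYTES > pipe_max_size)
        return pipe_max_size;
    return DEFAULT_PIPE_BYTES;
}
```

    With pipe_max_size set to 16384, this returns 16384 for an
    unprivileged caller and 65536 for a privileged one, matching the
    last two runs in the transcript.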

    Link: http://lkml.kernel.org/r/31dc7064-2a17-9c5b-1df1-4e3012ee992c@gmail.com
    Signed-off-by: Michael Kerrisk
    Reviewed-by: Vegard Nossum
    Cc: Willy Tarreau
    Cc:
    Cc: Tetsuo Handa
    Cc: Jens Axboe
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Kerrisk (man-pages)
     
  • This is an optional patch, to provide a small performance
    improvement. Alter account_pipe_buffers() so that it returns the
    new value in user->pipe_bufs. This means that we can refactor
    too_many_pipe_buffers_soft() and too_many_pipe_buffers_hard() to
    avoid the costs of repeated use of atomic_long_read() to get the
    value user->pipe_bufs.

    Link: http://lkml.kernel.org/r/93e5f193-1e5e-3e1f-3a20-eae79b7e1310@gmail.com
    Signed-off-by: Michael Kerrisk
    Reviewed-by: Vegard Nossum
    Cc: Willy Tarreau
    Cc:
    Cc: Tetsuo Handa
    Cc: Jens Axboe
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Kerrisk (man-pages)
     
  • The limit checking in alloc_pipe_info() (used by pipe(2) and when
    opening a FIFO) has the following problems:

    (1) When checking capacity required for the new pipe, the checks against
    the limit in /proc/sys/fs/pipe-user-pages-{soft,hard} are made
    against existing consumption, and exclude the memory required for
    the new pipe capacity. As a consequence: (1) the memory allocation
    throttling provided by the soft limit does not kick in quite as
    early as it should, and (2) the user can overrun the hard limit.

    (2) As currently implemented, accounting and checking against the limits
    is done as follows:

    (a) Test whether the user has exceeded the limit.
    (b) Make new pipe buffer allocation.
    (c) Account new allocation against the limits.

    This is racy. Multiple processes may pass point (a) simultaneously,
    and then allocate pipe buffers that are accounted for only in step
    (c). The race means that the user's pipe buffer allocation could be
    pushed over the limit (by an arbitrary amount, depending on how
    unlucky we were in the race). [Thanks to Vegard Nossum for spotting
    this point, which I had missed.]

    This patch addresses the above problems as follows:

    * Alter the checks against limits to include the memory required for the
    new pipe.
    * Re-order the accounting step so that it precedes the buffer allocation.
    If the accounting step determines that a limit has been reached, revert
    the accounting and cause the operation to fail.
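
    The "account first, then allocate, revert on failure" ordering can
    be sketched in user space like this. It is not the kernel code:
    user_acct and alloc_accounted() are hypothetical names, C11 atomics
    stand in for the kernel's atomic operations, and a hard limit of 0
    is assumed to mean "no limit".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-user page counter. */
struct user_acct {
    atomic_long pipe_pages;
};

/*
 * Account first, then allocate. Returns the buffer, or NULL if the hard
 * limit would be exceeded or the allocation fails; in both failure cases
 * the accounting is reverted.
 */
static void *alloc_accounted(struct user_acct *u, long pages,
                             long hard_limit, size_t page_size)
{
    void *buf;
    long total;

    assert(u != NULL && pages > 0 && page_size > 0);

    /* The total seen here already includes this request. */
    total = atomic_fetch_add(&u->pipe_pages, pages) + pages;
    if (hard_limit != 0 && total > hard_limit)
        goto revert;

    buf = calloc((size_t)pages, page_size);
    if (buf == NULL)
        goto revert;
    return buf;

revert:
    atomic_fetch_sub(&u->pipe_pages, pages);
    return NULL;
}
```

    Because each caller checks the total returned by its own atomic
    add, two concurrent callers cannot both pass the check on the same
    stale value of the counter.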

    Link: http://lkml.kernel.org/r/8ff3e9f9-23f6-510c-644f-8e70cd1c0bd9@gmail.com
    Signed-off-by: Michael Kerrisk
    Reviewed-by: Vegard Nossum
    Cc: Willy Tarreau
    Cc:
    Cc: Tetsuo Handa
    Cc: Jens Axboe
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Kerrisk (man-pages)
     
  • Replace an 'if' block that covers most of the code in this function
    with a 'goto'. This makes the code a little simpler to read, and also
    simplifies the next patch (fix limit checking in alloc_pipe_info()).

    Link: http://lkml.kernel.org/r/aef030c1-0257-98a9-4988-186efa48530c@gmail.com
    Signed-off-by: Michael Kerrisk
    Reviewed-by: Vegard Nossum
    Cc: Willy Tarreau
    Cc:
    Cc: Tetsuo Handa
    Cc: Jens Axboe
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Kerrisk (man-pages)
     
  • The limit checking in pipe_set_size() (used by fcntl(F_SETPIPE_SZ))
    has the following problems:

    (1) When increasing the pipe capacity, the checks against the limits in
    /proc/sys/fs/pipe-user-pages-{soft,hard} are made against existing
    consumption, and exclude the memory required for the increased pipe
    capacity. The new increase in pipe capacity can then push the total
    memory used by the user for pipes (possibly far) over a limit. This
    can also trigger the problem described next.

    (2) The limit checks are performed even when the new pipe capacity is
    less than the existing pipe capacity. This can lead to problems if a
    user sets a large pipe capacity, and then the limits are lowered,
    with the result that the user will no longer be able to decrease the
    pipe capacity.

    (3) As currently implemented, accounting and checking against the
    limits is done as follows:

    (a) Test whether the user has exceeded the limit.
    (b) Make new pipe buffer allocation.
    (c) Account new allocation against the limits.

    This is racy. Multiple processes may pass point (a)
    simultaneously, and then allocate pipe buffers that are accounted
    for only in step (c). The race means that the user's pipe buffer
    allocation could be pushed over the limit (by an arbitrary amount,
    depending on how unlucky we were in the race). [Thanks to Vegard
    Nossum for spotting this point, which I had missed.]

    This patch addresses the above problems as follows:

    * Perform checks against the limits only when increasing a pipe's
    capacity; an unprivileged user can always decrease a pipe's capacity.
    * Alter the checks against limits to include the memory required for
    the new pipe capacity.
    * Re-order the accounting step so that it precedes the buffer
    allocation. If the accounting step determines that a limit has
    been reached, revert the accounting and cause the operation to fail.

    The program below can be used to demonstrate problems 1 and 2, and the
    effect of the fix. The program takes one or more command-line arguments.
    The first argument specifies the number of pipes that the program should
    create. The remaining arguments are, alternately, pipe capacities that
    should be set using fcntl(F_SETPIPE_SZ), and sleep intervals (in
    seconds) between the fcntl() operations. (The sleep intervals allow the
    possibility to change the limits between fcntl() operations.)

    Problem 1
    =========

    Using the test program on an unpatched kernel, we first set some
    limits:

    # echo 0 > /proc/sys/fs/pipe-user-pages-soft
    # echo 1000000000 > /proc/sys/fs/pipe-max-size
    # echo 10000 > /proc/sys/fs/pipe-user-pages-hard # 40.96 MB

    Then show that we can set a pipe with a capacity (100MB) that is
    over the hard limit:

    # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000
    Initial pipe capacity: 65536
    Loop 1: set pipe capacity to 100000000 bytes
    F_SETPIPE_SZ returned 134217728

    Now set the capacity to 100MB twice. The second call fails (which is
    probably surprising to most users, since it seems like a no-op):

    # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000 0 100000000
    Initial pipe capacity: 65536
    Loop 1: set pipe capacity to 100000000 bytes
    F_SETPIPE_SZ returned 134217728
    Loop 2: set pipe capacity to 100000000 bytes
    Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted

    With a patched kernel, setting a capacity over the limit fails at the
    first attempt:

    # echo 0 > /proc/sys/fs/pipe-user-pages-soft
    # echo 1000000000 > /proc/sys/fs/pipe-max-size
    # echo 10000 > /proc/sys/fs/pipe-user-pages-hard
    # sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000
    Initial pipe capacity: 65536
    Loop 1: set pipe capacity to 100000000 bytes
    Loop 1, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted

    There is a small chance that the change to fix this problem could
    break user-space, since there are cases where fcntl(F_SETPIPE_SZ)
    calls that previously succeeded might fail. However, the chances are
    small, since (a) the pipe-user-pages-{soft,hard} limits are new (in
    4.5), and (b) the default soft/hard limits are high/unlimited. Therefore,
    it seems warranted to make these limits operate more precisely (and
    behave more like what users probably expect).

    Problem 2
    =========

    Running the test program on an unpatched kernel, we first set some limits:

    # getconf PAGESIZE
    4096
    # echo 0 > /proc/sys/fs/pipe-user-pages-soft
    # echo 1000000000 > /proc/sys/fs/pipe-max-size
    # echo 10000 > /proc/sys/fs/pipe-user-pages-hard # 40.96 MB

    Now perform two fcntl(F_SETPIPE_SZ) operations on a single pipe,
    first setting a pipe capacity (10MB), sleeping for a few seconds,
    during which time the hard limit is lowered, and then set pipe
    capacity to a smaller amount (5MB):

    # sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 &
    [1] 748
    # Initial pipe capacity: 65536
    Loop 1: set pipe capacity to 10000000 bytes
    F_SETPIPE_SZ returned 16777216
    Sleeping 15 seconds

    # echo 1000 > /proc/sys/fs/pipe-user-pages-hard # 4.096 MB
    # Loop 2: set pipe capacity to 5000000 bytes
    Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted

    In this case, the user should be able to lower the pipe capacity.

    With a kernel that has the patch below, the second fcntl()
    succeeds:

    # echo 0 > /proc/sys/fs/pipe-user-pages-soft
    # echo 1000000000 > /proc/sys/fs/pipe-max-size
    # echo 10000 > /proc/sys/fs/pipe-user-pages-hard
    # sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 &
    [1] 3215
    # Initial pipe capacity: 65536
    # Loop 1: set pipe capacity to 10000000 bytes
    F_SETPIPE_SZ returned 16777216
    Sleeping 15 seconds

    # echo 1000 > /proc/sys/fs/pipe-user-pages-hard

    # Loop 2: set pipe capacity to 5000000 bytes
    F_SETPIPE_SZ returned 8388608
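
    The values returned by F_SETPIPE_SZ in the transcripts above are
    larger than the requested sizes. They are consistent with rounding
    the request up to a whole number of pages and then to a power-of-two
    number of pages. The sketch below reproduces that arithmetic; the
    helper name round_capacity() is hypothetical, and the rounding rule
    is stated here as an inference from the output, not quoted from the
    patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round a requested capacity up to a power-of-two number of pages
 * and return the result in bytes.
 */
static size_t round_capacity(size_t bytes, size_t page_size)
{
    size_t pages, pow2 = 1;

    assert(bytes > 0 && page_size > 0);

    pages = (bytes + page_size - 1) / page_size;   /* round up to pages */
    while (pow2 < pages)                           /* next power of two */
        pow2 <<= 1;
    return pow2 * page_size;
}
```

    With a 4096-byte page, requests of 100000000, 10000000 and 5000000
    bytes give 134217728, 16777216 and 8388608 bytes respectively.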

    8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---

    /* test_F_SETPIPE_SZ.c

    (C) 2016, Michael Kerrisk; licensed under GNU GPL version 2 or later

    Test operation of fcntl(F_SETPIPE_SZ) for setting pipe capacity
    and interactions with limits defined by /proc/sys/fs/pipe-* files.
    */

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>

    int
    main(int argc, char *argv[])
    {
        int (*pfd)[2];
        int npipes;
        int pcap, rcap;
        int j, p, s, stime, loop;

        if (argc < 2) {
            fprintf(stderr, "Usage: %s num-pipes "
                    "[pipe-capacity sleep-time]...\n", argv[0]);
            exit(EXIT_FAILURE);
        }

        npipes = atoi(argv[1]);

        pfd = calloc(npipes, sizeof (int [2]));
        if (pfd == NULL) {
            perror("calloc");
            exit(EXIT_FAILURE);
        }

        for (j = 0; j < npipes; j++) {
            if (pipe(pfd[j]) == -1) {
                fprintf(stderr, "Loop %d: pipe() failed: ", j);
                perror("pipe");
                exit(EXIT_FAILURE);
            }
        }

        printf("Initial pipe capacity: %d\n", fcntl(pfd[0][0], F_GETPIPE_SZ));

        for (j = 2; j < argc; j += 2) {
            loop = j / 2;
            pcap = atoi(argv[j]);
            printf(" Loop %d: set pipe capacity to %d bytes\n", loop, pcap);

            for (p = 0; p < npipes; p++) {
                s = fcntl(pfd[p][0], F_SETPIPE_SZ, pcap);
                if (s == -1) {
                    fprintf(stderr, " Loop %d, pipe %d: F_SETPIPE_SZ "
                            "failed: ", loop, p);
                    perror("fcntl");
                    exit(EXIT_FAILURE);
                }

                if (p == 0) {
                    printf(" F_SETPIPE_SZ returned %d\n", s);
                    rcap = s;
                } else {
                    if (s != rcap) {
                        fprintf(stderr, " Loop %d, pipe %d: F_SETPIPE_SZ "
                                "unexpected return: %d\n", loop, p, s);
                        exit(EXIT_FAILURE);
                    }
                }

                stime = (j + 1 < argc) ? atoi(argv[j + 1]) : 0;
                if (stime > 0) {
                    printf(" Sleeping %d seconds\n", stime);
                    sleep(stime);
                }
            }
        }

        exit(EXIT_SUCCESS);
    }

    8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---

    Patch history:

    v2
    * Switch order of test in 'if' statement to avoid function call
    (to capability()) in normal path. [This is a fix to a preexisting
    wart in the code. Thanks to Willy Tarreau]
    * Perform (size > pipe_max_size) check before calling
    account_pipe_buffers(). [Thanks to Vegard Nossum]
    Quoting Vegard:

    The potential problem happens if the user passes a very large number
    which will overflow pipe->user->pipe_bufs.

    On 32-bit, sizeof(int) == sizeof(long), so if they pass arg = INT_MAX
    then round_pipe_size() returns INT_MAX. Although it's true that the
    accounting is done in terms of pages and not bytes, so you'd need on
    the order of (1 << 13) = 8192 processes hitting the limit at the same
    time in order to make it overflow, which seems a bit unlikely.

    (See https://lkml.org/lkml/2016/8/12/215 for another discussion on the
    limit checking)
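
    The arithmetic in Vegard's comment can be checked with a small
    sketch. It assumes a 4096-byte page and treats the counter as 32
    bits wide; the helper name requests_to_overflow() is hypothetical.

```c
#include <assert.h>
#include <limits.h>

/*
 * Number of maximal-size requests needed before a page counter that is
 * counter_bits wide wraps around.
 */
static unsigned long long requests_to_overflow(unsigned counter_bits,
                                               unsigned long long max_bytes,
                                               unsigned long long page_size)
{
    unsigned long long pages_per_request;

    assert(counter_bits < 64 && max_bytes > 0 && page_size > 0);

    /* Accounting is done in pages, so convert the byte count first. */
    pages_per_request = (max_bytes + page_size - 1) / page_size;
    return (1ULL << counter_bits) / pages_per_request;
}
```

    A request of INT_MAX bytes is 1 << 19 pages, so a 32-bit counter
    wraps after (1 << 32) / (1 << 19) = (1 << 13) = 8192 such requests.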

    Link: http://lkml.kernel.org/r/1e464945-536b-2420-798b-e77b9c7e8593@gmail.com
    Signed-off-by: Michael Kerrisk
    Reviewed-by: Vegard Nossum
    Cc: Willy Tarreau
    Cc:
    Cc: Tetsuo Handa
    Cc: Jens Axboe
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Kerrisk (man-pages)
     
  • This is a preparatory patch for following work. account_pipe_buffers()
    performs accounting in the 'user_struct'. There is no need to pass a
    pointer to a 'pipe_inode_info' struct (which is then dereferenced to
    obtain a pointer to the 'user' field). Instead, pass a pointer directly
    to the 'user_struct'. This change is needed in preparation for a
    subsequent patch that fixes the limit checking in alloc_pipe_info()
    (and the resulting code is a little more logical).

    Link: http://lkml.kernel.org/r/7277bf8c-a6fc-4a7d-659c-f5b145c981ab@gmail.com
    Signed-off-by: Michael Kerrisk
    Reviewed-by: Vegard Nossum
    Cc: Willy Tarreau
    Cc:
    Cc: Tetsuo Handa
    Cc: Jens Axboe
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Kerrisk (man-pages)
     
  • This is a preparatory patch for following work. Move the F_SETPIPE_SZ
    limit-checking logic from pipe_fcntl() into pipe_set_size(). This
    simplifies the code a little, and allows for reworking required in
    a later patch that fixes the limit checking in pipe_set_size().

    Link: http://lkml.kernel.org/r/3701b2c5-2c52-2c3e-226d-29b9deb29b50@gmail.com
    Signed-off-by: Michael Kerrisk
    Reviewed-by: Vegard Nossum
    Cc: Willy Tarreau
    Cc:
    Cc: Tetsuo Handa
    Cc: Jens Axboe
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Kerrisk (man-pages)
     
  • Patch series "pipe: fix limit handling", v2.

    When changing a pipe's capacity with fcntl(F_SETPIPE_SZ), various limits
    defined by /proc/sys/fs/pipe-* files are checked to see if unprivileged
    users are exceeding limits on memory consumption.

    While documenting and testing the operation of these limits I noticed
    that, as currently implemented, these checks have a number of problems:

    (1) When increasing the pipe capacity, the checks against the limits
    in /proc/sys/fs/pipe-user-pages-{soft,hard} are made against
    existing consumption, and exclude the memory required for the
    increased pipe capacity. The new increase in pipe capacity can then
    push the total memory used by the user for pipes (possibly far) over
    a limit. This can also trigger the problem described next.

    (2) The limit checks are performed even when the new pipe capacity
    is less than the existing pipe capacity. This can lead to problems
    if a user sets a large pipe capacity, and then the limits are
    lowered, with the result that the user will no longer be able to
    decrease the pipe capacity.

    (3) As currently implemented, accounting and checking against the
    limits is done as follows:

    (a) Test whether the user has exceeded the limit.
    (b) Make new pipe buffer allocation.
    (c) Account new allocation against the limits.

    This is racy. Multiple processes may pass point (a) simultaneously,
    and then allocate pipe buffers that are accounted for only in step
    (c). The race means that the user's pipe buffer allocation could be
    pushed over the limit (by an arbitrary amount, depending on how
    unlucky we were in the race). [Thanks to Vegard Nossum for spotting
    this point, which I had missed.]

    This patch series addresses these three problems.

    This patch (of 8):

    This is a minor preparatory patch. After subsequent patches,
    round_pipe_size() will be called from pipe_set_size(), so place
    round_pipe_size() above pipe_set_size().

    Link: http://lkml.kernel.org/r/91a91fdb-a959-ba7f-b551-b62477cc98a1@gmail.com
    Signed-off-by: Michael Kerrisk
    Reviewed-by: Vegard Nossum
    Cc: Willy Tarreau
    Cc:
    Cc: Tetsuo Handa
    Cc: Jens Axboe
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Kerrisk (man-pages)
     
  • The cmd part of this struct is the same as an index of itself within
    _ioctls[]. In fact this cmd is unused, so we can drop this part.

    Link: http://lkml.kernel.org/r/20160831033414.9910.66697.stgit@pluto.themaw.net
    Signed-off-by: Tomohiro Kusumi
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomohiro Kusumi
     
  • Having this in autofs_i.h gives the illusion that uncommenting this enables
    pr_debug(), but it doesn't enable all the pr_debug() in autofs because
    inclusion order matters.

    XFS has the same DEBUG macro in its core header fs/xfs/xfs.h, however XFS
    seems to have a rule to include this prior to other XFS headers as well as
    kernel headers. This is not the case with autofs, and DEBUG could be
    enabled via Makefile, so autofs should just get rid of this comment to
    make the code less confusing. It's a comment, so there is literally no
    functional difference.

    Link: http://lkml.kernel.org/r/20160831033409.9910.77067.stgit@pluto.themaw.net
    Signed-off-by: Tomohiro Kusumi
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomohiro Kusumi
     
  • All other warnings use "cmd(0x%08x)" and this is the only one with
    "cmd(%d)". (The output below comes from my userspace debug program,
    not from the automount daemon.)

    [ 1139.905676] autofs4:pid:1640:check_dev_ioctl_version: ioctl control interface version mismatch: kernel(1.0), user(0.0), cmd(-1072131215)
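
    To see why the hexadecimal form is easier to read, the value from
    the log line above can be formatted the way the other warnings do.
    This is a user-space sketch; format_cmd() is a hypothetical helper,
    not a function from autofs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format an ioctl command number as "cmd(0x%08x)", like the other warnings. */
static void format_cmd(char *buf, size_t len, unsigned int cmd)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "cmd(0x%08x)", cmd);
}
```

    Passing (unsigned int)-1072131215 produces "cmd(0xc0189371)",
    whereas "%d" prints the same bits as a negative decimal number.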

    Link: http://lkml.kernel.org/r/20160812024851.12352.75458.stgit@pluto.themaw.net
    Signed-off-by: Tomohiro Kusumi
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomohiro Kusumi
     
  • No functional changes, based on the following justification.

    1. Make the code more consistent using the ioctl vector _ioctls[],
    rather than assigning NULL only for this ioctl command.
    2. Remove goto done; for better maintainability in the long run.
    3. The existing code is based on the fact that validate_dev_ioctl()
    sets ioctl version for any command, but AUTOFS_DEV_IOCTL_VERSION_CMD
    should explicitly set it regardless of the default behavior.

    Link: http://lkml.kernel.org/r/20160812024846.12352.9885.stgit@pluto.themaw.net
    Signed-off-by: Tomohiro Kusumi
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • The count of miscellaneous device ioctls in fs/autofs4/autofs_i.h is wrong.

    The number of ioctls is the difference between AUTOFS_DEV_IOCTL_VERSION_CMD
    and AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD (14) not the difference between
    AUTOFS_IOC_COUNT and 11 (21).

    [kusumi.tomohiro@gmail.com: fix typo that made the count macro negative]
    Link: http://lkml.kernel.org/r/20160831033420.9910.16809.stgit@pluto.themaw.net
    Link: http://lkml.kernel.org/r/20160812024841.12352.11975.stgit@pluto.themaw.net
    Signed-off-by: Ian Kent
    Cc: Tomohiro Kusumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • This isn't a return value, so change the message to indicate the status is
    the result of may_umount().

    (or locate pr_debug() after put_user() with the same message)

    Link: http://lkml.kernel.org/r/20160812024836.12352.74628.stgit@pluto.themaw.net
    Signed-off-by: Tomohiro Kusumi
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomohiro Kusumi
     
  • Returning -ENOTTY here fails to free dynamically allocated param.
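
    The general shape of such a fix is to route the error through the
    common cleanup path instead of returning directly. The sketch below
    is a generic user-space illustration, not the autofs code:
    handle_request(), cmd_is_valid and the live_params counter are all
    hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

static int live_params;   /* outstanding allocations, for demonstration only */

static int handle_request(const char *arg, int cmd_is_valid)
{
    int err = 0;
    char *param;

    assert(arg != NULL);

    /* Dynamically allocated copy of the caller's parameter. */
    param = malloc(strlen(arg) + 1);
    if (param == NULL)
        return -ENOMEM;
    strcpy(param, arg);
    live_params++;

    if (!cmd_is_valid) {
        err = -ENOTTY;
        goto out;          /* a bare "return -ENOTTY" here would leak param */
    }

    /* ... the command would be dispatched here ... */

out:
    free(param);
    live_params--;
    return err;
}
```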

    Link: http://lkml.kernel.org/r/20160812024815.12352.69153.stgit@pluto.themaw.net
    Signed-off-by: Tomohiro Kusumi
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomohiro Kusumi
     
  • These two were left from commit aa55ddf340c9 ("autofs4: remove unused
    ioctls") which removed unused ioctls.

    Link: http://lkml.kernel.org/r/20160812024810.12352.96377.stgit@pluto.themaw.net
    Signed-off-by: Tomohiro Kusumi
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomohiro Kusumi
     
  • Free dentry data allocated by autofs4_new_ino() with autofs4_free_ino()
    instead of raw kfree(), since we have an interface to free autofs_info*.

    This patch was modified to remove the need to set the dentry info field to
    NULL due to a change in the previous patch.

    Link: http://lkml.kernel.org/r/20160812024805.12352.43650.stgit@pluto.themaw.net
    Signed-off-by: Tomohiro Kusumi
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomohiro Kusumi
     
  • The inode allocation failure case in autofs4_dir_symlink() frees the
    autofs dentry info of the dentry without setting ->d_fsdata to NULL.

    That could lead to a double free so just get rid of the free and leave it
    to ->d_release().

    Link: http://lkml.kernel.org/r/20160812024759.12352.10653.stgit@pluto.themaw.net
    Signed-off-by: Ian Kent
    Cc: Tomohiro Kusumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • It's invalid if the given mode is neither dir nor link, so warn on else
    case.

    Link: http://lkml.kernel.org/r/20160812024754.12352.8536.stgit@pluto.themaw.net
    Signed-off-by: Tomohiro Kusumi
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomohiro Kusumi
     
  • Somewhere along the line the error handling gotos have become incorrect.

    Link: http://lkml.kernel.org/r/20160812024749.12352.15100.stgit@pluto.themaw.net
    Signed-off-by: Ian Kent
    Cc: Tomohiro Kusumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • This patch does what the comment below suggests. The test can be done
    earlier, and it is better to do it before the various functions get
    called during initialization.

    /* Couldn't this be tested earlier? */

    Link: http://lkml.kernel.org/r/20160812024744.12352.43075.stgit@pluto.themaw.net
    Signed-off-by: Tomohiro Kusumi
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomohiro Kusumi
     
  • autofs4_kill_sb() doesn't need to be declared as extern, and no other
    functions in .h are explicitly declared as extern.

    Link: http://lkml.kernel.org/r/20160812024739.12352.99354.stgit@pluto.themaw.net
    Signed-off-by: Tomohiro Kusumi
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tomohiro Kusumi
     
  • The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
    with the number of fds passed. We had a customer report page allocation
    failures of order-4 for this allocation. This is a costly order, so it might
    easily fail, as the VM expects such allocation to have a lower-order fallback.

    Such a trivial fallback is vmalloc(), as the memory doesn't have to be physically
    contiguous and the allocation is temporary for the duration of the syscall
    only. There were some concerns, whether this would have negative impact on the
    system by exposing vmalloc() to userspace. Although an excessive use of vmalloc
    can cause some system wide performance issues - TLB flushes etc. - a large
    order allocation is not for free either and an excessive reclaim/compaction can
    have a similar effect. Also note that the size is effectively limited by
    RLIMIT_NOFILE which defaults to 1024 on the systems I checked. That means the
    bitmaps will fit well within a single page and thus the vmalloc() fallback could
    be only exercised for processes where root allows a higher limit.

    Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
    it doesn't need this kind of fallback.
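
    The size estimate above can be reproduced with a short sketch. It
    assumes six bitmaps per call (three sets passed in and three result
    sets), which is a detail not stated in this message, and the helper
    name select_bitmap_bytes() is hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Bytes needed for six fd bitmaps covering nfds descriptors. */
static size_t select_bitmap_bytes(size_t nfds)
{
    size_t bits_per_long = sizeof(long) * CHAR_BIT;
    size_t longs;

    assert(nfds > 0);

    /* Each bitmap is rounded up to a whole number of longs. */
    longs = (nfds + bits_per_long - 1) / bits_per_long;
    return 6 * longs * sizeof(long);
}
```

    For the default limit of 1024 descriptors this gives 768 bytes,
    well under a 4096-byte page; the size only approaches multi-page
    allocations once the limit is raised by orders of magnitude.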

    [eric.dumazet@gmail.com: fix failure path logic]
    [akpm@linux-foundation.org: use proper type for size]
    Link: http://lkml.kernel.org/r/20160927084536.5923-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Alexander Viro
    Cc: Eric Dumazet
    Cc: David Laight
    Cc: Hillf Danton
    Cc: Nicholas Piggin
    Cc: Jason Baron
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • After much discussion, it seems that the fallocate feature flag
    FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature
    FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been whitelisted
    for zeroing SCSI UNMAP. Punch still requires that FALLOC_FL_KEEP_SIZE is
    set. A length that goes past the end of the device will be clamped to the
    device size if KEEP_SIZE is set; or will return -EINVAL if not. Both
    start and length must be aligned to the device's logical block size.

    Since the semantics of fallocate are fairly well established already, wire
    up the two pieces. The other fallocate variants (collapse range, insert
    range, and allocate blocks) are not supported.

    Link: http://lkml.kernel.org/r/147518379992.22791.8849838163218235007.stgit@birch.djwong.org
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Bart Van Assche
    Cc: Theodore Ts'o
    Cc: Martin K. Petersen
    Cc: Mike Snitzer # tweaked header
    Cc: Brian Foster
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     
  • In the dlm_migrate_request_handler(), when `ret' is -EEXIST, the mle
    should be freed, otherwise the memory will be leaked.

    Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4A3D3522A@H3CMLB12-EX.srv.huawei-3com.com
    Signed-off-by: Guozhonghua
    Reviewed-by: Mark Fasheh
    Cc: Eric Ren
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Joseph Qi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Guozhonghua
     

11 Oct, 2016

10 commits

  • Pull networking fixes from David Miller:

    1) Netfilter list handling fix, from Linus.

    2) RXRPC/AFS bug fixes (oops on call to serviceless endpoints, build
    warnings, missing notifications, etc.) From David Howells.

    3) Kernel log message missing newlines, from Colin Ian King.

    4) Don't enter direct reclaim in netlink dumps, the idea is to use a
    high order allocation first and fallback quickly to a 0-order
    allocation if such a high-order one cannot be done cheaply and
    without reclaim. From Eric Dumazet.

    5) Fix firmware download errors in btusb bluetooth driver, from Ethan
    Hsieh.

    6) Missing Kconfig deps for QCOM_EMAC, from Geert Uytterhoeven.

    7) Fix MDIO_XGENE dup Kconfig entry. From Laura Abbott.

    8) Constrain ipv6 rtr_solicits sysctl values properly, from Maciej
    Żenczykowski.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (65 commits)
    netfilter: Fix slab corruption.
    be2net: Enable VF link state setting for BE3
    be2net: Fix TX stats for TSO packets
    be2net: Update Copyright string in be_hw.h
    be2net: NCSI FW section should be properly updated with ethtool for BE3
    be2net: Provide an alternate way to read pf_num for BEx chips
    wan/fsl_ucc_hdlc: Fix size used in dma_free_coherent()
    net: macb: NULL out phydev after removing mdio bus
    xen-netback: make sure that hashes are not send to unaware frontends
    Fixing a bug in team driver due to incorrect 'unsigned int' to 'int' conversion
    MAINTAINERS: add myself as a maintainer of xen-netback
    ipv6 addrconf: disallow rtr_solicits < -1
    Bluetooth: btusb: Fix atheros firmware download error
    drivers: net: phy: Correct duplicate MDIO_XGENE entry
    ethernet: qualcomm: QCOM_EMAC should depend on HAS_DMA and HAS_IOMEM
    net: ethernet: mediatek: remove hwlro property in the device tree
    net: ethernet: mediatek: get hw lro capability by the chip id instead of by the dtsi
    net: ethernet: mediatek: get the chip id by ETHDMASYS registers
    net: bgmac: Fix errant feature flag check
    netlink: do not enter direct reclaim from netlink_dump()
    ...

    Linus Torvalds
     
  • Pull more vfs updates from Al Viro:
    "->rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     
  • Al Viro
     
  • Pull vfs xattr updates from Al Viro:
    "xattr stuff from Andreas

    This completes the switch to xattr_handler ->get()/->set() from
    ->getxattr/->setxattr/->removexattr"

    * 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: Remove {get,set,remove}xattr inode operations
    xattr: Stop calling {get,set,remove}xattr inode operations
    vfs: Check for the IOP_XATTR flag in listxattr
    xattr: Add __vfs_{get,set,remove}xattr helpers
    libfs: Use IOP_XATTR flag for empty directory handling
    vfs: Use IOP_XATTR flag for bad-inode handling
    vfs: Add IOP_XATTR inode operations flag
    vfs: Move xattr_resolve_name to the front of fs/xattr.c
    ecryptfs: Switch to generic xattr handlers
    sockfs: Get rid of getxattr iop
    sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
    kernfs: Switch to generic xattr handlers
    hfs: Switch to generic xattr handlers
    jffs2: Remove jffs2_{get,set,remove}xattr macros
    xattr: Remove unnecessary NULL attribute name check

    Linus Torvalds
     
  • Pull dlm fix from David Teigland:
    "This includes a bug fix for a bad memory access during workqueue
    cleanup, which can happen while shutting down the dlm networking
    layer"

    * tag 'dlm-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
    dlm: free workqueues after the connections

    Linus Torvalds
     
  • Pull Ceph updates from Ilya Dryomov:
    "The big ticket item here is support for rbd exclusive-lock feature,
    with maintenance operations offloaded to userspace (Douglas Fuller,
    Mike Christie and myself). Another block device bullet is a series
    fixing up layering error paths (myself).

    On the filesystem side, we've got patches that improve our handling of
    buffered vs dio write races (Neil Brown) and a few assorted fixes from
    Zheng. Also included a couple of random cleanups and a minor CRUSH
    update"

    * tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-client: (39 commits)
    crush: remove redundant local variable
    crush: don't normalize input of crush_ln iteratively
    libceph: ceph_build_auth() doesn't need ceph_auth_build_hello()
    libceph: use CEPH_AUTH_UNKNOWN in ceph_auth_build_hello()
    ceph: fix description for rsize and rasize mount options
    rbd: use kmalloc_array() in rbd_header_from_disk()
    ceph: use list_move instead of list_del/list_add
    ceph: handle CEPH_SESSION_REJECT message
    ceph: avoid accessing / when mounting a subpath
    ceph: fix mandatory flock check
    ceph: remove warning when ceph_releasepage() is called on dirty page
    ceph: ignore error from invalidate_inode_pages2_range() in direct write
    ceph: fix error handling of start_read()
    rbd: add rbd_obj_request_error() helper
    rbd: img_data requests don't own their page array
    rbd: don't call rbd_osd_req_format_read() for !img_data requests
    rbd: rework rbd_img_obj_exists_submit() error paths
    rbd: don't crash or leak on errors in rbd_img_obj_parent_read_full_callback()
    rbd: move bumping img_request refcount into rbd_obj_request_submit()
    rbd: mark the original request as done if stat request fails
    ...

    Linus Torvalds
     
  • Pull splice fixups from Al Viro:
    "A couple of fixups for interaction of pipe-backed iov_iter with
    O_DIRECT reads + constification of a couple of primitives in uio.h
    missed by previous rounds.

    Kudos to davej - his fuzzing has caught those bugs"

    * 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    [btrfs] fix check_direct_IO() for non-iovec iterators
    constify iov_iter_count() and iter_is_iovec()
    fix ITER_PIPE interaction with direct_IO

    Linus Torvalds
     
  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     
  • In btrfs check_direct_IO(), looking for duplicate ->iov_base makes
    sense only for iovec-backed iterators; for kvec-backed ones it's
    pointless, for bvec-backed ones it's pointless and broken on 32bit
    (we walk through an array of struct bio_vec accessing them as if
    they were struct iovec; works by accident on 64bit, but on 32bit
    it'll blow up) and for pipe-backed ones it's pointless and ends up
    oopsing.

    Signed-off-by: Al Viro

    Al Viro
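
    The rule in the commit above (inspect ->iov_base only when the
    iterator really is iovec-backed) can be illustrated with a small
    userspace sketch. This is not the btrfs code: enum iter_kind,
    struct user_vec, struct toy_iter and has_duplicate_base() are
    hypothetical stand-ins invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's iterator flavours. */
enum iter_kind { KIND_IOVEC, KIND_KVEC, KIND_BVEC, KIND_PIPE };

/* Hypothetical analogue of struct iovec. */
struct user_vec {
    void *base;
    size_t len;
};

struct toy_iter {
    enum iter_kind kind;
    const struct user_vec *vec; /* valid only when kind == KIND_IOVEC */
    size_t nr_segs;
};

/* Returns true if two segments of an iovec-backed iterator share a base. */
static bool has_duplicate_base(const struct toy_iter *it)
{
    /*
     * Any other flavour stores something else behind the pointer, so
     * the array must not be walked as if it held user_vec entries.
     */
    if (it->kind != KIND_IOVEC)
        return false;

    for (size_t i = 0; i < it->nr_segs; i++)
        for (size_t j = i + 1; j < it->nr_segs; j++)
            if (it->vec[i].base == it->vec[j].base)
                return true;
    return false;
}
```

    The early return on the iterator kind is the whole point of the
    sketch: the duplicate scan is never reached for kvec-, bvec- or
    pipe-backed iterators.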
     
  • Fix ITER_PIPE interaction with direct_IO by making sure we call
    iov_iter_advance() on the original iov_iter even if direct_IO
    (done on its copy) has returned 0.
    It's a no-op for old iov_iter flavours and does the right thing
    (== truncation of the stuff we'd allocated, but not filled) in
    ITER_PIPE case. Failures (e.g. -EIO) get caught and dealt with
    by cleanup in generic_file_read_iter().

    Signed-off-by: Al Viro

    Al Viro
     

10 Oct, 2016

1 commit

  • After backporting the commit ee44b4bc054a ("dlm: use sctp 1-to-1 API")
    series to a kernel with an older workqueue implementation that did
    not yet use RCU, it was noticed that dlm_lowcomms_stop() frees the
    workqueues too early: free_conn() still accesses that memory when
    it cancels any queued work.

    This issue was introduced by commit 0d737a8cfd83; before it, no
    attempt was made to cancel the queued work, so the problem did not
    occur.

    This patch fixes it by simply inverting the free order.

    Cc: stable@vger.kernel.org
    Fixes: 0d737a8cfd83 ("dlm: fix race while closing connections")
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David Teigland

    Marcelo Ricardo Leitner
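
    The ordering problem described in the commit above can be modelled
    in userspace. This is only a sketch under stated assumptions, not
    the dlm code: struct toy_workqueue, struct toy_conn and the toy_*()
    functions are hypothetical, and a "destroyed" flag stands in for
    freed memory so that the bad access is detectable without undefined
    behaviour.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a workqueue; 'destroyed' stands in for "freed". */
struct toy_workqueue {
    bool destroyed;
    int pending;
};

/* Hypothetical model of a connection that may have work queued. */
struct toy_conn {
    struct toy_workqueue *wq;
    bool has_queued_work;
};

/*
 * Cancel the connection's queued work, if any. Returns false when the
 * workqueue was already torn down, which models the bad memory access.
 */
static bool toy_free_conn(struct toy_conn *c)
{
    if (c->has_queued_work) {
        if (c->wq->destroyed)
            return false;
        c->wq->pending--;
        c->has_queued_work = false;
    }
    return true;
}

static void toy_destroy_workqueue(struct toy_workqueue *wq)
{
    wq->destroyed = true;
}

/*
 * Tear everything down. conns_first selects the order: true frees the
 * connections before the workqueue, false destroys the workqueue first.
 * Returns true if no connection touched a destroyed workqueue.
 */
static bool toy_stop(struct toy_workqueue *wq, struct toy_conn *conns,
                     int n, bool conns_first)
{
    bool ok = true;

    if (!conns_first)
        toy_destroy_workqueue(wq);
    for (int i = 0; i < n; i++)
        if (!toy_free_conn(&conns[i]))
            ok = false;
    if (conns_first)
        toy_destroy_workqueue(wq);
    return ok;
}
```

    Calling toy_stop() with conns_first set to true lets every
    connection cancel its work while the queue is still valid; with
    false, the first connection that has queued work reports the
    failure.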
     

08 Oct, 2016

4 commits