18 Dec, 2012

40 commits

  • With this change, the aoe driver treats the value zero as special for
    the aoe_deadsecs module parameter. Normally, this value specifies the
    number of seconds during which the driver will continue to attempt
    retransmits to an unresponsive AoE target. After aoe_deadsecs has
    elapsed, the aoe driver marks the aoe device as "down" and fails all
    I/O.

    The new meaning of an aoe_deadsecs of zero is for the driver to
    retransmit commands indefinitely.

    Signed-off-by: Ed Cashin
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
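
The aoe_deadsecs rule described in the commit above can be sketched as a small predicate. This is a minimal illustration, not the driver's code: the function name and the `silent_secs` parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: decide whether a target that has been
 * unresponsive for `silent_secs` seconds should be marked "down". */
bool aoe_sketch_should_fail(unsigned int deadsecs, unsigned int silent_secs)
{
	if (deadsecs == 0)	/* zero is special: retransmit indefinitely */
		return false;
	return silent_secs >= deadsecs;
}
```

With a nonzero setting the device fails once the silence reaches the limit; with zero it never does.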
     
  • Many AoE targets have four or fewer network ports, but some existing
    storage devices have many, and the AoE protocol sets no limit.

    This patch allows the use of more than eight remote MAC addresses per AoE
    target, while reducing the amount of memory used by the aoe driver in
    cases where there are many AoE targets with fewer than eight MAC addresses
    each.

    Signed-off-by: Ed Cashin
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • This change avoids a race that could result in a NULL pointer dereference
    following a WARNing from kobject_add_internal, "don't try to register
    things with the same name in the same directory."

    The problem was found with a test that forgets and discovers an
    aoe device in a loop:

    while test ! -r /tmp/stop; do
        aoe-flush -a
        aoe-discover
    done

    The race arose because aoedev_flush took aoedevs out of the devlist,
    allowing a new discovery of the same AoE target to take place before the
    driver got around to calling sysfs_remove_group. Fixing that one
    revealed another race between do_open and add_disk, and this patch avoids
    that, too.

    The fix required some care, because for flushing (forgetting) an aoedev,
    some of the steps must be performed under lock and some must be able to
    sleep. Also, for discovering a new aoedev, some steps might sleep.

    The check for a bad aoedev pointer remains from a time when about half of
    this patch was done, and it was possible for the
    bdev->bd_disk->private_data to become corrupted. The check should be
    removed eventually, but it is not expected to add significant overhead,
    occurring in the aoeblk_open routine.

    Signed-off-by: Ed Cashin
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • An AoE target can have multiple network ports used for AoE, and in the
    aoe driver, those are tracked by the aoetgt struct. These changes allow
    the aoe driver to handle network paths, or aoetgts, that are not working
    well, compared to the others.

    Paths that do not get responses despite the retransmission of AoE
    commands are marked as "tainted", and non-tainted paths are preferred.

    Meanwhile, the aoe driver attempts to "probe" the tainted path in the
    background by issuing reads of LBA 0 that are padded out to full
    (possibly jumbo-frame) size. If the probes get responses, then the path
    is "redeemed", and its taint is removed.

    This mechanism has been shown to be helpful in transparently handling
    and recovering from real-world network "brown outs" in ways that the
    earlier "shoot the help-needing target in the head" mechanism could not.

    Signed-off-by: Ed Cashin
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • The value returned by the static minor device number allocator is
    the real minor number, so it must be multiplied by the supported number
    of partitions per aoedev.

    Without this fix the support for systems without udev is incomplete, and
    the few users of aoe on such systems will have surprising results when
    device node names do not match the AoE target.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
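
The multiplication mentioned in the commit above amounts to the following arithmetic. This is a sketch under assumptions: the value 16 for the number of partitions per aoedev and the names used here are illustrative, not taken from the patch.

```c
#include <assert.h>

/* Assumption: 16 partitions are reserved per aoedev. */
enum { SKETCH_PARTITIONS = 16 };

/* Scale the value handed out by the allocator (one per aoedev) so that
 * each device gets its own block of partition minors. */
unsigned long sketch_first_minor(unsigned long allocated)
{
	return allocated * SKETCH_PARTITIONS;
}
```

Under that assumption, allocator value 18 maps to system minor 288, leaving 288 through 303 for the device and its partitions.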
     
  • Because the minor_get and related functions use the return values for
    errors, the compiler doesn't know that sysminor will always either 1) be
    initialized in aoedev_by_aoeaddr by the call to minor_get, or 2) be
    unused as the "goto out" is executed.

    This patch avoids the compiler warning.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • For some special-purpose systems where udev isn't present, static
    allocation of minor numbers is desirable. This update distinguishes
    different failure scenarios, to help the user understand what went
    wrong.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • There is no need to call the request handler function in the I/O
    completion routine. The user impact of not doing it is a more "nice" aoe
    driver that is less susceptible to causing soft lockups.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • A misplaced comment was attached to the nout member of the aoetgt. This
    change corrects the comment.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • The aoe driver will never be waiting for more than aoe_maxout AoE
    commands from a given remote network port on an AoE target. Increasing
    the cap increases performance. Users can tighten the setting to reduce
    the amount of memory used for handling AoE traffic or the network
    bandwidth used for AoE.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • Before the aoe driver was an I/O request handler, it was a
    make_request-style block driver. Even so, there was a problem where
    sysfs expected a request queue to exist, so one was provided in commit
    7135a71b19be ("aoe: allocate unused request_queue for sysfs").

    During the transition to the request-handler style, a patch was merged
    that was based on a driver without the noop queue, and the noop queue
    remained in place after the patch was merged, even though a new
    functional queue was introduced by the patch, allocated through
    blk_init_queue.

    The user impact is a memory leak proportional to the number of AoE
    targets discovered. This patch removes the memory leak and cleans up
    vestiges of the old do-nothing queue from the aoeblk_gdalloc function.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • Commit f3b8e07af774 ("aoe: commands in retransmit queue use new
    destination on failure") omits the copying of the coarse-grained time
    when an AoE command was sent during the failover from one destination
    MAC address on the AoE target to another.

    The coarse-grained timing is only used when the system time changes or
    an unlikely length of time has passed since the sending of the AoE
    command. Users will not be impacted unless their system clock is very
    inaccurate or something unusual (e.g., 10 GbE link reset) happens during
    the period when the aoe driver is handling the failure of a port on the
    AoE target. Being affected will mean that an AoE target could be
    considered "down" too eagerly.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • When one remote MAC address isn't working as a destination for AoE
    commands, the frames used to track information associated with the AoE
    commands are moved to a new aoetgt (defined by the tuple of {AoE major,
    AoE minor, target MAC address}).

    This patch makes sure that the frames on the queue for retransmits that
    need to be done are updated to use the new destination, so that
    retransmits will be sent through a working network path.

    Without this change, packets on the retransmit queue will be needlessly
    retransmitted to the unresponsive destination MAC, possibly causing
    premature target failure before there's time for the retransmit timer to
    run again, decide to retransmit again, and finally update the destination
    to a working MAC address on the AoE target.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • These changes improve the accuracy of the decision about whether it's time
    to retransmit an AoE command by using the microsecond-resolution
    gettimeofday instead of jiffies.

    Because the system time can jump suddenly, the decision reverts to using
    jiffies if the high-resolution time difference is relatively large.
    Otherwise the AoE targets could be considered failed inappropriately.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
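
The fallback described in the commit above can be sketched in userspace terms. Everything here is hypothetical: the one-second threshold, the function name, and the representation of the coarse clock as a tick count with a fixed tick length.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical threshold: trust the fine clock only for gaps up to 1 s. */
#define SKETCH_MAX_TRUSTED_USEC 1000000LL

/* Elapsed time since a command was sent. Prefer the microsecond clock,
 * but fall back to the coarse tick counter when the fine-grained
 * difference looks implausible. */
int64_t sketch_elapsed_usec(int64_t sent_usec, int64_t now_usec,
			    uint64_t sent_ticks, uint64_t now_ticks,
			    int64_t usec_per_tick)
{
	int64_t fine = now_usec - sent_usec;

	/* Negative or very large: the system time probably jumped. */
	if (fine < 0 || fine > SKETCH_MAX_TRUSTED_USEC)
		return (int64_t)(now_ticks - sent_ticks) * usec_per_tick;
	return fine;
}
```

The coarse path loses resolution but cannot be fooled by a clock step, which is the property the commit message relies on.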
     
  • With this bugfix in place the calculation of the criterion for "lateness"
    is performed under lock. Without the lock, there is a chance that one of
    the non-atomic operations performed on the round trip time statistics
    could be incomplete, such that an incorrect lateness criterion would be
    calculated.

    Without this change, the effect of the bug would be rare, unnecessary but
    benign retransmissions.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • The /dev/etherd/err character device provides low-level information about
    normal but sometimes interesting AoE command retransmits and "unexpected
    responses", i.e., responses for packets that have already been
    retransmitted.

    This change adds MAC addresses to the messages about unexpected responses,
    so that when they occur, it's easier to determine the network paths to
    which they belong.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • The aoe driver already had some congestion handling, but it was limited in
    its ability to cope with the kind of congestion that can arise on more
    complex networks such as those involving paths through multiple ethernet
    switches.

    Some of the lessons from TCP's history of development can be applied to
    improving the congestion control and avoidance on AoE storage networks.
    These changes use familiar concepts from Van Jacobson's "Congestion
    Avoidance and Control" paper from '88, without adding significant
    overhead.

    This patch depends on an upcoming patch that covers the failover case when
    AoE commands being retransmitted are transferred from one retransmit queue
    to another. Another upcoming patch increases the timing accuracy.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
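
For readers unfamiliar with the paper cited in the commit above, its round trip time estimator can be sketched as follows. This is the textbook form with gains of 1/8 and 1/4, not the aoe driver's implementation; the struct, the function names, and the "average plus four deviations" timeout (common in later TCP practice) are assumptions for illustration.

```c
#include <assert.h>

struct sketch_rtt {
	long avg;	/* smoothed round trip time, microseconds */
	long dev;	/* smoothed mean deviation, microseconds */
};

/* Fold one measured round trip time into the running estimates. */
void sketch_rtt_update(struct sketch_rtt *r, long measured)
{
	long err = measured - r->avg;

	r->avg += err / 8;		/* gain 1/8 on the average */
	if (err < 0)
		err = -err;
	r->dev += (err - r->dev) / 4;	/* gain 1/4 on the deviation */
}

/* A response is "late" once this much time has passed. */
long sketch_rtt_timeout(const struct sketch_rtt *r)
{
	return r->avg + 4 * r->dev;
}
```

Because the deviation term grows when measurements become erratic, the timeout backs off on a congested path without any fixed, hand-tuned limit.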
     
  • Make the aoe driver follow expected behavior when the user uses ioctl to
    get the ATA device identify information, allowing access to model, serial
    number, etc.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • The userland aoetools package includes an "aoe-stat" command that can
    display a "payload size" column when the aoe driver exports this
    information. Users can quickly see what amount of user data is
    transferred inside each AoE command on the network, network headers
    excluded.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • The GPFS filesystem is an example of an aoe user that requires the aoe
    driver to support I/O request sizes larger than the default. Most users
    will not need large I/O request sizes, because they would need to be split
    up into multiple AoE commands anyway.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • Users sometimes want to cause the aoe driver to forget a particular
    previously discovered device when it is no longer online. The aoetools
    provide an "aoe-flush" command that users run to perform this
    administrative task. The changes below provide the support needed in the
    driver.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • The ATA over Ethernet config query response contains a "buffer count"
    field reflecting the AoE target's capacity to buffer incoming AoE
    commands.

    By taking the current value of this field into account, we increase
    throughput or avoid network congestion when the value has increased or
    decreased, respectively.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • Dropped transmits are not common, but when they do occur, increasing
    the transmit queue length often helps.

    Signed-off-by: Ed Cashin
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • The context feature of sparse is used with the Linux kernel sources to
    check for imbalanced uses of locks. Document the annotations defined in
    include/linux/compiler.h that tell sparse what to expect when a lock is
    held on function entry, exit, or both.

    Signed-off-by: Ed Cashin
    Reviewed-by: Josh Triplett
    Acked-by: Christopher Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • linux/compiler.h has macros to denote functions that acquire or release
    locks, but not to denote functions called with a lock held that return
    with the lock still held. Add a __must_hold macro to cover that case.

    Signed-off-by: Josh Triplett
    Reported-by: Ed Cashin
    Tested-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Triplett
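
The three lock annotations discussed in the two commits above can be illustrated with a toy lock. This is a sketch: the macro definitions mirror what I understand linux/compiler.h to provide when sparse defines __CHECKER__, and the counter type and function names are hypothetical.

```c
#include <assert.h>

#ifdef __CHECKER__	/* defined when sparse is doing the checking */
# define __acquires(x)	__attribute__((context(x,0,1)))
# define __releases(x)	__attribute__((context(x,1,0)))
# define __must_hold(x)	__attribute__((context(x,1,1)))
# define __acquire(x)	__context__(x,1)
# define __release(x)	__context__(x,-1)
#else			/* an ordinary compiler sees nothing */
# define __acquires(x)
# define __releases(x)
# define __must_hold(x)
# define __acquire(x)	(void)0
# define __release(x)	(void)0
#endif

struct sketch_counter {
	int locked;	/* toy stand-in for a real lock */
	long value;
};

/* Entered without the lock, returns holding it. */
void sketch_lock(struct sketch_counter *c) __acquires(c)
{
	assert(!c->locked);
	c->locked = 1;
	__acquire(c);
}

/* Entered holding the lock, returns without it. */
void sketch_unlock(struct sketch_counter *c) __releases(c)
{
	assert(c->locked);
	c->locked = 0;
	__release(c);
}

/* Entered holding the lock and returns still holding it. */
void sketch_bump(struct sketch_counter *c) __must_hold(c)
{
	assert(c->locked);
	c->value++;
}
```

The runtime asserts only make the toy self-checking; the point of the annotations is that sparse can flag an unbalanced caller statically.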
     
  • Since commit 1cdcbec1a337 ("CRED: Neuter sys_capset()")
    is_container_init() has no callers.

    Signed-off-by: Gao feng
    Cc: David Howells
    Acked-by: Serge Hallyn
    Cc: James Morris
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gao feng
     
  • To avoid an explosion of request_module calls on a chain of abusive
    scripts, fail maximum recursion with -ELOOP instead of -ENOEXEC. As soon
    as maximum recursion depth is hit, the error will fail all the way back
    up the chain, aborting immediately.

    This also has the side-effect of stopping the user's shell from attempting
    to reexecute the top-level file as a shell script. As seen in the
    dash source:

    if (cmd != path_bshell && errno == ENOEXEC) {
            *argv-- = cmd;
            *argv = cmd = path_bshell;
            goto repeat;
    }

    The above logic was designed for running scripts automatically that lacked
    the "#!" header, not to re-try failed recursion. On a legitimate -ENOEXEC,
    things continue to behave as the shell expects.

    Additionally, when tracking recursion, the binfmt handlers should not be
    involved. The recursion being tracked is the depth of calls through
    search_binary_handler(), so that function should be exclusively responsible
    for tracking the depth.

    Signed-off-by: Kees Cook
    Cc: halfdog
    Cc: P J P
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
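
The depth tracking described in the commit above can be modeled with a toy recursion. This is a sketch only: the limit of 5 and the `hops` model of chained interpreters are illustrative, and the function is not search_binary_handler().

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_MAX_DEPTH 5	/* illustrative limit */

/* Toy model: `hops` interpreter indirections remain before a real
 * binary is reached. Only this function tracks the depth, mirroring
 * the idea that one function should be exclusively responsible. */
int sketch_exec(int hops, int depth)
{
	if (depth > SKETCH_MAX_DEPTH)
		return -ELOOP;	/* propagates straight back up the chain */
	if (hops == 0)
		return 0;	/* a directly executable image was found */
	return sketch_exec(hops - 1, depth + 1);
}
```

Because the error is -ELOOP rather than -ENOEXEC, a shell using logic like the dash snippet quoted above would not retry the file as a shell script.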
     
  • We display a list of supplementary groups for each process in
    /proc/<pid>/status. However, we show only the first 32 groups, not all of
    them.

    Although this is rare, processes sometimes do have more than 32
    supplementary groups, and this kernel limitation breaks user-space apps
    that rely on the group list in /proc/<pid>/status.

    Number 32 comes from the internal NGROUPS_SMALL macro which defines the
    length for the internal kernel "small" groups buffer. There is no
    apparent reason to limit to this value.

    This patch removes the 32 groups printing limit.

    The Linux kernel limits the number of supplementary groups to NGROUPS_MAX,
    which is currently set to 65536. This is therefore the maximum number of
    groups we may possibly print.

    Signed-off-by: Artem Bityutskiy
    Acked-by: Serge E. Hallyn
    Acked-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Artem Bityutskiy
     
  • It is currently impossible to examine the state of seccomp for a given
    process. While attaching with gdb and attempting "call
    prctl(PR_GET_SECCOMP,...)" will work in some situations, it is not
    reliable. If the process is in seccomp mode 1, this query will kill the
    process (prctl not allowed). If the process is in mode 2 with prctl not
    allowed, it will similarly be killed. In weird cases, if prctl is
    filtered to return errno 0, it can look like seccomp is disabled.

    When reviewing the state of running processes, there should be a way to
    externally examine the seccomp mode. ("Did this build of Chrome end up
    using seccomp?" "Did my distro ship ssh with seccomp enabled?")

    This adds the "Seccomp" line to /proc/$pid/status.

    Signed-off-by: Kees Cook
    Reviewed-by: Cyrill Gorcunov
    Cc: Andrea Arcangeli
    Cc: James Morris
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
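
Reading the new line from outside the process could look like the sketch below. It works on the text of a status file already read into memory; the function name is hypothetical, and the -1 return for a missing line is a choice made here, not kernel behavior.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the Seccomp mode found in the text of a status file, or -1 if
 * the line is absent (for example on a kernel without this change). */
int sketch_seccomp_mode(const char *status_text)
{
	const char *line = status_text;
	int mode;

	assert(status_text != NULL);
	while (line && *line) {
		/* The space in the format matches the tab after the colon. */
		if (sscanf(line, "Seccomp: %d", &mode) == 1)
			return mode;
		line = strchr(line, '\n');	/* advance to the next line */
		if (line)
			line++;
	}
	return -1;
}
```

Unlike the gdb approach described in the commit message, this never runs code inside the inspected process, so it cannot trip the filter it is trying to observe.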
     
  • During c/r sessions we've found that there is no way at the moment to
    fetch some VMA associated flags, such as mlock() and madvise().

    This leads us to a problem -- we don't know if we should call for mlock()
    and/or madvise() after restore on the vma area we're bringing back to
    life.

    This patch introduces a new field into "smaps" output called VmFlags,
    where all set flags associated with the particular VMA are shown as
    two-letter mnemonics.

    [ Strictly speaking for c/r we only need mlock/madvise bits but it has been
    said that providing just a few flags looks somehow inconsistent. So all
    flags are here now. ]

    This feature is made available on CONFIG_CHECKPOINT_RESTORE=n kernels, as
    other applications may start to use these fields.

    The data is encoded in a somewhat awkward two-letter mnemonic form, to
    encourage userspace to be prepared for fields being added or removed in
    the future.

    [a.p.zijlstra@chello.nl: props to use for_each_set_bit]
    [sfr@canb.auug.org.au: props to use array instead of struct]
    [akpm@linux-foundation.org: overall redesign and simplification]
    [akpm@linux-foundation.org: remove unneeded braces per sfr, avoid using bloaty for_each_set_bit()]
    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
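
A restorer that wants to know whether to call mlock() again might test one VmFlags line as sketched below. The function name is hypothetical, and the mnemonic strings used in the usage note ("lo" for a locked area, "rd" for readable) are my understanding of the kernel's proc documentation rather than something stated in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Check whether a "VmFlags:" line from smaps lists the given two-letter
 * mnemonic. Unknown mnemonics are simply skipped, so flags added by
 * later kernels do not break the parser. */
bool sketch_has_vmflag(const char *line, const char *mnemonic)
{
	static const char prefix[] = "VmFlags:";
	const char *p;

	assert(strlen(mnemonic) == 2);
	if (strncmp(line, prefix, sizeof(prefix) - 1) != 0)
		return false;
	p = line + sizeof(prefix) - 1;
	while (*p) {
		while (*p == ' ' || *p == '\t' || *p == '\n')
			p++;
		if (*p == '\0')
			break;
		/* A token matches only if it is exactly two characters. */
		if (p[0] == mnemonic[0] && p[1] == mnemonic[1] &&
		    (p[2] == ' ' || p[2] == '\n' || p[2] == '\0'))
			return true;
		while (*p && *p != ' ' && *p != '\t' && *p != '\n')
			p++;
	}
	return false;
}
```

For a line such as "VmFlags: rd wr mr mw me lo ac", a query for "lo" succeeds, while the same query on a line without that token fails.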
     
  • Without this patch it is really hard to interpret a bounding set if
    CAP_LAST_CAP is unknown for the current kernel.

    Non-existent capabilities cannot be deleted from a bounding set with the
    help of prctl.

    For example, here are two examples, without and with this patch:

    CapBnd: ffffffe0fdecffff
    CapBnd: 00000000fdecffff

    I suggest hiding non-existent capabilities. Here are two reasons:
    * It is more logical and easier to use.
    * It helps to checkpoint-restore the capabilities of tasks, because tasks
      can be restored on another kernel, where CAP_LAST_CAP is bigger.

    Signed-off-by: Andrew Vagin
    Cc: Andrew G. Morgan
    Reviewed-by: Serge E. Hallyn
    Cc: Pavel Emelyanov
    Reviewed-by: Kees Cook
    Cc: KAMEZAWA Hiroyuki
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Vagin
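
The effect on the two CapBnd values quoted in the commit above can be reproduced with a simple mask. This is a sketch, not the kernel's code: the function name is hypothetical, and the usage note assumes the highest capability number on the kernel that produced the example was 36.

```c
#include <assert.h>
#include <stdint.h>

/* Clear every bit above `last_cap` in a 64-bit capability set. */
uint64_t sketch_mask_caps(uint64_t set, unsigned int last_cap)
{
	assert(last_cap < 64);
	if (last_cap == 63)
		return set;	/* avoid an undefined 64-bit shift */
	return set & ((UINT64_C(1) << (last_cap + 1)) - 1);
}
```

Under that assumption, masking ffffffe0fdecffff yields 00000000fdecffff, matching the second line of the example.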
     
  • Ptrace jailers want to be sure that the tracee can never escape
    from their control. However, if the tracer dies unexpectedly, the
    tracee continues to run in a potentially unsafe mode.

    Add the new ptrace option PTRACE_O_EXITKILL. If the tracer exits
    it sends SIGKILL to every tracee which has this bit set.

    Note that the new option is not equal to the last-option << 1, because
    currently all options have an event, and the new one starts the eventless
    group. It uses the arbitrarily chosen bit 20, so we have room for 12 more
    events, but we can also add new eventless options below this one.

    Suggested by Amnon Shiloh.

    Signed-off-by: Oleg Nesterov
    Tested-by: Amnon Shiloh
    Cc: Denys Vlasenko
    Cc: Michael Kerrisk
    Cc: Serge Hallyn
    Cc: Chris Evans
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Update the documentation for simple_strto* to reflect that it has been
    obsoleted and advise the usage of kstrto*.

    Signed-off-by: Eldad Zack
    Cc: J. Bruce Fields
    Cc: Joe Perches
    Cc: Randy Dunlap
    Cc: Alexey Dobriyan
    Cc: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eldad Zack
     
  • As Bruce Fields pointed out, kstrto* is currently lacking kerneldoc
    comments. This patch adds kerneldoc comments to common variants of
    kstrto*: kstrto(u)l, kstrto(u)ll and kstrto(u)int.

    Signed-off-by: Eldad Zack
    Cc: J. Bruce Fields
    Cc: Joe Perches
    Cc: Randy Dunlap
    Cc: Alexey Dobriyan
    Cc: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eldad Zack
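
The difference in contract between the two families discussed in the commits above can be approximated in userspace. This sketch is built on strtol and is not the kernel implementation; the function name is hypothetical, and unlike the kernel helpers it inherits strtol's skipping of leading whitespace.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Userspace analogue of the kstrto* contract: the whole string must be
 * a number (one trailing newline is tolerated), and failures are
 * reported through the return value instead of being ignored. */
int sketch_strtoint(const char *s, int base, int *res)
{
	char *end;
	long val;

	assert(s != NULL && res != NULL);
	errno = 0;
	val = strtol(s, &end, base);
	if (end == s)
		return -EINVAL;		/* no digits at all */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage such as "12abc" */
	if (errno == ERANGE || val < INT_MIN || val > INT_MAX)
		return -ERANGE;		/* does not fit in an int */
	*res = (int)val;
	return 0;
}
```

A simple_strto*-style parser would accept "12abc" as 12 and give the caller no overflow indication, which is the behavior the documentation update steers people away from.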
     
  • keys-ecryptfs.txt was missing from 00-INDEX.

    Signed-off-by: Jarkko Sakkinen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jarkko Sakkinen