12 Sep, 2013

40 commits

  • When the example udev rules in the documentation are used without
    modification, warnings like the one shown below appear in the system logs:

    /var/log/messages:Aug 22 11:09:11 kung udevd[445]: NAME="%k" \
    is superfluous and breaks kernel supplied names, please remove \
    it from /etc/udev/rules.d/60-aoe.rules:26

    Removing the term does not cause any problems with the creation of the
    special character and block device nodes.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • If the system has trouble allocating memory for the creation of the aoe
    debugfs directory or of a file inside it, the debugfs member of an aoedev
    can be NULL.

    Do not treat a NULL debugfs pointer as a BUG on aoedev shutdown, avoiding
    the user impact of an unnecessary panic.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • This patch fixes the following compiler warnings:

    drivers/block/aoe/aoecmd.c: In function `aoecmd_ata_rw':
    drivers/block/aoe/aoecmd.c:383:17: warning: variable `t' set but not used [-Wunused-but-set-variable]
    struct aoetgt *t;
    ^
    drivers/block/aoe/aoecmd.c: In function `resend':
    drivers/block/aoe/aoecmd.c:488:21: warning: variable `ah' set but not used [-Wunused-but-set-variable]
    struct aoe_atahdr *ah;
    ^

    Signed-off-by: Andy Shevchenko
    Cc: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • In the kernel we have a nice helper that may be used here. This patch
    replaces the custom implementation with a call to that helper.

    Signed-off-by: Andy Shevchenko
    Cc: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • This information is presented in a compact format that has evolved for
    easy routine scanning by expert humans, mostly developers and support
    technicians helping to troubleshoot or test AoE-based systems.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • The place holder in the file contents is filled out in the following
    patch.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • This series adds the debugging information that the coraid.com-distributed
    aoe driver exports via sysfs, but instead of sysfs, it uses debugfs.

    With these patches applied, even without AoE targets on the network, KEDR
    reports new possible memory leaks, but these are from callers outside the
    aoe driver that have used aoe_devnode to get the name of the character
    devices through the aoe_class->devnode callback, and I believe they're
    responsible for freeing that memory.

    This patch:

    Create and destroy the debugfs directory.

    Signed-off-by: Ed Cashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ed Cashin
     
  • Signed-off-by: Cody P Schafer
    Reviewed-by: Seth Jennings
    Cc: David Woodhouse
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • There is no reason to require the rbtree test code to be a module, so
    allow it to be built in (this streamlines my development process).

    Signed-off-by: Cody P Schafer
    Reviewed-by: Seth Jennings
    Cc: David Woodhouse
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Just check that we examine all nodes in the tree for the postorder
    iteration.

    Signed-off-by: Cody P Schafer
    Reviewed-by: Seth Jennings
    Cc: David Woodhouse
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Because deletion (of the entire tree) is a relatively common use of the
    rbtree_postorder iteration, and because doing it safely means fiddling
    with temporary storage, provide a helper to simplify postorder rbtree
    iteration.

    Signed-off-by: Cody P Schafer
    Reviewed-by: Seth Jennings
    Cc: David Woodhouse
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Postorder iteration yields all of a node's children prior to yielding the
    node itself, and this particular implementation also avoids examining the
    leaf links in a node after that node has been yielded.

    In what I expect will be its most common usage, postorder iteration allows
    the deletion of every node in an rbtree without modifying the rbtree nodes
    (no _requirement_ that they be nulled) while avoiding referencing child
    nodes after they have been "deleted" (most commonly, freed).

    I have only updated zswap to use this functionality at this point, but
    numerous bits of code (most notably in the filesystem drivers) use a hand
    rolled postorder iteration that NULLs child links as it traverses the
    tree. Each of those instances could be replaced with this common
    implementation.

    1 & 2 add rbtree postorder iteration functions.
    3 adds testing of the iteration to the rbtree runtime tests
    4 allows building the rbtree runtime tests as builtins
    5 updates zswap.

    This patch:

    Add postorder iteration functions for rbtree. These are useful for safely
    freeing an entire rbtree without modifying the tree at all.
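
    As a rough illustration of the idea, here is a minimal userspace sketch.
    It assumes a plain binary tree with parent pointers rather than the
    kernel's struct rb_node, and every name in it (struct node, bst_insert,
    first_postorder, next_postorder, free_tree) is hypothetical and not the
    interface added by this patch. The point it tries to show is that the
    next node can be computed from the current node's parent alone, so the
    already-freed children are never touched.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an rbtree node; colour handling is omitted. */
struct node {
    struct node *parent, *left, *right;
    int key;
};

/* Plain unbalanced insert, only here so that a tree can be built. */
struct node *bst_insert(struct node *root, int key)
{
    struct node *n = calloc(1, sizeof(*n));
    struct node **link = &root, *parent = NULL;

    if (!n)
        return root;
    n->key = key;
    while (*link) {
        parent = *link;
        link = key < parent->key ? &parent->left : &parent->right;
    }
    n->parent = parent;
    *link = n;
    return root;
}

/* Descend to a leaf, preferring left links over right links. */
static struct node *left_deepest(struct node *n)
{
    for (;;) {
        if (n->left)
            n = n->left;
        else if (n->right)
            n = n->right;
        else
            return n;
    }
}

struct node *first_postorder(struct node *root)
{
    return root ? left_deepest(root) : NULL;
}

/* Only n->parent and the parent's links are read, never n's children. */
struct node *next_postorder(const struct node *n)
{
    struct node *parent;

    if (!n)
        return NULL;
    parent = n->parent;
    /* Left child with a right sibling: visit the sibling's subtree next. */
    if (parent && n == parent->left && parent->right)
        return left_deepest(parent->right);
    /* Otherwise both subtrees of the parent are done: parent is next. */
    return parent;
}

/* Free every node without modifying any link; returns the node count. */
size_t free_tree(struct node *root)
{
    size_t freed = 0;
    struct node *n = first_postorder(root), *next;

    while (n) {
        next = next_postorder(n);   /* fetch the successor before freeing */
        free(n);
        freed++;
        n = next;
    }
    return freed;
}
```

    The successor is fetched before the current node is freed, which is the
    temporary storage a "safe" iteration helper would hide from the caller.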

    Signed-off-by: Cody P Schafer
    Reviewed-by: Seth Jennings
    Cc: David Woodhouse
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Cc: Davidlohr Bueso
    Cc: Karel Zak
    Cc: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Trivial coding style cleanups - still plenty left.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • I love emacs, but these settings for coding style are annoying when trying
    to open the efi.h file. More important, we already have checkpatch for
    that.

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • When verifying GPT header integrity, make sure that first usable LBA is
    smaller than last usable LBA.
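
    A minimal sketch of such a check, in userspace C. The struct and
    function names are hypothetical, the fields are assumed to be already
    converted to host byte order, and the additional bounds checks against
    the last LBA of the disk are an assumption added for context. The
    comparison is strict, following the wording above; whether equality
    should be accepted is not settled by this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the GPT header, in host byte order. */
struct gpt_usable_range_sketch {
    uint64_t first_usable_lba;
    uint64_t last_usable_lba;
};

bool usable_range_is_sane(const struct gpt_usable_range_sketch *h,
                          uint64_t last_disk_lba)
{
    /* Assumed context checks: both LBAs must lie on the disk. */
    if (h->first_usable_lba > last_disk_lba)
        return false;
    if (h->last_usable_lba > last_disk_lba)
        return false;
    /* The check described above: first must be smaller than last. */
    return h->first_usable_lba < h->last_usable_lba;
}
```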

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The partition that has the 0xEE (GPT protective), must have the size in
    lba field set to the lesser of the size of the disk minus one or
    0xFFFFFFFF for larger disks.
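
    The rule can be written down as a small helper. This is only a sketch:
    the function name is hypothetical, and treating a zero-sector disk as
    size 0 is an assumption made to avoid underflow.

```c
#include <stdint.h>

/* Expected size_in_lba of the 0xEE entry for a disk of total_sectors. */
uint32_t pmbr_expected_size(uint64_t total_sectors)
{
    uint64_t sz;

    if (total_sectors == 0)     /* assumption: nothing to protect */
        return 0;
    sz = total_sectors - 1;     /* the disk minus the MBR sector itself */
    return sz > 0xFFFFFFFFu ? 0xFFFFFFFFu : (uint32_t)sz;
}
```

    For a 2048-sector disk this yields 2047; any disk with more than 2^32
    sectors saturates at 0xFFFFFFFF.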

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • One of the biggest problems with GPT is compatibility with older, non-GPT
    systems. The problem is addressed by creating hybrid mbrs, an extension,
    or variant, of the traditional protective mbr. This contains, apart from
    the 0xEE partition, up to three additional primary partitions that point to
    the same space marked by up to three GPT partitions. The result is that
    legacy OSs can see the three required MBR partitions and at the same time
    ignore the GPT-aware partitions that protect the GPT structures.

    While hybrid MBRs are hacks, workarounds and simply not part of the GPT
    standard, they do exist and we have no way around them. For instance, by
    default, OSX creates a hybrid scheme when using multi-OS booting.

    In order for Linux to properly discover protective MBRs, it must be made
    aware of devices that have hybrid MBRs. No functionality is changed by
    this patch, just a debug message informing the user of the MBR scheme that
    is being used.
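
    One way to picture the distinction is the sketch below. It is not the
    kernel code: the enum, the function name and the simplification of
    looking only at the four partition type bytes are assumptions made for
    illustration.

```c
#include <stdint.h>

#define SKETCH_EFI_PMBR_OSTYPE 0xEE  /* GPT protective partition type */

enum mbr_scheme_sketch {
    SKETCH_MBR_NONE,        /* no 0xEE entry at all */
    SKETCH_MBR_PROTECTIVE,  /* 0xEE entry and nothing else */
    SKETCH_MBR_HYBRID       /* 0xEE entry plus other primary partitions */
};

/* os_type holds the type byte of each of the four primary MBR entries. */
enum mbr_scheme_sketch classify_mbr(const uint8_t os_type[4])
{
    int has_ee = 0, has_other = 0;

    for (int i = 0; i < 4; i++) {
        if (os_type[i] == SKETCH_EFI_PMBR_OSTYPE)
            has_ee = 1;
        else if (os_type[i] != 0x00)    /* 0x00 marks an unused entry */
            has_other = 1;
    }
    if (!has_ee)
        return SKETCH_MBR_NONE;
    return has_other ? SKETCH_MBR_HYBRID : SKETCH_MBR_PROTECTIVE;
}
```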

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • When detecting a valid protective MBR, the Linux kernel isn't picky about
    the partition (1-4) the 0xEE is at, but, unlike other operating systems,
    it does require it to begin at the second sector (sector 1). This check,
    apart from it not being enforced by UEFI, and causing Linux to potentially
    fail to detect any *valid* partitions on the disk, can present problems
    when dealing with hybrid MBRs[1].

    For compatibility reasons, if the first partition is hybridized, the 0xEE
    partition must be small enough to ensure that it only protects the GPT
    data structures - as opposed to the whole disk in a protective MBR.
    This problem is very well described by Rod Smith[1]: MBR-only
    partitioning programs (such as older versions of fdisk) can see some of
    the disk space as unallocated, thus defeating the purpose of the 0xEE
    partition's protection of GPT data structures.

    By dropping this check, this patch enables Linux to be more flexible when
    probing for GPT disklabels.

    [1] http://www.rodsbooks.com/gdisk/hybrid.html#reactions

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Per the UEFI Specs 2.4, June 2013, the starting lba of the partition that
    has the EFI GPT (0xEE) must be set to 0x00000001 - this is obviously the
    LBA of the GPT Partition Header.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The kernel's GPT implementation currently uses the generic 'struct
    partition' type for dealing with legacy MBR partition records. While this
    is useful for disklabels that were designed for CHS addressing, such as
    msdos, it doesn't adapt well to newer standards that use LBA instead, such
    as GUID partition tables. Furthermore, these generic partition structures
    do not have all the required fields to properly follow the UEFI specs.

    While a CHS address can be translated to LBA, it's much simpler and
    cleaner to just replace the partition type. This patch adds a new
    'gpt_record' type that is fully compliant with EFI and will allow later
    patches to add more checks to properly verify a protective MBR,
    which is paramount to probing a device that makes use of GPT.
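
    For orientation, the 16-byte MBR partition record with its LBA fields
    can be sketched in userspace C as follows. The struct and function
    names are hypothetical and the field names are not taken from the
    patch; only the on-disk layout (CHS bytes, type byte, then two
    little-endian 32-bit LBA fields) is the standard MBR one.

```c
#include <stdint.h>

/* Decoded view of one 16-byte MBR partition record (hypothetical names). */
struct mbr_record_sketch {
    uint8_t  boot_indicator;  /* 0x80 marks a bootable entry */
    uint8_t  start_chs[3];    /* legacy CHS start address */
    uint8_t  os_type;         /* partition type, 0xEE for GPT protective */
    uint8_t  end_chs[3];      /* legacy CHS end address */
    uint32_t starting_lba;    /* first sector of the partition */
    uint32_t size_in_lba;     /* partition length in sectors */
};

/* Read a little-endian 32-bit value regardless of host byte order. */
static uint32_t le32_at(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

struct mbr_record_sketch parse_mbr_record(const uint8_t raw[16])
{
    struct mbr_record_sketch r;

    r.boot_indicator = raw[0];
    for (int i = 0; i < 3; i++) {
        r.start_chs[i] = raw[1 + i];
        r.end_chs[i] = raw[5 + i];
    }
    r.os_type = raw[4];
    r.starting_lba = le32_at(raw + 8);
    r.size_in_lba = le32_at(raw + 12);
    return r;
}
```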

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Karel Zak
    Acked-by: Matt Fleming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Modify the s390 copy_oldmem_page() and remap_oldmem_pfn_range() function
    for zfcpdump to read from the HSA memory if memory below HSA_SIZE bytes is
    requested. Otherwise real memory is used.
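
    The dispatch can be pictured with the userspace sketch below. It is not
    the s390 code: the function name, the tiny SKETCH_HSA_SIZE value and the
    two plain byte arrays standing in for HSA and real memory are all
    assumptions, as is the handling of a request that straddles the
    boundary by splitting it.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_HSA_SIZE 16  /* tiny value, for illustration only */

/*
 * Copy count bytes starting at old-memory offset src into dst.
 * Offsets below SKETCH_HSA_SIZE are served from the hsa buffer,
 * everything else from real_mem (indexed by the same offset).
 */
void copy_oldmem_sketch(unsigned char *dst, size_t src, size_t count,
                        const unsigned char *hsa,
                        const unsigned char *real_mem)
{
    if (src < SKETCH_HSA_SIZE) {
        size_t from_hsa = SKETCH_HSA_SIZE - src;

        if (from_hsa > count)
            from_hsa = count;
        memcpy(dst, hsa + src, from_hsa);
        dst += from_hsa;
        src += from_hsa;
        count -= from_hsa;
    }
    if (count)
        memcpy(dst, real_mem + src, count);
}
```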

    Signed-off-by: Michael Holzheu
    Cc: HATAYAMA Daisuke
    Cc: Jan Willeke
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Holzheu
     
  • The patch "s390/vmcore: Implement remap_oldmem_pfn_range for s390" now
    allows mmap to be used on s390 as well.

    So enable mmap for s390 again.

    Signed-off-by: Michael Holzheu
    Cc: HATAYAMA Daisuke
    Cc: Jan Willeke
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Holzheu
     
  • Introduce the s390 specific way to map pages from oldmem. The memory area
    below OLDMEM_SIZE is mapped with offset OLDMEM_BASE. The other old memory
    is mapped directly.

    Signed-off-by: Jan Willeke
    Signed-off-by: Michael Holzheu
    Cc: HATAYAMA Daisuke
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Willeke
     
  • For zfcpdump we can't map the HSA storage because it is only available via
    a read interface. Therefore, for the new vmcore mmap feature we have to
    introduce a new mechanism to create mappings on demand.

    This patch introduces a new architecture function remap_oldmem_pfn_range()
    that should be used to create mappings with remap_pfn_range() for oldmem
    areas that can be directly mapped. For zfcpdump this is everything
    besides the HSA memory. For the areas that are not mapped by
    remap_oldmem_pfn_range(), a new generic vmcore fault handler
    mmap_vmcore_fault() is called.

    This handler works as follows:

    * Get already available or new page from page cache (find_or_create_page)
    * Check if /proc/vmcore page is filled with data (PageUptodate)
    * If yes:
        Return that page
    * If no:
        Fill page using __vmcore_read(), set PageUptodate, and return page
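
    The steps above can be mimicked in userspace as a fill-on-first-use
    cache. This is only an analogy under stated assumptions: the array
    stands in for the page cache, the uptodate flag for PageUptodate, the
    reader callback for __vmcore_read(), and all names (sketch_page,
    fault_page, demo_reader) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096
#define SKETCH_NPAGES    8

struct sketch_page {
    bool uptodate;                          /* set once data is valid */
    unsigned char data[SKETCH_PAGE_SIZE];
};

typedef int (*read_fn)(unsigned char *buf, size_t len, size_t offset);

static struct sketch_page *cache[SKETCH_NPAGES];

/* Return the cached page for index, allocating an empty one if needed. */
static struct sketch_page *find_or_create(size_t index)
{
    if (index >= SKETCH_NPAGES)
        return NULL;
    if (!cache[index])
        cache[index] = calloc(1, sizeof(*cache[index]));
    return cache[index];
}

/* Fault in one page: fill it through reader only on first access. */
struct sketch_page *fault_page(size_t index, read_fn reader)
{
    struct sketch_page *page = find_or_create(index);

    if (!page)
        return NULL;
    if (!page->uptodate) {
        if (reader(page->data, SKETCH_PAGE_SIZE,
                   index * SKETCH_PAGE_SIZE) < 0)
            return NULL;
        page->uptodate = true;
    }
    return page;
}

int demo_reads;  /* counts how often the demo reader actually ran */

/* Demo reader: fills the buffer with the page number of the offset. */
int demo_reader(unsigned char *buf, size_t len, size_t offset)
{
    memset(buf, (int)(offset / SKETCH_PAGE_SIZE), len);
    demo_reads++;
    return 0;
}
```

    A second fault on the same index returns the cached page without
    calling the reader again.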

    Signed-off-by: Michael Holzheu
    Acked-by: Vivek Goyal
    Cc: HATAYAMA Daisuke
    Cc: Jan Willeke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Holzheu
     
  • Exchange the old relocate mechanism with the new arch function call
    override mechanism that allows creating the ELF core header in the 2nd
    kernel.

    Signed-off-by: Michael Holzheu
    Cc: HATAYAMA Daisuke
    Cc: Jan Willeke
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Holzheu
     
  • For s390 we want to use /proc/vmcore for our SCSI stand-alone dump
    (zfcpdump). We have support where the first HSA_SIZE bytes are saved into
    a hypervisor owned memory area (HSA) before the kdump kernel is booted.
    When the kdump kernel starts, it is restricted to use only HSA_SIZE bytes.

    The advantages of this mechanism are:

    * No crashkernel memory has to be defined in the old kernel.
    * Early boot problems (before kexec_load has been done) can be dumped
    * Non-Linux systems can be dumped.

    We modify the s390 copy_oldmem_page() function to read from the HSA memory
    if memory below HSA_SIZE bytes is requested.

    Since we cannot use the kexec tool to load the kernel in this scenario,
    we have to build the ELF header in the 2nd (kdump/new) kernel.

    So with the following patch set we would like to introduce a new
    capability: creating the ELF header for /proc/vmcore in the 2nd
    kernel's memory.

    The following steps are done during zfcpdump execution:

    1. Production system crashes
    2. User boots a SCSI disk that has been prepared with the zfcpdump tool
    3. Hypervisor saves CPU state of boot CPU and HSA_SIZE bytes of memory into HSA
    4. Boot loader loads kernel into low memory area
    5. Kernel boots and uses only HSA_SIZE bytes of memory
    6. Kernel saves registers of non-boot CPUs
    7. Kernel does memory detection for dump memory map
    8. Kernel creates ELF header for /proc/vmcore
    9. /proc/vmcore uses this header for initialization
    10. The zfcpdump user space reads /proc/vmcore to write dump to SCSI disk
        - copy_oldmem_page() copies from HSA for memory below HSA_SIZE
        - copy_oldmem_page() copies from real memory for memory above HSA_SIZE

    Currently for s390 we create the ELF core header in the 2nd kernel with a
    small trick. We relocate the addresses in the ELF header in a way that
    for the /proc/vmcore code it seems to be in the 1st kernel (old) memory
    and the read_from_oldmem() returns the correct data. This allows the
    /proc/vmcore code to use the ELF header in the 2nd kernel.

    This patch:

    Exchange the old mechanism with the new and much cleaner function call
    override feature that now officially allows creating the ELF core header
    in the 2nd kernel.

    To use the new feature the following functions have to be defined
    by the architecture backend code to read from new memory:

    * elfcorehdr_alloc: Allocate ELF header
    * elfcorehdr_free: Free the memory of the ELF header
    * elfcorehdr_read: Read from ELF header
    * elfcorehdr_read_notes: Read from ELF notes

    Signed-off-by: Michael Holzheu
    Acked-by: Vivek Goyal
    Cc: HATAYAMA Daisuke
    Cc: Jan Willeke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Holzheu
     
  • Code can never reach this point, so remove the unnecessary return.

    Signed-off-by: Xishi Qiu
    Suggested-by: Zhang Yanfei
    Reviewed-by: Simon Horman
    Reviewed-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • The error handling and ret-from-loop look confusing and inconsistent.

    - "retval >= 0" simply returns

    - "!bprm->file" returns too but with read_unlock() because
    binfmt_lock was already re-acquired

    - "retval != -ENOEXEC || bprm->mm == NULL" does "break" and
    relies on the same check after the main loop

    Consolidate these checks into a single if/return statement.

    need_retry still checks "retval == -ENOEXEC", but this and -ENOENT before
    the main loop are not needed. They only matter for the pathological and
    impossible list_empty(&formats) case.

    It is not clear why we check "bprm->mm == NULL"; probably this
    should be removed.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Evgeniy Polyakov
    Cc: Zach Levis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • A separate one-liner for better documentation.

    It doesn't make sense to retry if request_module() fails to exec
    /sbin/modprobe, so add the additional "request_module() < 0" check.

    However, this logic still doesn't look exactly right:

    1. It would be better to check "request_module() != 0", the user
    space modprobe process should report the correct exit code.
    But I didn't dare to add the user-visible change.

    2. The whole ENOEXEC logic looks suboptimal. Suppose that we try
    to exec a "#!path-to-unsupported-binary" script. In this case
    request_module() + "retry" will be done twice: first by the
    "depth == 1" code, and then again by the "depth == 0" caller
    which doesn't make sense.

    3. And note that in the case above bprm->buf was already changed
    by load_script()->prepare_binprm(), so this looks even more
    ugly.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Evgeniy Polyakov
    Cc: Zach Levis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • search_binary_handler() uses "for (try=0; try
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Evgeniy Polyakov
    Cc: Zach Levis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • search_binary_handler() checks ->load_binary != NULL for no reason, this
    method should be always defined. Turn this check into WARN_ON() and move
    it into __register_binfmt().

    Also, kill the function pointer. The current code looks confusing, as if
    ->load_binary can go away after read_unlock(&binfmt_lock). But we rely on
    module_get(fmt->module), this fmt can't be changed or unregistered,
    otherwise this code is buggy anyway.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Evgeniy Polyakov
    Cc: Zach Levis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • When search_binary_handler() succeeds it does allow_write_access() and
    fput(), then it clears bprm->file to ensure the caller will not do the
    same.

    We can simply move this code to exec_binprm() which is called only once.
    In fact we could move this to free_bprm() and remove the same code in
    do_execve_common's error path.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Evgeniy Polyakov
    Cc: Zach Levis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • A separate one-liner with the minor fix.

    PROC_EVENT_EXEC reports the "exec" event, but this message is sent at
    least twice if search_binary_handler() is called by ->load_binary()
    recursively, say, load_script().

    Move it to exec_binprm(), this is "depth == 0" code too.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Evgeniy Polyakov
    Cc: Zach Levis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Nobody except search_binary_handler() should touch ->recursion_depth, "int
    depth" buys nothing but complicates the code, kill it.

    Probably we should also kill "fn" and the !NULL check, ->load_binary
    should be always defined. And it can not go away after read_unlock() or
    this code is buggy anyway.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Evgeniy Polyakov
    Cc: Zach Levis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • task_pid_nr_ns() and trace/ptrace code in the middle of the recursive
    search_binary_handler() looks confusing and imho annoying. We only need
    this code if "depth == 0", so let's add a simple helper which calls
    search_binary_handler() and does trace_sched_process_exec() +
    ptrace_event().

    The patch also moves the setting of task->did_exec, we need to do this
    only once.

    Note: we can kill either task->did_exec or PF_FORKNOEXEC.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Evgeniy Polyakov
    Cc: Zach Levis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • proc_fd_permission() says "process can still access /proc/self/fd after it
    has executed a setuid()", but the "task_pid() = proc_pid()" check only
    helps if the task is group leader, /proc/self points to
    /proc/.

    Change this check to use task_tgid() so that the whole thread group can
    access its /proc/self/fd or /proc//fd.

    Notes:
    - CLONE_THREAD does not require CLONE_FILES so task->files
    can differ, but I don't think this can lead to any security
    problem. And this matches same_thread_group() in
    __ptrace_may_access().

    - /proc/self should probably point to /proc/, but
    it is too late to change the rules. Perhaps it makes sense
    to add /proc/thread though.

    Test-case:

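    #include <assert.h>
    #include <dirent.h>
    #include <pthread.h>

    /* Runs in a non-leader thread: opening /proc/self/fd must succeed. */
    void *tfunc(void *arg)
    {
        assert(opendir("/proc/self/fd"));
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        pthread_create(&t, NULL, tfunc, NULL);
        pthread_join(t, NULL);
        return 0;
    }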

    fails if, say, this executable is not readable and suid_dumpable = 0.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov