12 Sep, 2013

40 commits

  • When the rootfs code was a wrapper around ramfs, having them in the same
    file made sense. Now that it can wrap another filesystem type, move it in
    with the init code instead.

    This also allows a subsequent patch to access the rootfstype= command
    line argument.

    Signed-off-by: Rob Landley
    Cc: Jeff Layton
    Cc: Jens Axboe
    Cc: Stephen Warren
    Cc: Rusty Russell
    Cc: Jim Cromie
    Cc: Sam Ravnborg
    Cc: Greg Kroah-Hartman
    Cc: "Eric W. Biederman"
    Cc: Alexander Viro
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Landley
     
  • Even though ramfs hasn't got a backing device, commit e0bf68ddec4f ("mm:
    bdi init hooks") added one anyway, and put the initialization in
    init_rootfs() since that's the first user, leaving it out of init_ramfs()
    to avoid duplication.

    But initmpfs uses init_tmpfs() instead, so move the init into the
    filesystem's init function, add a "once" guard to prevent duplicate
    initialization, and call the filesystem init from rootfs init.
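
    A minimal sketch of the init-once pattern described above (function
    names follow fs/ramfs; details and error handling of the real patch
    are simplified):

    static unsigned long ramfs_once;

    int __init init_ramfs_fs(void)
    {
            int err;

            /* "once" guard: both module init and rootfs init may get here */
            if (test_and_set_bit(0, &ramfs_once))
                    return 0;

            err = bdi_init(&ramfs_backing_dev_info);
            if (err)
                    return err;

            err = register_filesystem(&ramfs_fs_type);
            if (err)
                    bdi_destroy(&ramfs_backing_dev_info);
            return err;
    }

    int __init init_rootfs(void)
    {
            /* rootfs delegates the shared setup to the filesystem init */
            return init_ramfs_fs();
    }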

    This goes part of the way to allowing ramfs to be built as a module.

    [akpm@linux-foundation.org: using bit 1 was odd]
    Signed-off-by: Rob Landley
    Cc: Jeff Layton
    Cc: Jens Axboe
    Cc: Stephen Warren
    Cc: Rusty Russell
    Cc: Jim Cromie
    Cc: Sam Ravnborg
    Cc: Greg Kroah-Hartman
    Cc: "Eric W. Biederman"
    Cc: Alexander Viro
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Landley
     
  • Mounting MS_NOUSER prevents --bind mounts from rootfs. Prevent new rootfs
    mounts with a different mechanism that doesn't affect bind mounts.
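
    Conceptually, the replacement can be sketched as a mount-once guard
    inside rootfs's own ->mount() (a sketch; the actual patch may differ
    in details):

    static struct dentry *rootfs_mount(struct file_system_type *fs_type,
            int flags, const char *dev_name, void *data)
    {
            static unsigned long once;

            /* Only the initial kernel-internal mount succeeds; later
             * attempts fail, while --bind of the existing mount still
             * works because MS_NOUSER is gone. */
            if (test_and_set_bit(0, &once))
                    return ERR_PTR(-ENODEV);

            return mount_nodev(fs_type, flags, data, ramfs_fill_super);
    }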

    Signed-off-by: Rob Landley
    Cc: Jeff Layton
    Cc: Jens Axboe
    Cc: Stephen Warren
    Cc: Rusty Russell
    Cc: Jim Cromie
    Cc: Sam Ravnborg
    Cc: Greg Kroah-Hartman
    Cc: "Eric W. Biederman"
    Cc: Alexander Viro
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Landley
     
  • With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
    one such possible user), the following race can happen:

    radix_tree_preload()
    ...
    radix_tree_insert()
      radix_tree_node_alloc()
        if (rtp->nr) {
          ret = rtp->nodes[rtp->nr - 1];
    <interrupt>
    ...
    radix_tree_preload()
    ...
    radix_tree_insert()
      radix_tree_node_alloc()
        if (rtp->nr) {
          ret = rtp->nodes[rtp->nr - 1];

    And we give out one radix tree node twice. That clearly results in radix
    tree corruption with different results (usually OOPS) depending on which
    two users of radix tree race.

    We fix the problem by making radix_tree_node_alloc() always allocate fresh
    radix tree nodes when in interrupt. Using preloading when in interrupt
    doesn't make sense since all the allocations have to be atomic anyway and
    we cannot steal nodes from process-context users because some users rely
    on radix_tree_insert() succeeding after radix_tree_preload().
    in_interrupt() check is somewhat ugly but we cannot simply key off passed
    gfp_mask as that is acquired from root_gfp_mask() and thus the same for
    all preload users.

    Another part of the fix is to avoid node preallocation in
    radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again,
    preallocation in such case doesn't make sense and when preallocation would
    happen in interrupt we could possibly leak some allocated nodes. However,
    some users of radix_tree_preload() require following radix_tree_insert()
    to succeed. To avoid unexpected effects for these users,
    radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
    and we provide a new function radix_tree_maybe_preload() for those users
    which get different gfp mask from different call sites and which are
    prepared to handle radix_tree_insert() failure.
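
    A sketch of the resulting allocation and preload rules, close to
    lib/radix-tree.c after this change (details simplified):

    static struct radix_tree_node *
    radix_tree_node_alloc(struct radix_tree_root *root)
    {
            struct radix_tree_node *ret = NULL;
            gfp_t gfp_mask = root_gfp_mask(root);

            /* The preload pool is not irq safe: never touch it when in
             * interrupt, fall through to a plain atomic allocation. */
            if (!(gfp_mask & __GFP_WAIT) && !in_interrupt()) {
                    struct radix_tree_preload *rtp;

                    /* Process context that has preloaded: consume a node
                     * from the per-CPU pool. */
                    rtp = &__get_cpu_var(radix_tree_preloads);
                    if (rtp->nr) {
                            ret = rtp->nodes[rtp->nr - 1];
                            rtp->nodes[rtp->nr - 1] = NULL;
                            rtp->nr--;
                    }
            }
            if (ret == NULL)
                    ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
            return ret;
    }

    int radix_tree_maybe_preload(gfp_t gfp_mask)
    {
            if (gfp_mask & __GFP_WAIT)
                    return radix_tree_preload(gfp_mask);
            /* Preloading cannot help an atomic allocation; just disable
             * preemption, as radix_tree_preload() would. */
            preempt_disable();
            return 0;
    }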

    Signed-off-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • It seems pretty unlikely that AFFS supports files over 4GB, but we may as
    well use loff_t here, for cleanliness' sake, instead of truncating it to
    32 bits.

    Signed-off-by: Dan Carpenter
    Cc: Marco Stornelli
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • The patch "s390/vmcore: Implement remap_oldmem_pfn_range for s390" allows
    now to use mmap also on s390.

    So enable mmap for s390 again.

    Signed-off-by: Michael Holzheu
    Cc: HATAYAMA Daisuke
    Cc: Jan Willeke
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Holzheu
     
  • For zfcpdump we can't map the HSA storage because it is only available via
    a read interface. Therefore, for the new vmcore mmap feature we have to
    introduce a new mechanism to create mappings on demand.

    This patch introduces a new architecture function remap_oldmem_pfn_range()
    that should be used to create mappings with remap_pfn_range() for oldmem
    areas that can be directly mapped. For zfcpdump this is everything
    besides the HSA memory. For the areas that are not mapped by
    remap_oldmem_pfn_range(), a new generic vmcore fault handler,
    mmap_vmcore_fault(), is called.

    This handler works as follows:

    * Get already available or new page from page cache (find_or_create_page)
    * Check if /proc/vmcore page is filled with data (PageUptodate)
      - If yes: return that page
      - If no: fill page using __vmcore_read(), set PageUptodate, and
        return the page
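
    A sketch of such a handler following the steps above (names as used in
    this description; the actual patch may differ in details):

    static int mmap_vmcore_fault(struct vm_area_struct *vma,
                                 struct vm_fault *vmf)
    {
            struct address_space *mapping = vma->vm_file->f_mapping;
            pgoff_t index = vmf->pgoff;
            struct page *page;
            loff_t offset;
            char *buf;
            int rc;

            /* Step 1: get an existing or freshly allocated page */
            page = find_or_create_page(mapping, index, GFP_KERNEL);
            if (!page)
                    return VM_FAULT_OOM;

            /* Steps 2-4: fill the page on first touch only */
            if (!PageUptodate(page)) {
                    offset = (loff_t) index << PAGE_CACHE_SHIFT;
                    buf = __va((page_to_pfn(page) << PAGE_SHIFT));
                    rc = __vmcore_read(buf, PAGE_SIZE, &offset, 0);
                    if (rc < 0) {
                            unlock_page(page);
                            page_cache_release(page);
                            return VM_FAULT_SIGBUS;
                    }
                    SetPageUptodate(page);
            }
            unlock_page(page);
            vmf->page = page;
            return 0;
    }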

    Signed-off-by: Michael Holzheu
    Acked-by: Vivek Goyal
    Cc: HATAYAMA Daisuke
    Cc: Jan Willeke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Holzheu
     
  • For s390 we want to use /proc/vmcore for our SCSI stand-alone dump
    (zfcpdump). We have support where the first HSA_SIZE bytes are saved into
    a hypervisor owned memory area (HSA) before the kdump kernel is booted.
    When the kdump kernel starts, it is restricted to use only HSA_SIZE bytes.

    The advantages of this mechanism are:

    * No crashkernel memory has to be defined in the old kernel.
    * Early boot problems (before kexec_load has been done) can be dumped
    * Non-Linux systems can be dumped.

    We modify the s390 copy_oldmem_page() function to read from the HSA memory
    if memory below HSA_SIZE bytes is requested.

    Since we cannot use the kexec tool to load the kernel in this scenario,
    we have to build the ELF header in the 2nd (kdump/new) kernel.

    So with the following patch set we would like to introduce a new
    mechanism that allows the ELF header for /proc/vmcore to be created in
    2nd kernel memory.

    The following steps are done during zfcpdump execution:

    1. Production system crashes
    2. User boots a SCSI disk that has been prepared with the zfcpdump tool
    3. Hypervisor saves CPU state of boot CPU and HSA_SIZE bytes of memory into HSA
    4. Boot loader loads kernel into low memory area
    5. Kernel boots and uses only HSA_SIZE bytes of memory
    6. Kernel saves registers of non-boot CPUs
    7. Kernel does memory detection for dump memory map
    8. Kernel creates ELF header for /proc/vmcore
    9. /proc/vmcore uses this header for initialization
    10. The zfcpdump user space reads /proc/vmcore to write dump to SCSI disk
        - copy_oldmem_page() copies from HSA for memory below HSA_SIZE
        - copy_oldmem_page() copies from real memory for memory above HSA_SIZE

    Currently for s390 we create the ELF core header in the 2nd kernel with a
    small trick. We relocate the addresses in the ELF header in a way that
    for the /proc/vmcore code it seems to be in the 1st kernel (old) memory
    and the read_from_oldmem() returns the correct data. This allows the
    /proc/vmcore code to use the ELF header in the 2nd kernel.

    This patch:

    Exchange the old mechanism with the new and much cleaner function call
    override feature that now officially allows the ELF core header to be
    created in the 2nd kernel.

    To use the new feature the following functions have to be defined by the
    architecture backend code to read from new memory:

    * elfcorehdr_alloc: Allocate ELF header
    * elfcorehdr_free: Free the memory of the ELF header
    * elfcorehdr_read: Read from ELF header
    * elfcorehdr_read_notes: Read from ELF notes
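
    A sketch of the override mechanism (weak generic defaults in
    fs/proc/vmcore.c that an architecture can replace; simplified):

    /* Default: the ELF header lives in 1st kernel (old) memory. s390
     * overrides this with a strong definition that reads the header
     * built in 2nd kernel memory instead. */
    ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos)
    {
            return read_from_oldmem(buf, count, ppos, 0);
    }

    ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos)
    {
            return read_from_oldmem(buf, count, ppos, 0);
    }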

    Signed-off-by: Michael Holzheu
    Acked-by: Vivek Goyal
    Cc: HATAYAMA Daisuke
    Cc: Jan Willeke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Holzheu
     
  • The error handling and ret-from-loop look confusing and inconsistent.

    - "retval >= 0" simply returns

    - "!bprm->file" returns too but with read_unlock() because
    binfmt_lock was already re-acquired

    - "retval != -ENOEXEC || bprm->mm == NULL" does "break" and
    relies on the same check after the main loop

    Consolidate these checks into a single if/return statement.
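
    A sketch of the consolidated exit inside the binfmt loop (the exact
    patch may differ slightly):

    retval = fmt->load_binary(bprm);
    read_lock(&binfmt_lock);
    put_binfmt(fmt);
    if (retval >= 0 || retval != -ENOEXEC ||
        bprm->mm == NULL || bprm->file == NULL) {
            read_unlock(&binfmt_lock);
            return retval;
    }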

    need_retry still checks "retval == -ENOEXEC", but this and the -ENOENT
    initialization before the main loop are not really needed; they only
    matter in the pathological and impossible list_empty(&formats) case.

    It is not clear why we check "bprm->mm == NULL"; probably this check
    should be removed.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Evgeniy Polyakov
    Cc: Zach Levis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • A separate one-liner for better documentation.

    It doesn't make sense to retry if request_module() fails to exec
    /sbin/modprobe, add the additional "request_module() < 0" check.

    However, this logic still doesn't look exactly right:

    1. It would be better to check "request_module() != 0", the user
    space modprobe process should report the correct exit code.
    But I didn't dare to add the user-visible change.

    2. The whole ENOEXEC logic looks suboptimal. Suppose that we try
    to exec a "#!path-to-unsupported-binary" script. In this case
    request_module() + "retry" will be done twice: first by the
    "depth == 1" code, and then again by the "depth == 0" caller
    which doesn't make sense.

    3. And note that in the case above bprm->buf was already changed
    by load_script()->prepare_binprm(), so this looks even more
    ugly.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Evgeniy Polyakov
    Cc: Zach Levis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • search_binary_handler() uses a "for (try=0; try<2; try++)" loop so
    that, under CONFIG_MODULES, it can ask for the matching module and
    retry once. Replace this loop with the more explicit "need_retry"
    logic; this simplifies the code and makes the CONFIG_MODULES block
    easier to follow.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Evgeniy Polyakov
    Cc: Zach Levis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • search_binary_handler() checks ->load_binary != NULL for no reason, this
    method should be always defined. Turn this check into WARN_ON() and move
    it into __register_binfmt().

    Also, kill the local "fn" function pointer. The current code looks
    confusing, as if
    ->load_binary can go away after read_unlock(&binfmt_lock). But we rely on
    module_get(fmt->module), this fmt can't be changed or unregistered,
    otherwise this code is buggy anyway.
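
    A sketch of the resulting registration check (simplified):

    void __register_binfmt(struct linux_binfmt *fmt, int insert)
    {
            BUG_ON(!fmt);
            /* A binfmt without ->load_binary is a bug in its author's
             * code: warn once and refuse to register it. */
            if (WARN_ON(!fmt->load_binary))
                    return;
            write_lock(&binfmt_lock);
            insert ? list_add(&fmt->lh, &formats) :
                     list_add_tail(&fmt->lh, &formats);
            write_unlock(&binfmt_lock);
    }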

    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Evgeniy Polyakov
    Cc: Zach Levis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • When search_binary_handler() succeeds it does allow_write_access() and
    fput(), then it clears bprm->file to ensure the caller will not do the
    same.

    We can simply move this code to exec_binprm() which is called only once.
    In fact we could move this to free_bprm() and remove the same code in
    do_execve_common's error path.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Evgeniy Polyakov
    Cc: Zach Levis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    A separate one-liner with a minor fix.

    PROC_EVENT_EXEC reports the "exec" event, but this message is sent at
    least twice if search_binary_handler() is called by ->load_binary()
    recursively, say, load_script().

    Move it to exec_binprm(), this is "depth == 0" code too.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Evgeniy Polyakov
    Cc: Zach Levis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Nobody except search_binary_handler() should touch ->recursion_depth, "int
    depth" buys nothing but complicates the code, kill it.

    Probably we should also kill "fn" and the !NULL check, ->load_binary
    should be always defined. And it can not go away after read_unlock() or
    this code is buggy anyway.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Evgeniy Polyakov
    Cc: Zach Levis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The task_pid_nr_ns() and trace/ptrace code in the middle of the recursive
    search_binary_handler() looks confusing and, imho, annoying. We only
    need this code when "depth == 0", so let's add a simple helper which
    calls search_binary_handler() and does trace_sched_process_exec() +
    ptrace_event().

    The patch also moves the setting of task->did_exec, we need to do this
    only once.

    Note: we can kill either task->did_exec or PF_FORKNOEXEC.
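
    A sketch of the new helper (simplified; the final patch may differ in
    details):

    static int exec_binprm(struct linux_binprm *bprm)
    {
            pid_t old_pid, old_vpid;
            int ret;

            /* Need to fetch the pid before load_binary changes it */
            old_pid = current->pid;
            rcu_read_lock();
            old_vpid = task_pid_nr_ns(current,
                            task_active_pid_ns(current->parent));
            rcu_read_unlock();

            ret = search_binary_handler(bprm);
            if (ret >= 0) {
                    /* "depth == 0" work, done exactly once per exec */
                    trace_sched_process_exec(current, old_pid, bprm);
                    ptrace_event(PTRACE_EVENT_EXEC, old_vpid);
                    current->did_exec = 1;
            }
            return ret;
    }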

    Signed-off-by: Oleg Nesterov
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: Evgeniy Polyakov
    Cc: Zach Levis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • proc_fd_permission() says "process can still access /proc/self/fd after it
    has executed a setuid()", but the "task_pid() == proc_pid()" check only
    helps if the task is the group leader; /proc/self points to
    /proc/<leader-pid>.

    Change this check to use task_tgid() so that the whole thread group can
    access its /proc/self/fd or /proc/<tgid>/fd.

    Notes:
    - CLONE_THREAD does not require CLONE_FILES so task->files
    can differ, but I don't think this can lead to any security
    problem. And this matches same_thread_group() in
    __ptrace_may_access().

    - /proc/self should probably point to /proc/<thread-id>, but
    it is too late to change the rules. Perhaps it makes sense
    to add /proc/thread though.

    Test-case:

    #include <assert.h>
    #include <dirent.h>
    #include <pthread.h>

    void *tfunc(void *arg)
    {
            /* fails in a non-leader thread before this patch */
            assert(opendir("/proc/self/fd"));
            return NULL;
    }

    int main(void)
    {
            pthread_t t;

            pthread_create(&t, NULL, tfunc, NULL);
            pthread_join(t, NULL);
            return 0;
    }

    fails if, say, this executable is not readable and suid_dumpable = 0.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • mpol_to_str() may fail without filling the buffer (e.g. returning
    -EINVAL), so we need to check its return value; otherwise the buffer
    contents are undefined and the next seq_printf() will misbehave.

    The failure return needs to happen after mpol_cond_put() to match
    get_vma_policy().
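
    A sketch of the fixed call site (simplified from show_numa_map()):

    struct mempolicy *pol = get_vma_policy(task, vma, vma->vm_start);
    int n = mpol_to_str(buffer, sizeof(buffer), pol);

    mpol_cond_put(pol);
    if (n < 0)
            return n;       /* buffer contents are undefined here */

    seq_printf(m, "%08lx %s", vma->vm_start, buffer);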

    Signed-off-by: Chen Gang
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • Cc: "Eric W. Biederman"
    Cc: Andrey Vagin
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Add a new %P variable to be used in core_pattern. This variable contains
    the global PID (PID in the init namespace) as %p contains the PID in the
    current namespace which isn't always what we want.

    The main use for this is to make it easier to handle crashes that happened
    within a container. With this new variable it's possible to have the
    crashes dumped into the container or forwarded to the host with the right
    PID (from the host's point of view).

    Signed-off-by: Stéphane Graber
    Reported-by: Hans Feldt
    Cc: Alexander Viro
    Cc: Eric W. Biederman
    Cc: Andy Whitcroft
    Acked-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stéphane Graber
     
  • Integrate implemented POSIX ACLs support into hfsplus driver.

    Signed-off-by: Vyacheslav Dubeyko
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Hin-Tak Leung
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vyacheslav Dubeyko
     
  • Implement POSIX ACLs support in hfsplus driver.

    Signed-off-by: Vyacheslav Dubeyko
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Hin-Tak Leung
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vyacheslav Dubeyko
     
  • This patchset implements POSIX ACLs support in hfsplus driver.

    Mac OS X, beginning with version 10.4 ("Tiger"), supports NFSv4 ACLs, which
    are part of the NFSv4 standard. HFS+ stores ACLs in the form of
    specially named extended attributes (com.apple.system.Security).

    But this patchset doesn't use "com.apple.system.Security" extended
    attributes. It implements support of POSIX ACLs in the form of extended
    attributes with names "system.posix_acl_access" and
    "system.posix_acl_default". These xattrs are treated only under Linux.
    POSIX ACLs doesn't mean something under Mac OS X. Thereby, this patch
    set provides opportunity to use POSIX ACLs under Linux on HFS+
    filesystem.

    This patch:

    Add the CONFIG_HFSPLUS_FS_POSIX_ACL kernel configuration option, the
    DBG_ACL_MOD debugging flag, and an acl.h file with declarations of the
    essential functions for supporting POSIX ACLs in the hfsplus driver.

    Signed-off-by: Vyacheslav Dubeyko
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Hin-Tak Leung
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vyacheslav Dubeyko
     
  • ep_free() might iterate over a huge set of epitems and hold the CPU for
    too long. Add two cond_resched() calls in order to yield the CPU to
    other tasks. This is safe as we only hold mutexes in this function.
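
    A sketch of the pattern (simplified from ep_free(); only mutexes are
    held, so rescheduling between iterations is safe):

    mutex_lock(&ep->mtx);
    while ((rbp = rb_first(&ep->rbr)) != NULL) {
            epi = rb_entry(rbp, struct epitem, rbn);
            ep_remove(ep, epi);
            cond_resched();         /* yield the CPU on huge epitem sets */
    }
    mutex_unlock(&ep->mtx);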

    Signed-off-by: Eric Dumazet
    Cc: Al Viro
    Cc: Theodore Ts'o
    Acked-by: Eric Wong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     
  • Free the bio_integrity_pool in the failure path of biovec_create_pool()
    in bioset_integrity_create().

    Signed-off-by: Gu Zheng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gu Zheng
     
  • There is a race between marking an inode dirty and the writeback thread;
    see the following scenario. In this case, the writeback thread will not
    run even though there is dirty_io.

    __mark_inode_dirty()                      bdi_writeback_workfn()
    ...                                       ...
    spin_lock(&inode->i_lock);
    ...
    if (bdi_cap_writeback_dirty(bdi)) {
        <<< assume wb has dirty_io, so wakeup_bdi is false.
        <<< the following inode_dirty also has wakeup_bdi false.
        if (!wb_has_dirty_io(&bdi->wb))
            wakeup_bdi = true;
    }
    spin_unlock(&inode->i_lock);
                                              <<< assume last dirty_io is removed here.
                                              pages_written = wb_do_writeback(wb);
                                              ...
                                              <<< work_list empty and wb has no dirty_io,
                                              <<< delayed_work will not be queued.
                                              if (!list_empty(&bdi->work_list) ||
                                                  (wb_has_dirty_io(wb) && dirty_writeback_interval))
                                                  queue_delayed_work(bdi_wq, &wb->dwork,
                                                      msecs_to_jiffies(dirty_writeback_interval * 10));
    spin_lock(&bdi->wb.list_lock);
    inode->dirtied_when = jiffies;
    <<< new dirty_io is added.
    list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
    spin_unlock(&bdi->wb.list_lock);

    <<< though there is dirty_io, wakeup_bdi is false,
    <<< so the writeback thread will not be woken up and
    <<< the new dirty_io will not be flushed.
    if (wakeup_bdi)
        bdi_wakeup_thread_delayed(bdi);

    Writeback will not run until a new flush work is queued. This may cause
    a lot of dirty pages to stay in memory for a long time.

    Signed-off-by: Junxiao Bi
    Reviewed-by: Jan Kara
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junxiao Bi
     
  • The feature prevents mistrusted filesystems (i.e. FUSE mounts created by
    unprivileged users) from growing a large number of dirty pages before
    throttling. For such filesystems balance_dirty_pages always checks bdi
    counters against bdi limits. I.e. even if the global "nr_dirty" is under
    "freerun", it is not allowed to skip bdi checks. The only use case for
    now is fuse: it sets bdi max_ratio to 1% by default and system
    administrators are supposed to expect that this limit won't be exceeded.

    The feature is on if a BDI is marked by the BDI_CAP_STRICTLIMIT flag. A
    filesystem may set the flag when it initializes its BDI.
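
    A sketch of how a filesystem could opt in while setting up its BDI
    (fuse-style values; a sketch, not the exact patch):

    static int example_setup_bdi(struct backing_dev_info *bdi)
    {
            int err;

            err = bdi_init(bdi);
            if (err)
                    return err;

            /* Never allow this BDI more than 1% of dirtyable memory... */
            err = bdi_set_max_ratio(bdi, 1);
            if (err)
                    return err;

            /* ...and enforce bdi limits even below the global "freerun". */
            bdi->capabilities |= BDI_CAP_STRICTLIMIT;
            return 0;
    }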

    The problematic scenario comes from the fact that nobody pays attention to
    the NR_WRITEBACK_TEMP counter (i.e. number of pages under fuse
    writeback). The implementation of fuse writeback releases original page
    (by calling end_page_writeback) almost immediately. A fuse request queued
    for real processing bears a copy of original page. Hence, if userspace
    fuse daemon doesn't finalize write requests in a timely manner, an
    aggressive mmap writer can pollute virtually all memory with those
    temporary fuse page copies. They are carefully accounted in
    NR_WRITEBACK_TEMP, but
    nobody cares.

    To make further explanations shorter, let me use the "NR_WRITEBACK_TEMP
    problem" as a shortcut for "a possibility of uncontrolled growth of the
    amount of RAM consumed by temporary pages allocated by kernel fuse to
    process writeback".

    The problem was very easy to reproduce. There is a trivial example
    filesystem implementation in fuse userspace distribution: fusexmp_fh.c. I
    added "sleep(1);" to the write methods, then recompiled and mounted it.
    Then created a huge file on the mount point and run a simple program which
    mmap-ed the file to a memory region, then wrote a data to the region. An
    hour later I observed almost all RAM consumed by fuse writeback. Since
    then some unrelated changes in kernel fuse made it more difficult to
    reproduce, but it is still possible now.

    Putting this theoretical happens-in-the-lab thing aside, there is another
    thing that really hurts real world (FUSE) users. This is write-through
    page cache policy FUSE currently uses. I.e. handling write(2), kernel
    fuse populates page cache and flushes user data to the server
    synchronously. This is excessively suboptimal. Pavel Emelyanov's patches
    ("writeback cache policy") solve the problem, but they also make resolving
    NR_WRITEBACK_TEMP problem absolutely necessary. Otherwise, simply copying
    a huge file to a fuse mount would result in memory starvation. Miklos,
    the maintainer of FUSE, believes the strictlimit feature is the way to go.

    And eventually, putting FUSE topics aside, there is one more use case for
    the strictlimit feature. Using a slow USB stick (mass storage) in a
    machine with a huge amount of RAM installed is a well-known pain. Let's
    do some simple math. Assuming 64GB of RAM installed, the existing
    implementation of balance_dirty_pages will start throttling only after
    9.6GB of RAM becomes dirty (freerun == 15% of total RAM). So the command
    "cp 9GB_file /media/my-usb-storage/" may return in a few seconds, but a
    subsequent "umount /media/my-usb-storage/" will take more than two hours
    if the effective throughput of the storage is, say, 1MB/sec.

    After inclusion of the strictlimit feature, it will be trivial to add a
    knob (e.g. /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on
    demand, manually or via a udev rule. Maybe I'm wrong, but it seems quite
    a natural desire to limit the amount of dirty memory for devices we do
    not fully trust (in the sense of sustainable throughput).

    [akpm@linux-foundation.org: fix warning in page-writeback.c]
    Signed-off-by: Maxim Patlasov
    Cc: Jan Kara
    Cc: Miklos Szeredi
    Cc: Wu Fengguang
    Cc: Pavel Emelyanov
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Maxim Patlasov
     
  • It's not used globally and could be static.

    Signed-off-by: Wanpeng Li
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Fengguang Wu
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Jiri Kosina
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Pavel reported that if a vma area gets unmapped and then mapped (or
    expanded) in place, the soft dirty tracker won't be able to recognize
    this situation, since it works at the pte level and ptes get zapped on
    unmap, losing the soft dirty bit of course.

    So to resolve this situation we need to track actions at the vma level,
    and this is where the VM_SOFTDIRTY flag comes in. When a new vma area is
    created (or an old one is expanded) we set this bit, and keep it there
    until the application asks for the soft dirty bit to be cleared.

    Thus when a user space application tracks memory changes it can now
    detect whether a vma area has been renewed.
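
    A sketch of how the report side can combine both levels (simplified
    from the pagemap code):

    static inline bool page_is_soft_dirty(struct vm_area_struct *vma,
                                          pte_t pte)
    {
            /* Either this pte was written since the last clear, or the
             * whole vma was created/renewed and carries VM_SOFTDIRTY. */
            return pte_soft_dirty(pte) || (vma->vm_flags & VM_SOFTDIRTY);
    }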

    Reported-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Cc: Andy Lutomirski
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • In case the system contains no dirty pages, wakeup_flusher_threads() will
    submit WB_SYNC_NONE writeback for 0 pages, so wb_writeback() exits
    immediately without doing anything, even though there are dirty inodes in
    the system. Thus sync(1) will write all the dirty inodes in the
    WB_SYNC_ALL writeback pass, which is slow.

    Fix the problem by using get_nr_dirty_pages() in wakeup_flusher_threads()
    instead of calculating the number of dirty pages manually. That function
    also takes the number of dirty inodes into account.
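
    A sketch of the fixed function (simplified):

    void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
    {
            struct backing_dev_info *bdi;

            if (!nr_pages)
                    nr_pages = get_nr_dirty_pages();  /* pages + inodes */

            rcu_read_lock();
            list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
                    if (!bdi_has_dirty_io(bdi))
                            continue;
                    __bdi_start_writeback(bdi, nr_pages, false, reason);
            }
            rcu_read_unlock();
    }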

    Signed-off-by: Jan Kara
    Reported-by: Paul Taysom
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Calling the fiemap ioctl(2) with a given start offset and a desired
    mapping range should show extents if possible. However, we somehow
    figure out the end offset of the mapping via 'mapping_end -= cpos'
    before iterating the extent records, which causes problems if the given
    fiemap length is smaller than a cluster size, e.g.:

    Cluster size 4096:
    debugfs.ocfs2 1.6.3
        Block Size Bits: 12   Cluster Size Bits: 12

    The extended fiemap test utility from David:
    https://gist.github.com/anonymous/6172331

    # dd if=/dev/urandom of=/ocfs2/test_file bs=1M count=1000
    # ./fiemap /ocfs2/test_file 4096 10
    start: 4096, length: 10
    File /ocfs2/test_file has 0 extents:
    #   Logical          Physical         Length           Flags
    ^^^^^
    Reported-by: David Weber
    Tested-by: David Weber
    Cc: Sunil Mushran
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jie Liu
     
  • Variable ip in dlmfs_get_root_inode() is defined but not used. So clean
    it up.

    Signed-off-by: Joseph Qi
    Reviewed-by: Jie Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • In o2hb_shutdown_slot() and o2hb_check_slot(), since the event is defined
    as a local variable, it is only valid during the call stack. So the
    following tiny race may happen in a multi-volume mounted environment:

    o2hb-vol1                                  o2hb-vol2
    1) o2hb_shutdown_slot
       allocate local event1
    2) queue_node_event
       add event1 to global o2hb_node_events
                                               3) o2hb_shutdown_slot
                                                  allocate local event2
                                               4) queue_node_event
                                                  add event2 to global o2hb_node_events
                                               5) o2hb_run_event_list
                                                  delete event1 from o2hb_node_events
    6) o2hb_run_event_list
       event1 empty, return
    7) o2hb_shutdown_slot
       event1 lifecycle ends
                                               8) o2hb_fire_callbacks
                                                  event1 is already *invalid*

    This patch makes the code wait on o2hb_callback_sem when another thread
    is firing callbacks. And for performance reasons, we only call
    o2hb_run_event_list when there is an event queued.

    Signed-off-by: Joyce
    Signed-off-by: Joseph Qi
    Cc: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joyce
     
  • Since o2nm_get_node_by_num() may return NULL, we add this check in
    o2net_accept_one() to avoid a possible NULL pointer dereference.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • The code in o2net_handler_tree_lookup() could easily be broken by
    mistake as currently written. So adjust it to improve readability.

    Signed-off-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • In ocfs2_remove_inode_range(), there is a memory leak. The variable path
    has memory allocated by ocfs2_new_path_from_et(), but it is never freed.

    Signed-off-by: Younger Liu
    Reviewed-by: Jie Liu
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Younger Liu
     
  • In ocfs2_reflink_xattr_rec(), meta_ac and data_ac are allocated by calling
    ocfs2_lock_reflink_xattr_rec_allocators().

    Once an error occurs when allocating *data_ac, it frees *meta_ac, which
    was allocated before. But here it mistakenly sets meta_ac to NULL rather
    than *meta_ac. Then ocfs2_reflink_xattr_rec() will try to free *meta_ac
    again, which is already invalid.
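
    A sketch of the fix on the error path (simplified):

    if (ret) {
            ocfs2_free_alloc_context(*meta_ac);
            /* was: meta_ac = NULL; which left *meta_ac pointing at
             * freed memory for the caller to free again */
            *meta_ac = NULL;
    }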

    Signed-off-by: Joseph Qi
    Reviewed-by: Jie Liu
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     
  • dlm_do_local_recovery_cleanup() should force clean the refmap if the
    owner of the lockres is UNKNOWN. Otherwise the node may hang when
    umounting filesystems. Here's the situation:

    Node1                                      Node2
    dlmlock()
    -> dlm_get_lock_resource()
       send DLM_MASTER_REQUEST_MSG to
       other nodes.
                                               trying to master this lockres,
                                               return MAYBE.
    selected as the master of lockresA,
    set mle->master to Node1,
    and do assert_master,
    send DLM_ASSERT_MASTER_MSG to Node2.
                                               Node2 has interest in lockresA
                                               and returns
                                               DLM_ASSERT_RESPONSE_MASTERY_REF,
                                               then something happened and
                                               Node2 crashed.
    Receiving DLM_ASSERT_RESPONSE_MASTERY_REF,
    Node1 sets Node2 into the refmap, and keeps
    sending DLM_ASSERT_MASTER_MSG to other nodes.

    o2hb finds Node2 down and calls dlm_hb_node_down() -->
    dlm_do_local_recovery_cleanup(); the master of lockresA is still
    UNKNOWN, so dlm_free_dead_locks() is not called.

    The master of lockresA is then set to Node1, but Node2 still remains in
    the refmap.

    When Node1 umounts, it finds that the refmap of lockresA is not empty
    and attempts to migrate it to Node2. But Node2 is already down, so the
    umount hangs, trying to migrate lockresA again and again.

    Signed-off-by: joyce
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Jie Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xue jiufei
     
  • In ocfs2_xattr_set(), if ocfs2_start_trans() fails, meta_ac and data_ac
    should be freed. Otherwise, it would lead to a memory leak.

    Signed-off-by: Younger Liu
    Cc: Joseph Qi
    Reviewed-by: Jie Liu
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Younger Liu
     
  • In ocfs2_xattr_value_attach_refcount(), if an error occurs when calling
    ocfs2_xattr_get_clusters(), the code will behave unpredictably, since
    the local variables p_cluster, num_clusters and ext_flags are declared
    without initialization.

    Signed-off-by: Joseph Qi
    Cc: Joel Becker
    Cc: Mark Fasheh
    Acked-by: Jie Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi