22 Jan, 2014

27 commits

  • Merge first patch-bomb from Andrew Morton:

    - a couple of misc things

    - inotify/fsnotify work from Jan

    - ocfs2 updates (partial)

    - about half of MM

    * emailed patches from Andrew Morton: (117 commits)
    mm/migrate: remove unused function, fail_migrate_page()
    mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages
    mm/migrate: correct failure handling if !hugepage_migration_support()
    mm/migrate: add comment about permanent failure path
    mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure
    mm: compaction: reset scanner positions immediately when they meet
    mm: compaction: do not mark unmovable pageblocks as skipped in async compaction
    mm: compaction: detect when scanners meet in isolate_freepages
    mm: compaction: reset cached scanner pfn's before reading them
    mm: compaction: encapsulate defer reset logic
    mm: compaction: trace compaction begin and end
    memcg, oom: lock mem_cgroup_print_oom_info
    sched: add tracepoints related to NUMA task migration
    mm: numa: do not automatically migrate KSM pages
    mm: numa: trace tasks that fail migration due to rate limiting
    mm: numa: limit scope of lock for NUMA migrate rate limiting
    mm: numa: make NUMA-migrate related functions static
    lib/show_mem.c: show num_poisoned_pages when oom
    mm/hwpoison: add '#' to hwpoison_inject
    mm/memblock: use WARN_ONCE when MAX_NUMNODES passed as input parameter
    ...

    Linus Torvalds
     
  • Pull dlm update from David Teigland:
    "A single change to speed up recovery times when using SCTP
    connections"

    * tag 'dlm-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
    dlm: set zero linger time on sctp socket

    Linus Torvalds
     
  • Pull GFS2 updates from Steven Whitehouse:
    "The main topics this time are allocation, in the form of Bob's
    improvements when searching resource groups and several updates to
    quotas which should increase scalability. The quota changes follow on
    from those in the last merge window, and there will likely be further
    work to come in this area in due course.

    There are also a few patches which help to improve efficiency of
    adding entries into directories, and clean up some of that code.

    One on-disk change is included this time, which is to write some
    additional information which should be useful to fsck and also
    potentially for debugging.

    Other than that, it's just a few small random bug fixes and clean ups"

    * tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw: (24 commits)
    GFS2: revert "GFS2: d_splice_alias() can't return error"
    GFS2: Small cleanup
    GFS2: Don't use ENOBUFS when ENOMEM is the correct error code
    GFS2: Fix kbuild test robot reported warning
    GFS2: Move quota bitmap operations under their own lock
    GFS2: Clean up quota slot allocation
    GFS2: Only run logd and quota when mounted read/write
    GFS2: Use RCU/hlist_bl based hash for quotas
    GFS2: No need to invalidate pages for a dio read
    GFS2: Add initialization for address space in super block
    GFS2: Add hints to directory leaf blocks
    GFS2: For exhash conversion, only one block is needed
    GFS2: Increase i_writecount during gfs2_setattr_chown
    GFS2: Remember directory insert point
    GFS2: Consolidate transaction blocks calculation for dir add
    GFS2: Add directory addition info structure
    GFS2: Use only a single address space for rgrps
    GFS2: Use range based functions for rgrp sync/invalidation
    GFS2: Remove test which is always true
    GFS2: Remove gfs2_quota_change_host structure
    ...

    Linus Torvalds
     
  • Many load balancing and workload placing programs check /proc/meminfo to
    estimate how much free memory is available. They generally do this by
    adding up "free" and "cached", which was fine ten years ago, but is
    pretty much guaranteed to be wrong today.

    It is wrong because Cached includes memory that is not freeable as page
    cache, for example shared memory segments, tmpfs, and ramfs, and it does
    not include reclaimable slab memory, which can take up a large fraction
    of system memory on mostly idle systems with lots of files.

    Currently, the amount of memory that is available for a new workload,
    without pushing the system into swap, can be estimated from MemFree,
    Active(file), Inactive(file), and SReclaimable, as well as the "low"
    watermarks from /proc/zoneinfo.

    However, this may change in the future, and user space really should not
    be expected to know kernel internals to come up with an estimate for the
    amount of free memory.

    It is more convenient to provide such an estimate in /proc/meminfo. If
    things change in the future, we only have to change it in one place.
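
    As a rough illustration of the user-space estimate described above,
    the sketch below combines the named /proc/meminfo fields with a low
    watermark. The function names are hypothetical, and the rule of
    discounting half of the page cache and reclaimable slab (capped at the
    low watermark) is an assumption for this example, not something the
    message specifies.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def estimate_available_kb(fields, low_watermark_kb):
    """Rough estimate of memory usable without swapping, in kB.

    Assumption: up to half of the page cache and half of the reclaimable
    slab (capped at the low watermark) is treated as not reclaimable.
    """
    available = fields["MemFree"] - low_watermark_kb
    pagecache = fields["Active(file)"] + fields["Inactive(file)"]
    pagecache -= min(pagecache // 2, low_watermark_kb)
    slab = fields["SReclaimable"]
    slab -= min(slab // 2, low_watermark_kb)
    return max(available + pagecache + slab, 0)


# Made-up sample values, not taken from a real system.
SAMPLE = """\
MemFree:            1000 kB
Active(file):        400 kB
Inactive(file):      600 kB
SReclaimable:        200 kB
"""
```

    In real use the low watermark would have to be summed from
    /proc/zoneinfo and converted from pages to kB, which is exactly the
    kind of detail the message argues user space should not need to know.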

    Signed-off-by: Rik van Riel
    Reported-by: Erik Mouw
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • The ramfs is always built in. It will never be modular, so using
    module_init as an alias for __initcall is rather misleading.

    Fix this up now, so that we can relocate module_init from init.h into
    module.h in the future. If we don't do this, we'd have to add module.h
    to obviously non-modular code, and that would be a worse thing.

    Note that direct use of __initcall is discouraged, vs. one of the
    priority categorized subgroups. As __initcall gets mapped onto
    device_initcall, our use of fs_initcall (which makes sense for fs code)
    will thus change this registration from level 6-device to level 5-fs
    (i.e. slightly earlier). However, no impact of that small difference
    has been observed during testing, or is expected.

    Also note that this change uncovers a missing semicolon bug in the
    registration of the initcall.
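
    The effect of the level change can be pictured with a small model. The
    level numbers 5 (fs) and 6 (device) come from the message above; the
    function names and the sort-by-level model are illustrative
    assumptions, not kernel code.

```python
# Level numbers as stated in the commit message: 5 = fs, 6 = device.
LEVELS = {"fs_initcall": 5, "device_initcall": 6}


def boot_order(registrations):
    """Return function names in the order a level-sorted boot would call them.

    registrations: list of (function_name, initcall_macro) in link order.
    sorted() is stable, so link order is preserved within one level.
    """
    ordered = sorted(registrations, key=lambda reg: LEVELS[reg[1]])
    return [name for name, _macro in ordered]


# Hypothetical names: a driver registered at device level, and ramfs
# before and after the switch from device level to fs level.
before = [("some_driver_init", "device_initcall"),
          ("ramfs_init", "device_initcall")]
after = [("some_driver_init", "device_initcall"),
         ("ramfs_init", "fs_initcall")]
```

    With these inputs, boot_order(before) keeps the driver first, while
    boot_order(after) moves the ramfs registration ahead of it.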

    Signed-off-by: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     
  • On the failure path, alloc_super() calls destroy_super(), which issues a
    warning if the sb's s_mounts list is not empty, in particular if it has
    not been initialized. Therefore s_mounts must be initialized in
    alloc_super() before any possible failure, but currently it is
    initialized close to the end of the function, leading to a useless
    warning dumped to the log if either percpu_counter_init() or
    list_lru_init() fails. Let's fix this.

    Signed-off-by: Vladimir Davydov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The compat_do_readv_writev() function was doing a verify_area on the
    incoming iov without first checking the nr_segs value. If someone
    passes in -1 for nr_segs, for instance, the function should return
    EINVAL. Instead it returns EFAULT, because verify_area fails when asked
    to check an array of size MAX_UINT. The check is redundant in any case,
    because the next check, compat_rw_copy_check_uvector(), does all the
    necessary checking. The non-compat do_readv_writev() function
    doesn't do this check, so I think it's safe to just remove the code.
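
    A minimal sketch of the range check that makes the removed test
    unnecessary. UIO_MAXIOV (1024) is the kernel's per-call iovec limit;
    the function name and the 32-bit reinterpretation of the count are
    assumptions made for this illustration.

```python
import errno

UIO_MAXIOV = 1024  # kernel limit on iovec entries per call


def check_nr_segs(nr_segs):
    """Return 0 if the segment count is acceptable, else -EINVAL.

    The count is reinterpreted as a 32-bit unsigned value first, so a
    caller passing -1 is seen as 4294967295 and fails the range check
    before any user memory is examined.
    """
    as_unsigned = nr_segs & 0xFFFFFFFF
    if as_unsigned > UIO_MAXIOV:
        return -errno.EINVAL
    return 0
```

    Rejecting the count first means an oversized request yields EINVAL
    rather than the EFAULT produced by probing an enormous array.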

    Signed-off-by: Corey Minyard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Corey Minyard
     
  • We cap "nmsgs" at I2C_RDRW_IOCTL_MAX_MSGS (42) but the current code
    allows negative values. It's harmless but it makes my static checker
    upset so I've made nmsgs unsigned.

    Signed-off-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • Uninline vast tracts of nested inline functions in
    include/linux/posix_acl.h.

    This reduces the text+data+bss size of x86_64 allyesconfig vmlinux by
    8026 bytes.

    The patch also regularises the positioning of the EXPORT_SYMBOLs in
    posix_acl.c.

    Cc: Alexander Viro
    Cc: J. Bruce Fields
    Cc: Trond Myklebust
    Tested-by: Benny Halevy
    Cc: Benny Halevy
    Cc: Andreas Gruenbacher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • In a two-node cluster, Node A and Node B mount the same ocfs2 volume
    and create a file named 1.

    Node A                          Node B
    open 1, get open lock
                                    rm 1, and then add 1 to orphan_dir
    storage link down,
    o2hb_write_timeout
     ->o2quo_disk_timeout
       ->emergency_restart
                                    at the moment, Node B dismount and do
                                    ocfs2rec simultaneously
                                    1) ocfs2_dismount_volume
                                       ->ocfs2_recovery_exit
                                         ->wait_event(osb->recovery_event)
                                         ->flush_workqueue(ocfs2_wq)
                                    2) ocfs2rec
                                       ->queue_work(&journal->j_recovery_work)
                                         ->ocfs2_recover_orphans
                                           ->ocfs2_commit_truncate
                                             ->queue_delayed_work(&osb->osb_truncate_log_wq)

    In ocfs2_recovery_exit, it flushes workqueue and then releases system
    inodes. When doing ocfs2rec, it will call ocfs2_flush_truncate_log
    which will try to get sys_root_inode, and NULL pointer dereference
    occurs.

    Signed-off-by: Yiwen Jiang
    Signed-off-by: joyce
    Signed-off-by: Joseph Qi
    Cc: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yiwen Jiang
     
  • An unreserve space ioctl OCFS2_IOC_UNRESVSP/64 should reject a negative
    length.

    Orabug:14789508

    Signed-off-by: Tariq Saeed
    Signed-off-by: Srinivas Eeda
    Cc: Joel Becker
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tariq Saeed
     
  • Fixes the following sparse warning:

    fs/ocfs2/stack_user.c:930:32: warning:
    symbol 'ocfs2_ls_ops' was not declared. Should it be static?

    Signed-off-by: Wei Yongjun
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yongjun
     
  • Adjust minlen to discard_granularity for the FITRIM ioctl(2) if the
    given minimum size in bytes is smaller, because the discard granularity
    is the minimum extent size that the storage device can discard.

    This is inspired by ext4 commit 5c2ed62fd447 ("ext4: Adjust minlen with
    discard_granularity in the FITRIM ioctl") from Lukas Czerner.
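
    The adjustment amounts to a simple clamp. The function name below is
    hypothetical and both values are assumed to be in bytes.

```python
def effective_minlen(requested_minlen, discard_granularity):
    """Raise the requested minimum extent size to the device granularity.

    Extents smaller than the discard granularity cannot be discarded by
    the device, so a smaller request is bumped up to that size.
    """
    return max(requested_minlen, discard_granularity)
```

    For example, a request with minlen 512 on a device with a 4096-byte
    granularity is treated as minlen 4096, while a request of 8192 is left
    unchanged.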

    Signed-off-by: Jie Liu
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jie Liu
     
  • For the FITRIM ioctl(2), we should not stay silent if the given range
    length is less than a block size, as no data blocks would be
    discarded. It should return EINVAL instead. This issue can be
    verified via xfstests/generic/288, which is used for FITRIM argument
    handling tests.

    Signed-off-by: Jie Liu
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jie Liu
     
  • For the FITRIM ioctl(2), we should return EOPNOTSUPP to inform the user
    when the storage device does not support discard. Otherwise, returning
    success would confuse the user even though no free blocks were
    trimmed at all.

    Signed-off-by: Jie Liu
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jie Liu
     
  • ocfs2_alloc_dinode_update_counts() and ocfs2_block_group_set_bits() are
    already provided in suballoc.c. So, the same functions in
    move_extents.c are not needed any more.

    Declare the functions in suballoc.h and remove redundant functions in
    move_extents.c.

    Signed-off-by: Younger Liu
    Cc: Younger Liu
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Younger Liu
     
  • Attempt to use the new DLM operations. If it is not supported, use the
    traditional ocfs2_controld.

    To exchange ocfs2 versioning, we use the LVB of the version dlm lock.
    It first attempts to take the lock in EX mode (non-blocking). If
    successful (which means it is the first mount), it writes the version
    number and downconverts to PR lock. If it is unsuccessful, it reads the
    version from the lock.
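
    The negotiation described above can be sketched as a toy model. The
    class and function names are hypothetical, and the lock is a plain
    in-process object rather than a real DLM lock; only the EX-then-PR
    sequence and the use of the LVB follow the message.

```python
class VersionLock:
    """Toy model of a cluster-wide lock carrying a lock value block (LVB)."""

    def __init__(self):
        self.holders = {}   # node name -> "EX" or "PR"
        self.lvb = None     # version tuple once the first mounter wrote it

    def try_lock_ex(self, node):
        """Non-blocking exclusive request: fails if anyone holds the lock."""
        if self.holders:
            return False
        self.holders[node] = "EX"
        return True

    def downconvert_to_pr(self, node):
        self.holders[node] = "PR"

    def lock_pr(self, node):
        """Shared request: compatible with other PR holders only."""
        if "EX" in self.holders.values():
            raise RuntimeError("would block: EX holder present")
        self.holders[node] = "PR"


def negotiate_version(lock, node, my_version):
    """Return the protocol version this node should use."""
    if lock.try_lock_ex(node):
        lock.lvb = my_version          # first mount: publish our version
        lock.downconvert_to_pr(node)
        return my_version
    lock.lock_pr(node)                 # later mount: read what was published
    return lock.lvb
```

    In this model the first node to mount decides the version, and every
    later node adopts the value it reads from the LVB.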

    If this becomes the standard (with o2cb as well), it could simplify
    userspace tools to check if the filesystem is mounted on other nodes.

    Dan: Since ocfs2_protocol_version are two u8 values, the additional
    checks with LONG* don't make sense.

    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Dan Carpenter
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • Use the native DLM locks for version control negotiation. Most of the
    framework is taken from gfs2/lock_dlm.c

    Signed-off-by: Goldwyn Rodrigues
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • This is done to differentiate between using and not using controld and
    use the connection information accordingly.

    We need to be backward compatible. So, we use a new enum
    ocfs2_connection_type to identify when controld is used and when it is
    not.

    Signed-off-by: Goldwyn Rodrigues
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • We perform this because the DLM recovery callbacks will require the
    ocfs2_live_connection structure to record the node information when
    dlm_new_lockspace() is updated (in the last patch of the series).

    Before calling dlm_new_lockspace(), we need the structure ready for the
    .recover_done() callback, which would set oc_this_node. This is the
    reason we allocate ocfs2_live_connection beforehand in user_connect().

    [AKPM] rc initialization is not required because it is assigned in case
    of errors. It will be cleared by the compiler anyway.

    Signed-off-by: Goldwyn Rodrigues
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • These are the callbacks called by the fs/dlm code when the membership
    changes. If there is a failure while calling any of these, the
    DLM creates a new membership and relays it to the rest of the nodes.

    - recover_prep() is called when DLM understands a node is down.
    - recover_slot() is called once all nodes have acknowledged
    recover_prep and recovery can begin.
    - recover_done() is called once the recovery is complete. It returns
    the new membership.

    Signed-off-by: Goldwyn Rodrigues
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • This is an effort to remove ocfs2_controld.pcmk and bring ocfs2 DLM
    handling up to date with respect to DLM (>=4.0.1) and corosync
    (2.3.x). AFAIK, cman is also being phased out in favor of a unified
    corosync cluster stack.

    fs/dlm performs all the functions with respect to fencing and node
    management and provides the APIs to do so for ocfs2. In all later
    references, DLM stands for the fs/dlm code.

    The advantages are:
    + No need to run an additional userspace daemon (ocfs2_controld)
    + No controld device handling and controld protocol
    + Shifting responsibilities of node management to DLM layer

    For backward compatibility, we are keeping the controld handling code.
    Once enough time has passed we can remove a significant portion of the
    code. This was tested by using the kernel with changes on older
    unmodified tools. The kernel used ocfs2_controld as expected, and
    displayed the appropriate warning message.

    This feature requires modification in the userspace ocfs2-tools. The
    changes can be found at: https://github.com/goldwynr/ocfs2-tools (branch:
    nocontrold). Currently, not many checks are present in the userspace
    code, but that will change soon.

    This patch (of 6):

    Add clustername to cluster connection.

    Signed-off-by: Goldwyn Rodrigues
    Reviewed-by: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • The versioning information is confusing for end-users. The numbers are
    stuck at 1.5.0 while the tools version has moved to 1.8.2. Remove the
    versioning system in the OCFS2 modules and let the kernel version be the
    guide to debug issues.

    Signed-off-by: Goldwyn Rodrigues
    Acked-by: Sunil Mushran
    Cc: Mark Fasheh
    Acked-by: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Goldwyn Rodrigues
     
  • We usually rely on the fact that struct members not specified in the
    initializer are set to NULL. So do that with fsnotify function pointers
    as well.

    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • After removing event structure creation from the generic layer there is
    no reason for separate .should_send_event and .handle_event callbacks.
    So just remove the first one.

    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Currently the fsnotify framework creates one event structure for each
    notification event and links this event into all interested notification
    groups. This is done to save memory when several notification groups are
    interested in the event. However, the need for an event structure shared
    between inotify & fanotify bloats the event structure, so the result is
    often higher memory consumption.

    Another problem is that the fsnotify framework keeps path references with
    outstanding events so that fanotify can return open file descriptors
    with its events. This has the undesirable effect that a filesystem cannot
    be unmounted while there are outstanding events - a regression for
    inotify compared to the situation before it was converted to the
    fsnotify framework. For fanotify this problem is hard to avoid, and
    users of fanotify should expect this behavior when they ask for file
    descriptors from notified files.

    This patch changes fsnotify and its users to create a separate event
    structure for each group. This allows for much simpler code (~400 lines
    removed by this patch) and also smaller event structures. For example,
    on a 64-bit system the original struct fsnotify_event consumes 120 bytes,
    plus additional space for the file name, an additional 24 bytes for the
    second and each subsequent group linking the event, and an additional 32
    bytes for each inotify group for private data. After the conversion an
    inotify event consumes 48 bytes plus space for the file name, which is
    considerably less memory unless file names are long and there are several
    groups interested in the events (both of which are uncommon). A fanotify
    event fits in 56 bytes after the conversion (fanotify doesn't care about
    file names, so its events don't need one allocated). This is a win unless
    there are four or more fanotify groups interested in the event.

    The conversion also solves the problem with unmount when only inotify is
    used as we don't have to grab path references for inotify events.
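
    The size figures quoted above can be checked with a little arithmetic.
    The function names are hypothetical, and space for file names is left
    out of all three formulas.

```python
def old_event_bytes(groups, inotify_groups=0):
    """Shared event: 120 bytes, 24 per extra group, 32 per inotify group."""
    return 120 + 24 * (groups - 1) + 32 * inotify_groups


def new_fanotify_bytes(groups):
    """Per-group scheme: one 56-byte event per interested fanotify group."""
    return 56 * groups


def new_inotify_bytes(groups):
    """Per-group scheme: one 48-byte event per interested inotify group."""
    return 48 * groups
```

    With these figures a single inotify group drops from 152 bytes to 48.
    For fanotify the two schemes tie at three groups (168 bytes each), and
    the per-group scheme costs more from four groups on (224 versus 192).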

    [hughd@google.com: fanotify: fix corruption preventing startup]
    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Rounding of name length when passing it to userspace was done in several
    places. Provide a function to do it and use it in all places.
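
    A helper of this kind might look like the sketch below. The function
    name, the 16-byte alignment, and the extra byte for a terminating NUL
    are assumptions made for illustration; the message itself does not
    state them.

```python
def round_event_name_len(name, align=16):
    """Bytes reserved for a name: length plus a NUL, rounded up to `align`.

    An empty name reserves nothing.
    """
    if not name:
        return 0
    needed = len(name.encode()) + 1
    return (needed + align - 1) // align * align
```

    Keeping the rounding in one function means every caller that copies an
    event to userspace agrees on the padded length.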

    Signed-off-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

21 Jan, 2014

1 commit

  • Pull driver core / sysfs patches from Greg KH:
    "Here's the big driver core and sysfs patch set for 3.14-rc1.

    There's a lot of work here moving sysfs logic out into a "kernfs" to
    allow other subsystems to also have a virtual filesystem with the same
    attributes of sysfs (handle device disconnect, dynamic creation /
    removal as needed / unneeded, etc)

    This is primarily being done for the cgroups filesystem, but the goal
    is to also move debugfs to it when it is ready, solving all of the
    known issues in that filesystem as well. The code isn't completed
    yet, but all should be stable now (there is a big section that was
    reverted due to problems found when testing)

    There's also some other smaller fixes, and a driver core addition that
    allows for a "collection" of objects, that the DRM people will be
    using soon (it's in this tree to make merges after -rc1 easier)

    All of this has been in linux-next with no reported issues"

    * tag 'driver-core-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (113 commits)
    kernfs: associate a new kernfs_node with its parent on creation
    kernfs: add struct dentry declaration in kernfs.h
    kernfs: fix get_active failure handling in kernfs_seq_*()
    Revert "kernfs: fix get_active failure handling in kernfs_seq_*()"
    Revert "kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitq"
    Revert "kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()"
    Revert "kernfs: remove KERNFS_REMOVED"
    Revert "kernfs: restructure removal path to fix possible premature return"
    Revert "kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()"
    Revert "kernfs: remove kernfs_addrm_cxt"
    Revert "kernfs: make kernfs_get_active() block if the node is deactivated but not removed"
    Revert "kernfs: implement kernfs_{de|re}activate[_self]()"
    Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappers"
    Revert "pci: use device_remove_file_self() instead of device_schedule_callback()"
    Revert "scsi: use device_remove_file_self() instead of device_schedule_callback()"
    Revert "s390: use device_remove_file_self() instead of device_schedule_callback()"
    Revert "sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()"
    Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()"
    kernfs: remove unnecessary NULL check in __kernfs_remove()
    drivers/base: provide an infrastructure for componentised subsystems
    ...

    Linus Torvalds
     

18 Jan, 2014

3 commits

  • 0d0d110720d7960b77c03c9f2597faaff4b484ae asserts that "d_splice_alias()
    can't return error unless it was given an IS_ERR(inode)".

    That was true of the implementation of d_splice_alias, but this is
    really a problem with d_splice_alias: at a minimum it should be able to
    return -ELOOP in the case where inserting the given dentry would cause a
    directory loop.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Steven Whitehouse

    J. Bruce Fields
     
  • Pull namespace fixes from Eric Biederman:
    "This is a set of 3 regression fixes.

    This fixes /proc/mounts when using "ip netns add " to display
    the actual mount point.

    This fixes a regression in clone that broke lxc-attach.

    This fixes a regression in the permission checks for mounting /proc
    that made proc unmountable if binfmt_misc was in use. Oops.

    My apologies for sending this pull request so late. Al Viro gave
    interesting review comments about the d_path fix that I wanted to
    address in detail before I sent this pull request. Unfortunately a
    bad round of colds kept me from addressing that in detail until today.
    The executive summary of the review was:

    Al: Is patching d_path really sufficient?
    The prepend_path, d_path, d_absolute_path, and __d_path family of
    functions is really a mess.

    Me: Yes, patching d_path is really sufficient. Yes, the code is a mess.
    No it is not appropriate to rewrite all of d_path for a regression
    that has existed for entirely too long already, when a two line
    change will do"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    vfs: Fix a regression in mounting proc
    fork: Allow CLONE_PARENT after setns(CLONE_NEWPID)
    vfs: In d_path don't call d_dname on a mount point

    Linus Torvalds
     
  • Once created, a kernfs_node is always destroyed by kernfs_put().
    Since ba7443bc656e ("sysfs, kernfs: implement
    kernfs_create/destroy_root()"), kernfs_put() depends on kernfs_root()
    to locate the ino_ida. kernfs_root() in turn depends on
    kernfs_node->parent being set for !dir nodes. This means that
    kernfs_put() of a !dir node requires its ->parent to be initialized.

    This leads to an oops when a newly created !dir node is destroyed without
    going through kernfs_add_one() or after failing kernfs_add_one()
    before ->parent is set. kernfs_root() invoked from kernfs_put() will
    try to dereference a NULL parent.

    Fix it by moving parent association to kernfs_new_node() from
    kernfs_add_one(). kernfs_new_node() now takes @parent instead of
    @root and determines the root from the parent and also sets the new
    node's parent properly. @parent parameter is removed from
    kernfs_add_one(). As there's no parent when creating the root node,
    __kernfs_new_node() which takes @root as before and doesn't set the
    parent is used in that case.

    This ensures that a kernfs_node in any stage in its life has its
    parent associated and thus can be put.
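
    The dependency chain can be modelled in a few lines. This is a toy
    sketch with hypothetical class and function names, not the kernfs
    code: directories record their root, other nodes reach it through
    their parent, and destruction needs the root.

```python
class Root:
    def __init__(self):
        self.released = []   # stands in for returning an inode number


class Node:
    def __init__(self, name, is_dir, root=None, parent=None):
        self.name = name
        self.is_dir = is_dir
        self.parent = parent
        self.root = root if is_dir else None   # only directories record it


def node_root(node):
    """Locate the root: non-directories must go through their parent."""
    if node.parent is not None:
        node = node.parent
    return node.root


def new_node(parent, name, is_dir=False):
    """Create a node already linked to its parent (the fixed ordering)."""
    return Node(name, is_dir, root=node_root(parent), parent=parent)


def put(node):
    """Destroy a node; this needs the root to release its resources."""
    root = node_root(node)
    if root is None:
        raise RuntimeError("cannot find root: parent was never set")
    root.released.append(node.name)
```

    A node built through new_node() can always be put, while a
    non-directory node constructed without a parent fails in put(), which
    corresponds to the NULL dereference described above.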

    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

16 Jan, 2014

3 commits


15 Jan, 2014

6 commits

  • Well I don't get the same warning locally as the kbuild
    robot, but I guess this should fix the problem, anyway.
    Here is the warning:

    head: 2d9e72303d538024627fb1fe2cbde48aec12acc0
    commit: ee2411a8db49a21bc55dc124e1b434ba194c8903 [19/20] GFS2: Clean up quota slot allocation
    config: make ARCH=powerpc allmodconfig

    All error/warnings:

    fs/gfs2/quota.c: In function 'gfs2_quota_init':
    >> fs/gfs2/quota.c:1246:3: error: implicit declaration of function '__vmalloc' [-Werror=implicit-function-declaration]
    sdp->sd_quota_bitmap = __vmalloc(bm_size, GFP_NOFS, PAGE_KERNEL);
    ^
    >> fs/gfs2/quota.c:1246:24: warning: assignment makes pointer from integer without a cast [enabled by default]
    sdp->sd_quota_bitmap = __vmalloc(bm_size, GFP_NOFS, PAGE_KERNEL);
    ^
    fs/gfs2/quota.c: In function 'gfs2_quota_cleanup':
    >> fs/gfs2/quota.c:1361:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
    vfree(sdp->sd_quota_bitmap);

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • There is a bug in the function nilfs_segctor_collect that results in
    active data being written to a segment that is marked as clean. This
    segment may then be selected for a later segment construction, whereby
    the old data is overwritten.

    The problem shows itself with the following kernel log message:

    nilfs_sufile_do_cancel_free: segment 6533 must be clean

    Usually a few hours later the file system gets corrupted:

    NILFS: bad btree node (blocknr=8748107): level = 0, flags = 0x0, nchildren = 0
    NILFS error (device sdc1): nilfs_bmap_last_key: broken bmap (inode number=114660)

    The issue can be reproduced with a file system that is nearly full and
    with the cleaner running, while some IO intensive task is running,
    although it is quite hard to reproduce.

    This is what happens:

    1. The cleaner starts the segment construction
    2. nilfs_segctor_collect is called
    3. sc_stage is on NILFS_ST_SUFILE and segments are freed
    4. sc_stage is on NILFS_ST_DAT current segment is full
    5. nilfs_segctor_extend_segments is called, which
    allocates a new segment
    6. The new segment is one of the segments freed in step 3
    7. nilfs_sufile_cancel_freev is called and produces an error message
    8. Loop around and the collection starts again
    9. sc_stage is on NILFS_ST_SUFILE and segments are freed
    including the newly allocated segment, which will contain active
    data and can be allocated at a later time
    10. A few hours later another segment construction allocates the
    segment and causes file system corruption

    This can be prevented by simply reordering the statements. If
    nilfs_sufile_cancel_freev is called before nilfs_segctor_extend_segments
    the freed segments are marked as dirty and cannot be allocated any more.
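
    The effect of the reordering can be simulated with a toy model of
    steps 3 to 9. All names are hypothetical, and the rule that allocation
    picks the lowest-numbered clean segment is an assumption chosen so the
    collision is reproducible.

```python
CLEAN, DIRTY = "clean", "dirty"


def run_pass(segments, to_free, cancel_first):
    """Simulate an interrupted collection pass followed by its retry.

    segments: dict of segment number -> state. to_free: segments the
    cleaner frees in the SUFILE stage. Returns the segment allocated for
    new data and its state after the retry has freed segments again.
    """
    def free_segments():
        for seg in to_free:
            segments[seg] = CLEAN

    def cancel_free():
        for seg in to_free:
            segments[seg] = DIRTY

    def extend():
        # Assumed policy: take the lowest-numbered clean segment.
        seg = min(s for s, state in segments.items() if state == CLEAN)
        segments[seg] = DIRTY
        return seg

    free_segments()                 # step 3
    if cancel_first:                # fixed order
        cancel_free()
        allocated = extend()
    else:                           # original order (steps 5 to 7)
        allocated = extend()
        cancel_free()
    free_segments()                 # step 9: the retry frees again
    return allocated, segments[allocated]
```

    In the original order the newly allocated segment is one of the freed
    ones and ends up marked clean while holding active data. With the
    cancel step first, allocation has to pick a different segment.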

    Signed-off-by: Andreas Rohner
    Reviewed-by: Ryusuke Konishi
    Tested-by: Andreas Rohner
    Signed-off-by: Ryusuke Konishi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Rohner
     
  • Gradually, the global qd_lock is being used for less and less.
    After this patch it will only be used for the per super block
    list whose purpose is to allow syncing of changes back to the
    master quota file from the local quota changes file. Fixing
    up that process to make it more efficient will be the subject
    of a later patch, however this patch removes another barrier
    to doing that.

    Signed-off-by: Steven Whitehouse
    Cc: Abhijith Das

    Steven Whitehouse
     
  • Quota slot allocation has historically used a vector of pages
    and a set of homegrown find/test/set/clear bit functions. Since
    the size of the bitmap is likely to be based on the default
    qc file size, that's a couple of pages at most. So we ought
    to be able to allocate that as a single chunk, with a vmalloc
    fallback, just in case of memory fragmentation.

    We are then able to use the kernel's own find/test/set/clear
    bit functions, rather than rolling our own.
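
    A single contiguous bitmap with find/test/set/clear operations can be
    sketched as follows. The class and function names are hypothetical and
    the linear scan is only a stand-in for the kernel's bit helpers.

```python
class SlotBitmap:
    def __init__(self, nslots):
        self.nslots = nslots
        self.bits = bytearray((nslots + 7) // 8)   # one contiguous chunk

    def test(self, slot):
        return bool(self.bits[slot // 8] & (1 << (slot % 8)))

    def set(self, slot):
        self.bits[slot // 8] |= 1 << (slot % 8)

    def clear(self, slot):
        self.bits[slot // 8] &= ~(1 << (slot % 8)) & 0xFF

    def find_first_zero(self):
        """Return the lowest free slot, or nslots if the bitmap is full."""
        for slot in range(self.nslots):
            if not self.test(slot):
                return slot
        return self.nslots


def alloc_slot(bitmap):
    """Claim the lowest free slot, or return None when none is left."""
    slot = bitmap.find_first_zero()
    if slot >= bitmap.nslots:
        return None
    bitmap.set(slot)
    return slot
```

    Because the bitmap is one flat buffer, slot allocation reduces to a
    find-first-zero followed by a set, with no per-page bookkeeping.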

    Signed-off-by: Steven Whitehouse
    Cc: Abhijith Das

    Steven Whitehouse
     
  • While investigating a rather strange bit of code in the quota
    clean up function, I spotted that the reason for its existence
    was that when remounting read only, we were not stopping the
    quotad thread, and thus it was possible for it to still have
    a reference to some of the quotas in that case.

    This patch moves the logd and quota thread start and stop into
    the make_fs_rw/ro functions, so that we now stop those threads
    when mounted read only.

    This means that quotad will always be stopped before we call
    the quota clean up function, and we can thus dispose of the
    (rather hackish) code that waits for it to give up its
    reference on the quotas.

    Signed-off-by: Steven Whitehouse
    Cc: Abhijith Das

    Steven Whitehouse
     
  • Prior to this patch, GFS2 kept all the quotas for each
    super block in a single linked list. This is rather slow
    when there are large numbers of quotas.

    This patch introduces a hlist_bl based hash table, similar
    to the one used for glocks. The initial look up of the quota
    is now lockless in the case where it is already cached,
    although we still have to take the per quota spinlock in
    order to bump the ref count. Either way though, this is a
    big improvement on what was there before.
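
    The gain from hashing can be seen in a simplified model. The class
    name, bucket count, and entry layout are hypothetical, and no locking
    or RCU is modelled; the point is only that a lookup scans one short
    chain instead of the whole per super block list.

```python
class QuotaCache:
    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def get(self, kind, ident):
        """Return the cached entry for (kind, ident), creating it if absent."""
        key = (kind, ident)
        bucket = self._bucket(key)
        for entry in bucket:              # scan one chain only
            if entry["key"] == key:
                entry["refs"] += 1        # the real code takes a lock here
                return entry
        entry = {"key": key, "refs": 1}
        bucket.append(entry)
        return entry
```

    A repeated get() for the same quota returns the cached entry with its
    reference count bumped, which mirrors the cached lookup path described
    above.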

    The qd_lock and the per super block list are preserved for
    the time being. However, since the list is no longer used
    for its original role, it should be possible to shrink the
    number of items on that list in due course and remove the
    requirement to take qd_lock in qd_get.

    Signed-off-by: Steven Whitehouse
    Cc: Abhijith Das
    Cc: Paul E. McKenney

    Steven Whitehouse