31 Mar, 2011

1 commit


29 Mar, 2011

1 commit

  • * 'for-linus' of git://oss.sgi.com/xfs/xfs:
    xfs: stop using the page cache to back the buffer cache
    xfs: register the inode cache shrinker before quotachecks
    xfs: xfs_trans_read_buf() should return an error on failure
    xfs: introduce inode cluster buffer trylocks for xfs_iflush
    vmap: flush vmap aliases when mapping fails
    xfs: preallocation transactions do not need to be synchronous

    Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_buf.c due to plug removal.

    Linus Torvalds
     

26 Mar, 2011

5 commits

  • Now that the buffer cache has its own LRU, we do not need to use
    the page cache to provide persistent caching and reclaim
    infrastructure. Convert the buffer cache to use alloc_pages()
    instead of the page cache. This will remove all the overhead of page
    cache management from setup and teardown of the buffers, as well as
    needing to mark pages accessed as we find buffers in the buffer
    cache.

    By avoiding the page cache, we also remove the need to keep state in
    the page_private(page) field for persistent storage across buffer
    free/buffer rebuild and so all that code can be removed. This also
    fixes the long-standing problem of not having enough bits in the
    page_private field to track all the state needed for a 512
    sector/64k page setup.

    It also removes the need for page locking during reads as the pages
    are unique to the buffer and nobody else will be attempting to
    access them.

    Finally, it removes the buftarg address space lock as a point of
    global contention on workloads that allocate and free buffers
    quickly such as when creating or removing large numbers of inodes in
    parallel. This also removes the 16TB limit on filesystem size on 32 bit
    machines as the page index (32 bit) is no longer used for lookups
    of metadata buffers - the buffer cache is now solely indexed by disk
    address which is stored in a 64 bit field in the buffer.
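
    The 16TB figure follows from simple arithmetic: a 32 bit page index
    can name 2^32 pages, and with 4k pages (an assumption here, the text
    above does not state the page size) that covers 2^44 bytes. A minimal
    sketch of the calculation, using a hypothetical helper name that is
    not part of XFS:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /*
     * Largest byte range reachable through a page-index based lookup.
     *   index_bits: width of the page index (32 on the machines discussed)
     *   page_shift: log2 of the page size (12 for 4k pages; assumed)
     * toy_index_limit_bytes() is a hypothetical helper, not a kernel function.
     */
    static uint64_t toy_index_limit_bytes(unsigned index_bits, unsigned page_shift)
    {
        assert(index_bits + page_shift < 64);   /* result must fit in 64 bits */
        return (uint64_t)1 << (index_bits + page_shift);
    }
    ```

    Calling toy_index_limit_bytes(32, 12) gives 2^44 bytes, i.e. 16TB,
    whereas a 64 bit disk address has no such ceiling in practice.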

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • During mount, we can do a quotacheck that involves a bulkstat pass
    on all inodes. If there are more inodes in the filesystem than can
    be held in memory, we require the inode cache shrinker to run to
    ensure that we don't run out of memory.

    Unfortunately, the inode cache shrinker is not registered until we
    get to the end of the superblock setup process, which is after a
    quotacheck is run if it is needed. Hence we need to register the
    inode cache shrinker earlier in the mount process so that we don't
    OOM during mount. This requires that we also initialise the syncd
    work before we register the shrinker, so we need to juggle that
    around as well.

    While there, make sure that we have set up the block sizes in the
    VFS superblock correctly before the quotacheck is run so that any
    inodes that are cached as a result of the quotacheck have their
    block size fields set up correctly.

    Cc: stable@kernel.org
    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • There is an ABBA deadlock between synchronous inode flushing in
    xfs_reclaim_inode and xfs_ifree_cluster. xfs_ifree_cluster locks the
    buffer, then takes inode ilocks, whilst synchronous reclaim takes
    the ilock followed by the buffer lock in xfs_iflush().

    To avoid this deadlock, separate the inode cluster buffer locking
    semantics from the synchronous inode flush semantics, allowing
    callers to attempt to lock the buffer but still issue synchronous IO
    if they can get the buffer. This requires xfs_iflush() calls that
    currently use non-blocking semantics to pass SYNC_TRYLOCK rather
    than 0 as the flags parameter.

    This allows xfs_reclaim_inode to avoid the deadlock on the buffer
    lock and detect the failure so that it can drop the inode ilock and
    restart the reclaim attempt on the inode. This allows
    xfs_ifree_cluster to obtain the inode lock, mark the inode stale and
    release it and hence defuse the deadlock situation. It also has the
    pleasant side effect of avoiding IO in xfs_reclaim_inode when it
    tries to next reclaim the inode as it is now marked stale.
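
    The trylock-and-back-off pattern described above can be sketched in
    user space. This is a toy model only: the lock type, the function
    names and the return codes are all hypothetical, and C11 atomic flags
    stand in for the real ilock and buffer lock.

    ```c
    #include <stdatomic.h>
    #include <stdbool.h>

    /* Toy lock built on a C11 atomic flag; not a kernel lock type. */
    struct toy_lock {
        atomic_flag held;
    };

    static bool toy_trylock(struct toy_lock *l)
    {
        /* test_and_set returns the old value: false means we got it */
        return !atomic_flag_test_and_set(&l->held);
    }

    static void toy_lock_blocking(struct toy_lock *l)
    {
        while (!toy_trylock(l))
            ;   /* spin until the holder releases the lock */
    }

    static void toy_unlock(struct toy_lock *l)
    {
        atomic_flag_clear(&l->held);
    }

    enum toy_reclaim { TOY_RECLAIM_DONE, TOY_RECLAIM_RETRY };

    /*
     * One reclaim attempt: take the inode lock, then only *try* the
     * buffer lock. On failure drop the inode lock so that the other
     * side (which holds the buffer and wants the inode lock) can make
     * progress, and tell the caller to restart.
     */
    static enum toy_reclaim toy_reclaim_once(struct toy_lock *ilock,
                                             struct toy_lock *buf_lock)
    {
        toy_lock_blocking(ilock);
        if (!toy_trylock(buf_lock)) {
            toy_unlock(ilock);
            return TOY_RECLAIM_RETRY;
        }
        /* ... the synchronous flush would be issued here ... */
        toy_unlock(buf_lock);
        toy_unlock(ilock);
        return TOY_RECLAIM_DONE;
    }
    ```

    Because the reclaim side never sleeps on the buffer lock while
    holding the inode lock, the two lock orders can no longer wait on
    each other indefinitely.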

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • On 32 bit systems, vmalloc space is limited and XFS can chew through
    it quickly as the vmalloc space is lazily freed. This can result in
    failure to map buffers, even when there are apparently large amounts
    of vmalloc space available. Hence, if we fail to map a buffer, purge
    the aliases that have not yet been freed to hopefully free up enough
    vmalloc space to allow a retry to succeed.
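
    The purge-and-retry idea can be illustrated with a toy model of an
    address space that frees regions lazily. Everything below (the
    struct, the function names, the single retry) is a hypothetical
    sketch, not the kernel's vmalloc interface.

    ```c
    #include <stdbool.h>
    #include <stddef.h>

    /* Toy address space: some bytes are free, some are lazily freed. */
    struct toy_vspace {
        size_t free_bytes;   /* usable right now */
        size_t lazy_bytes;   /* released by users, not yet reclaimed */
    };

    /* Try to reserve len bytes; fails if not enough is free right now. */
    static bool toy_map(struct toy_vspace *vs, size_t len)
    {
        if (vs->free_bytes < len)
            return false;
        vs->free_bytes -= len;
        return true;
    }

    /* Reclaim everything that was only lazily freed. */
    static void toy_purge_lazy(struct toy_vspace *vs)
    {
        vs->free_bytes += vs->lazy_bytes;
        vs->lazy_bytes = 0;
    }

    /* Map, and on failure purge the lazy regions and retry once. */
    static bool toy_map_with_retry(struct toy_vspace *vs, size_t len)
    {
        if (toy_map(vs, len))
            return true;
        toy_purge_lazy(vs);
        return toy_map(vs, len);
    }
    ```

    With 4096 bytes free and 65536 bytes lazily freed, a 16384 byte
    request fails on the first attempt and succeeds after the purge.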

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     
  • Preallocation and hole punch transactions are currently synchronous
    and this is causing performance problems in some cases. The
    transactions don't need to be synchronous as we don't need to
    guarantee the preallocation is persistent on disk until an
    fdatasync, fsync or sync operation occurs. If the file is opened
    O_SYNC or O_DSYNC, only then should the transaction be issued
    synchronously.
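
    The decision described above boils down to a test of the open
    flags. A minimal user-space sketch, assuming the POSIX flag names
    O_SYNC and O_DSYNC from <fcntl.h>; the helper name is hypothetical
    and this is not the XFS code itself:

    ```c
    #ifndef _POSIX_C_SOURCE
    #define _POSIX_C_SOURCE 200809L   /* make sure O_DSYNC is declared */
    #endif
    #include <fcntl.h>
    #include <stdbool.h>

    /*
     * Should a preallocation or hole punch be committed synchronously?
     * Only when the file was opened with a synchronous-write flag.
     * toy_prealloc_needs_sync() is a hypothetical helper.
     */
    static bool toy_prealloc_needs_sync(int open_flags)
    {
        return (open_flags & (O_SYNC | O_DSYNC)) != 0;
    }
    ```

    A file opened with plain O_WRONLY gets an asynchronous transaction;
    adding O_SYNC or O_DSYNC to the flags makes it synchronous.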

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     

25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

22 Mar, 2011

1 commit

  • * 'for-linus' of git://oss.sgi.com/xfs/xfs: (23 commits)
    xfs: don't name variables "panic"
    xfs: factor agf counter updates into a helper
    xfs: clean up the xfs_alloc_compute_aligned calling convention
    xfs: kill support/debug.[ch]
    xfs: Convert remaining cmn_err() callers to new API
    xfs: convert the quota debug prints to new API
    xfs: rename xfs_cmn_err_fsblock_zero()
    xfs: convert xfs_fs_cmn_err to new error logging API
    xfs: kill xfs_fs_mount_cmn_err() macro
    xfs: kill xfs_fs_repair_cmn_err() macro
    xfs: convert xfs_cmn_err to xfs_alert_tag
    xfs: Convert xlog_warn to new logging interface
    xfs: Convert linux-2.6/ files to new logging interface
    xfs: introduce new logging API.
    xfs: zero proper structure size for geometry calls
    xfs: enable delaylog by default
    xfs: more sensible inode refcounting for ialloc
    xfs: stop using xfs_trans_iget in the RT allocator
    xfs: check if device support discard in xfs_ioc_trim()
    xfs: prevent leaking uninitialized stack memory in FSGEOMETRY_V1
    ...

    Linus Torvalds
     

17 Mar, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (33 commits)
    AppArmor: kill unused macros in lsm.c
    AppArmor: cleanup generated files correctly
    KEYS: Add an iovec version of KEYCTL_INSTANTIATE
    KEYS: Add a new keyctl op to reject a key with a specified error code
    KEYS: Add a key type op to permit the key description to be vetted
    KEYS: Add an RCU payload dereference macro
    AppArmor: Cleanup make file to remove cruft and make it easier to read
    SELinux: implement the new sb_remount LSM hook
    LSM: Pass -o remount options to the LSM
    SELinux: Compute SID for the newly created socket
    SELinux: Socket retains creator role and MLS attribute
    SELinux: Auto-generate security_is_socket_class
    TOMOYO: Fix memory leak upon file open.
    Revert "selinux: simplify ioctl checking"
    selinux: drop unused packet flow permissions
    selinux: Fix packet forwarding checks on postrouting
    selinux: Fix wrong checks for selinux_policycap_netpeer
    selinux: Fix check for xfrm selinux context algorithm
    ima: remove unnecessary call to ima_must_measure
    IMA: remove IMA imbalance checking
    ...

    Linus Torvalds
     

16 Mar, 2011

1 commit

  • * 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: fix build failure introduced by s/freezeable/freezable/
    workqueue: add system_freezeable_wq
    rds/ib: use system_wq instead of rds_ib_fmr_wq
    net/9p: replace p9_poll_task with a work
    net/9p: use system_wq instead of p9_mux_wq
    xfs: convert to alloc_workqueue()
    reiserfs: make commit_wq use the default concurrency level
    ocfs2: use system_wq instead of ocfs2_quota_wq
    ext4: convert to alloc_workqueue()
    scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path
    scsi/be2iscsi,qla2xxx: convert to alloc_workqueue()
    misc/iwmc3200top: use system_wq instead of dedicated workqueues
    i2o: use alloc_workqueue() instead of create_workqueue()
    acpi: kacpi*_wq don't need WQ_MEM_RECLAIM
    fs/aio: aio_wq isn't used in memory reclaim path
    input/tps6507x-ts: use system_wq instead of dedicated workqueue
    cpufreq: use system_wq instead of dedicated workqueues
    wireless/ipw2x00: use system_wq instead of dedicated workqueues
    arm/omap: use system_wq in mailbox
    workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER

    Linus Torvalds
     

14 Mar, 2011

1 commit

  • The exportfs encode handle function should return the minimum required
    handle size. This helps the user find out the handle size by passing a
    handle size of 0 in the first step and then redoing the call with
    the returned handle size value.
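
    The two-step protocol can be sketched with a toy encoder. The
    constants, the struct and the function name below are hypothetical
    and chosen for illustration; only the calling pattern (probe with
    size 0, then retry with the reported size) comes from the text
    above.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    #define TOY_FILEID_TOO_SMALL 255  /* assumed "buffer too small" marker */
    #define TOY_FILEID_INO_GEN   1    /* assumed handle type */
    #define TOY_HANDLE_WORDS     2    /* inode number + generation */

    struct toy_inode {
        uint32_t ino;
        uint32_t gen;
    };

    /*
     * *max_len is the buffer size in 32 bit words on entry. On return
     * it holds the minimum size required, whether or not the call
     * succeeded, so a caller can probe with *max_len == 0.
     */
    static int toy_encode_fh(const struct toy_inode *ip, uint32_t *fh,
                             int *max_len)
    {
        if (*max_len < TOY_HANDLE_WORDS) {
            *max_len = TOY_HANDLE_WORDS;   /* report what is needed */
            return TOY_FILEID_TOO_SMALL;
        }
        fh[0] = ip->ino;
        fh[1] = ip->gen;
        *max_len = TOY_HANDLE_WORDS;
        return TOY_FILEID_INO_GEN;
    }
    ```

    A caller first passes a length of 0 and a NULL buffer, reads the
    required length back, allocates that many words and calls again.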

    Acked-by: Serge Hallyn
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Al Viro

    Aneesh Kumar K.V
     

12 Mar, 2011

1 commit


10 Mar, 2011

3 commits


08 Mar, 2011

1 commit


07 Mar, 2011

2 commits

  • The remaining functionality in debug.[ch] is effectively just assert
    handling, conditional debug definitions and hex dumping. The hex
    dumping and assert function can be moved into the new printk module,
    while the rest can be moved into top-level header files. This allows
    fs/xfs/support/debug.[ch] to be completely removed from the
    codebase.

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     
  • Convert the files in fs/xfs/linux-2.6/ to use the new xfs_
    logging format that replaces the old Irix inherited cmn_err()
    interfaces. While there, also convert naked printk calls to use the
    relevant xfs logging function to standardise output format.

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

02 Mar, 2011

3 commits

  • Commit 493f3358cb289ccf716c5a14fa5bb52ab75943e5 added this call to
    xfs_fs_geometry() in order to avoid passing kernel stack data back
    to user space:

    + memset(geo, 0, sizeof(*geo));

    Unfortunately, one of the callers of that function passes the
    address of a smaller data type, cast to fit the type that
    xfs_fs_geometry() requires. As a result, this can happen:

    Kernel panic - not syncing: stack-protector: Kernel stack is corrupted
    in: f87aca93

    Pid: 262, comm: xfs_fsr Not tainted 2.6.38-rc6-493f3358cb2+ #1
    Call Trace:

    [] ? panic+0x50/0x150
    [] ? __stack_chk_fail+0x10/0x18
    [] ? xfs_ioc_fsgeometry_v1+0x56/0x5d [xfs]

    Fix this by fixing that one caller to pass the right type and then
    copy out the subset it is interested in.
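
    The shape of the fix can be shown with two toy structures, where
    the smaller one is a prefix of the larger one. All names here are
    hypothetical stand-ins for the real geometry types.

    ```c
    #include <stdint.h>
    #include <string.h>

    /* Old, smaller layout handed back to the caller. */
    struct toy_geom_v1 {
        uint32_t blocksize;
        uint32_t agcount;
    };

    /* Current, larger layout; its first fields match toy_geom_v1. */
    struct toy_geom {
        uint32_t blocksize;
        uint32_t agcount;
        uint32_t logsunit;
    };

    /* Zeroes and fills the *full* structure, like the memset above. */
    static void toy_fs_geometry(struct toy_geom *geo)
    {
        memset(geo, 0, sizeof(*geo));
        geo->blocksize = 4096;
        geo->agcount = 4;
        geo->logsunit = 1;
    }

    /*
     * Fixed caller: pass a correctly sized object to the helper, then
     * copy out only the subset the old interface exposes. Casting a
     * toy_geom_v1 pointer instead would let the memset write past the
     * end of the smaller object.
     */
    static void toy_ioc_geometry_v1(struct toy_geom_v1 *out)
    {
        struct toy_geom full;

        toy_fs_geometry(&full);
        memcpy(out, &full, sizeof(*out));
    }
    ```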

    Note: This patch is an alternative to one originally proposed by
    Eric Sandeen.

    Reported-by: Jeffrey Hundstad
    Signed-off-by: Alex Elder
    Reviewed-by: Eric Sandeen
    Tested-by: Jeffrey Hundstad

    Alex Elder
     
  • Most of the logging infrastructure in XFS is unnecessary and
    designed around the infrastructure supplied by Irix rather than
    Linux. To rationalise the logging interfaces, start by introducing
    simple printk wrappers similar to the dev_printk() infrastructure.
    Later patches will convert code to use this new interface.
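
    As a rough illustration of what such a wrapper looks like, here is
    a user-space sketch that prefixes a level string and a filesystem
    name to a formatted message. The function name, the prefix layout
    and the use of snprintf() in place of printk() are assumptions made
    for the example, not the interface the patch introduces.

    ```c
    #include <stdarg.h>
    #include <stdio.h>
    #include <string.h>   /* strcmp() is handy when checking the output */

    /*
     * Format "<level>XFS (<fsname>): <message>" into out.
     * Returns the number of characters the full message needs, or a
     * negative value on a formatting error. Hypothetical helper.
     */
    static int toy_fs_log(char *out, size_t outlen, const char *level,
                          const char *fsname, const char *fmt, ...)
    {
        va_list ap;
        int n, m;

        n = snprintf(out, outlen, "%sXFS (%s): ", level, fsname);
        if (n < 0 || (size_t)n >= outlen)
            return n;   /* error, or no room left for the message */

        va_start(ap, fmt);
        m = vsnprintf(out + n, outlen - (size_t)n, fmt, ap);
        va_end(ap);

        return m < 0 ? m : n + m;
    }
    ```

    Callers then pass only the level, the filesystem and the message,
    and every line comes out with the same prefix.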

    Signed-off-by: Dave Chinner
    Reviewed-by: Alex Elder
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

23 Feb, 2011

2 commits

  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Right now we are relying on the fact that when we attempt to
    actually do the discard, blkdev_issue_discard() returns -EOPNOTSUPP
    and the user is informed that the device does not support discard.

    However, in the case where we do not hit any suitable free
    extent to trim in FITRIM code, it will finish without any error.
    This is very confusing, because it seems that FITRIM was successful
    even though the device does not actually support discard.

    Solution: Check for discard support before attempting to search for
    free extents.
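
    A toy version of the check, with hypothetical types and names,
    shows why the order matters: without the early return, a device
    with no discard support and no suitable extents would report
    success.

    ```c
    #include <errno.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct toy_dev {
        bool supports_discard;
    };

    struct toy_extent {
        unsigned long start;
        unsigned long len;
    };

    /*
     * Returns the number of free extents that would be trimmed, or
     * -EOPNOTSUPP if the device cannot discard at all. The capability
     * check comes first, so the result no longer depends on whether
     * any extent happens to be large enough.
     */
    static long toy_fitrim(const struct toy_dev *dev,
                           const struct toy_extent *free_ext, size_t count,
                           unsigned long minlen)
    {
        long trimmed = 0;
        size_t i;

        if (!dev->supports_discard)
            return -EOPNOTSUPP;

        for (i = 0; i < count; i++) {
            if (free_ext[i].len >= minlen)
                trimmed++;
        }
        return trimmed;
    }
    ```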

    Signed-off-by: Lukas Czerner
    Signed-off-by: Alex Elder

    Lukas Czerner
     

22 Feb, 2011

1 commit

  • Right now we are relying on the fact that when we attempt to
    actually do the discard, blkdev_issue_discard() returns -EOPNOTSUPP
    and the user is informed that the device does not support discard.

    However, in the case where we do not hit any suitable free
    extent to trim in FITRIM code, it will finish without any error.
    This is very confusing, because it seems that FITRIM was successful
    even though the device does not actually support discard.

    Solution: Check for discard support before attempting to search for
    free extents.

    Signed-off-by: Lukas Czerner
    Signed-off-by: Alex Elder

    Lukas Czerner
     

21 Feb, 2011

1 commit


02 Feb, 2011

1 commit

  • SELinux would like to implement a new labeling behavior of newly created
    inodes. We currently label new inodes based on the parent and the creating
    process. This new behavior would also take into account the name of the
    new object when deciding the new label. This is not the (supposed) full path,
    just the last component of the path.

    This is very useful because creating /etc/shadow is different than creating
    /etc/passwd but the kernel hooks are unable to differentiate these
    operations. We currently require that userspace realize it is doing some
    difficult operation like that and then userspace jumps through SELinux hoops
    to get things set up correctly. This patch does not implement new
    behavior, that is obviously contained in a separate SELinux patch, but it
    does pass the needed name down to the correct LSM hook. If no such name
    exists it is fine to pass NULL.

    Signed-off-by: Eric Paris

    Eric Paris
     

01 Feb, 2011

1 commit

  • Convert from create[_singlethread]_workqueue() to alloc_workqueue().

    * xfsdatad_workqueue and xfsconvertd_workqueue are identity converted.
    Using higher concurrency limit might be useful but given the
    complexity of workqueue usage in xfs, proceeding cautiously seems
    better.

    * xfs_mru_reap_wq is converted to non-ordered workqueue with max
    concurrency of 1 as the work items don't require any specific
    ordering and already have proper synchronization. It seems it was
    singlethreaded to save worker threads, which is no longer a concern.

    Signed-off-by: Tejun Heo
    Cc: Alex Elder
    Cc: xfs-masters@oss.sgi.com
    Cc: Christoph Hellwig

    Tejun Heo
     

28 Jan, 2011

1 commit

  • The extent size hint can be set to larger than an AG. This means
    that the alignment process can push the range to be allocated
    outside the bounds of the AG, resulting in assert failures or
    corrupted bmbt records. Similarly, if the extsize is larger than the
    maximum extent size supported, the alignment process will produce
    extents that are too large to fit into the bmbt records, resulting
    in a different type of assert/corruption failure.

    Fix this by limiting extsize at the time it is set firstly to be
    less than MAXEXTLEN, then to be a maximum of half the size of the
    AGs in the filesystem for non-realtime inodes. Realtime inodes do
    not allocate out of AGs, so don't have to be restricted by the size
    of AGs.
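
    The two limits can be written as a small validation helper. This is
    a sketch under stated assumptions: the 21 bit value used for the
    maximum extent length, the choice to reject rather than clamp, and
    the treatment of the boundary values are all illustrative, and the
    names are hypothetical.

    ```c
    #include <errno.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed maximum extent length in blocks (21 bits). */
    #define TOY_MAXEXTLEN ((uint64_t)0x001fffff)

    /*
     * Validate an extent size hint given in filesystem blocks.
     *   ag_blocks: size of an allocation group in blocks
     *   realtime:  realtime inodes do not allocate out of AGs
     * Returns 0 if acceptable, -EINVAL otherwise.
     */
    static int toy_check_extsize(uint64_t extsize_blocks, uint64_t ag_blocks,
                                 bool realtime)
    {
        if (extsize_blocks > TOY_MAXEXTLEN)
            return -EINVAL;   /* would not fit in an extent record */
        if (!realtime && extsize_blocks > ag_blocks / 2)
            return -EINVAL;   /* alignment could leave the AG */
        return 0;
    }
    ```

    With a 4096 block AG, a hint of 1024 blocks passes, a hint of 4096
    blocks is rejected for a regular inode, and the same hint passes
    for a realtime inode.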

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Alex Elder

    Dave Chinner
     

17 Jan, 2011

2 commits

  • Currently all filesystems except XFS implement fallocate asynchronously,
    while XFS forced a commit. Both of these are suboptimal - in case of O_SYNC
    I/O we really want our allocation on disk, especially for the !KEEP_SIZE
    case where we actually grow the file with user-visible zeroes. On the
    other hand always committing the transaction is a bad idea for fast-path
    uses of fallocate like for example in recent Samba versions. Given
    that block allocation is a data plane operation anyway change it from
    an inode operation to a file operation so that we have the file structure
    available that lets us check for O_SYNC.

    This also includes moving the code around for a few of the filesystems,
    and removing the already unneeded S_ISDIR checks given that we only wire
    up fallocate for regular files.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Instead of various home grown checks that might need updates for new
    flags just check for any bit outside the mask of the features supported
    by the filesystem. This makes the check future proof for any newly
    added flag.
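
    The mask check is a one-liner. The flag values below are toy
    constants defined for the example rather than taken from a kernel
    header, and the helper name is hypothetical.

    ```c
    #include <errno.h>

    /* Toy mode bits; a real filesystem would use the FALLOC_FL_* flags. */
    #define TOY_FL_KEEP_SIZE   0x01
    #define TOY_FL_PUNCH_HOLE  0x02

    /* Everything this (hypothetical) filesystem knows how to handle. */
    #define TOY_FL_SUPPORTED   (TOY_FL_KEEP_SIZE | TOY_FL_PUNCH_HOLE)

    /*
     * Reject any mode bit outside the supported mask. A flag added in
     * the future is refused automatically until it is added to
     * TOY_FL_SUPPORTED.
     */
    static int toy_check_falloc_mode(int mode)
    {
        if (mode & ~TOY_FL_SUPPORTED)
            return -EOPNOTSUPP;
        return 0;
    }
    ```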

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

15 Jan, 2011

1 commit

  • * 'for-linus' of git://oss.sgi.com/xfs/xfs:
    xfs: prevent NMI timeouts in cmn_err
    xfs: Add log level to assertion printk
    xfs: fix an assignment within an ASSERT()
    xfs: fix error handling for synchronous writes
    xfs: add FITRIM support
    xfs: ensure log covering transactions are synchronous
    xfs: serialise unaligned direct IOs
    xfs: factor common write setup code
    xfs: split buffered IO write path from xfs_file_aio_write
    xfs: split direct IO write path from xfs_file_aio_write
    xfs: introduce xfs_rw_lock() helpers for locking the inode
    xfs: factor post-write newsize updates
    xfs: factor common post-write isize handling code
    xfs: ensure sync write errors are returned

    Linus Torvalds
     

14 Jan, 2011

3 commits

  • * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
    block: ensure that completion error gets properly traced
    blktrace: add missing probe argument to block_bio_complete
    block cfq: don't use atomic_t for cfq_group
    block cfq: don't use atomic_t for cfq_queue
    block: trace event block fix unassigned field
    block: add internal hd part table references
    block: fix accounting bug on cross partition merges
    kref: add kref_test_and_get
    bio-integrity: mark kintegrityd_wq highpri and CPU intensive
    block: make kblockd_workqueue smarter
    Revert "sd: implement sd_check_events()"
    block: Clean up exit_io_context() source code.
    Fix compile warnings due to missing removal of a 'ret' variable
    fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
    block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
    cfq-iosched: don't check cfqg in choose_service_tree()
    fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
    cdrom: export cdrom_check_events()
    sd: implement sd_check_events()
    sr: implement sr_check_events()
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (41 commits)
    fs: add documentation on fallocate hole punching
    Gfs2: fail if we try to use hole punch
    Btrfs: fail if we try to use hole punch
    Ext4: fail if we try to use hole punch
    Ocfs2: handle hole punching via fallocate properly
    XFS: handle hole punching via fallocate properly
    fs: add hole punching to fallocate
    vfs: pass struct file to do_truncate on O_TRUNC opens (try #2)
    fix signedness mess in rw_verify_area() on 64bit architectures
    fs: fix kernel-doc for dcache::prepend_path
    fs: fix kernel-doc for dcache::d_validate
    sanitize ecryptfs ->mount()
    switch afs
    move internal-only parts of ncpfs headers to fs/ncpfs
    switch ncpfs
    switch 9p
    pass default dentry_operations to mount_pseudo()
    switch hostfs
    switch affs
    switch configfs
    ...

    Linus Torvalds
     
  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
    Documentation/trace/events.txt: Remove obsolete sched_signal_send.
    writeback: fix global_dirty_limits comment runtime -> real-time
    ppc: fix comment typo singal -> signal
    drivers: fix comment typo diable -> disable.
    m68k: fix comment typo diable -> disable.
    wireless: comment typo fix diable -> disable.
    media: comment typo fix diable -> disable.
    remove doc for obsolete dynamic-printk kernel-parameter
    remove extraneous 'is' from Documentation/iostats.txt
    Fix spelling milisec -> ms in snd_ps3 module parameter description
    Fix spelling mistakes in comments
    Revert conflicting V4L changes
    i7core_edac: fix typos in comments
    mm/rmap.c: fix comment
    sound, ca0106: Fix assignment to 'channel'.
    hrtimer: fix a typo in comment
    init/Kconfig: fix typo
    anon_inodes: fix wrong function name in comment
    fix comment typos concerning "consistent"
    poll: fix a typo in comment
    ...

    Fix up trivial conflicts in:
    - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
    - fs/ext4/ext4.h

    Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.

    Linus Torvalds
     

13 Jan, 2011

1 commit

  • This patch simply allows XFS to handle the hole punching flag in fallocate
    properly. I've tested this with a little program that does a bunch of random
    hole punching with FL_KEEP_SIZE and without it to make sure it does the right
    thing. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     

12 Jan, 2011

4 commits

  • We currently have a global error message buffer in cmn_err that is
    protected by a spin lock that disables interrupts. Recently there
    have been reports of NMI timeouts occurring when the console is
    being flooded by SCSI error reports due to cmn_err() getting stuck
    trying to print to the console while holding this lock (i.e. with
    interrupts disabled). The NMI watchdog is seeing this CPU as
    non-responding and so is triggering a panic. While the trigger for
    the reported case is SCSI errors, pretty much anything that spams
    the kernel log could cause this to occur.

    Realistically the only reason that we have the intermediate message
    buffer is to prepend the correct kernel log level prefix to the log
    message. The only reason we have the lock is to protect the global
    message buffer and the only reason the message buffer is global is
    to keep it off the stack. Hence if we can avoid needing a global
    message buffer we avoid needing the lock, and we can do this with a
    small amount of cleanup and some preprocessor tricks:

    1. clean up xfs_cmn_err() panic mask functionality to avoid
    needing debug code in xfs_cmn_err()
    2. remove the couple of "!" message prefixes that still exist that
    the existing cmn_err() code steps over.
    3. redefine CE_* levels directly to KERN_*
    4. redefine cmn_err() and friends to use printk() directly
    via variable argument length macros.

    By doing this, we can completely remove the cmn_err() code and the
    lock that is causing the problems, and rely solely on printk()
    serialisation to ensure that we don't get garbled messages.
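
    Steps 3 and 4 rely on adjacent string literals being concatenated
    at compile time, so no intermediate buffer is needed to prepend the
    level. A user-space sketch of the trick, with snprintf() standing
    in for printk() and with toy names and level strings that are
    assumptions for the example:

    ```c
    #include <stdio.h>
    #include <string.h>   /* strcmp() is handy when checking the output */

    /* Toy level prefixes; stand-ins for the KERN_* strings. */
    #define TOY_KERN_ALERT    "<1>"
    #define TOY_KERN_WARNING  "<4>"

    /* Step 3: the CE_* levels become the level strings themselves. */
    #define TOY_CE_ALERT  TOY_KERN_ALERT
    #define TOY_CE_WARN   TOY_KERN_WARNING

    /*
     * Step 4: a variadic macro. The first variadic argument must be a
     * string literal format; the compiler pastes the level literal in
     * front of it, so no global buffer and no lock are involved.
     */
    #define toy_cmn_err(buf, len, level, ...) \
        snprintf((buf), (len), level __VA_ARGS__)
    ```

    For example, toy_cmn_err(buf, sizeof(buf), TOY_CE_WARN, "inode %d
    is bad", 7) expands to a single snprintf() call whose format string
    is "<4>inode %d is bad".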

    A series of followup patches is really needed to clean up all the
    cmn_err() calls and related messages properly, but that results in a
    series that is not easily back portable to enterprise kernels. Hence
    this initial fix is only to address the direct problem in the lowest
    impact way possible.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner
     
  • If we get an IO error on a synchronous superblock write, we attach an
    error release function to it so that when the last reference goes away
    the release function is called and the buffer is invalidated and
    unlocked. The buffer is left locked until the release function is
    called so that other concurrent users of the buffer will be locked out
    until the buffer error is fully processed.

    Unfortunately, for the superblock buffer the filesystem itself holds a
    reference to the buffer which prevents the reference count from
    dropping to zero and the release function being called. As a result,
    once an IO error occurs on a sync write, the buffer will never be
    unlocked and all future attempts to lock the buffer will hang.

    To make matters worse, this problem is not unique to such buffers;
    if there is a concurrent _xfs_buf_find() running, the lookup will grab
    a reference to the buffer and then wait on the buffer lock, preventing
    the reference count from ever falling to zero and hence unlocking the
    buffer.

    As such, the whole b_relse function implementation is broken because it
    cannot rely on the buffer reference count falling to zero to unlock the
    errored buffer. The synchronous write error path is the only path that
    uses this callback - it is used to ensure that the synchronous waiter
    gets the buffer error before the error state is cleared from the buffer
    by the release function.

    Given that the only synchronous buffer writes now go through xfs_bwrite
    and the error path in question can only occur for a write of a dirty,
    logged buffer, we can move most of the b_relse processing to happen
    inline in xfs_buf_iodone_callbacks, just like a normal I/O completion.
    In addition to that we make sure the error is not cleared in
    xfs_buf_iodone_callbacks, so that xfs_bwrite can reliably check it.
    Given that xfs_bwrite keeps the buffer locked until it has waited for
    it and checked the error, this allows the error to be reliably
    propagated to the caller, and makes sure that the buffer is reliably
    unlocked.

    Given that xfs_buf_iodone_callbacks was the only instance of the
    b_relse callback we can remove it entirely.

    Based on earlier patches by Dave Chinner and Ajeet Yadav.

    Signed-off-by: Christoph Hellwig
    Reported-by: Ajeet Yadav
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • Allow manual discards from userspace using the FITRIM ioctl. This is not
    intended to be run during normal workloads, as the free space btree walks
    can cause large performance degradation.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Dave Chinner
    Signed-off-by: Alex Elder

    Christoph Hellwig
     
  • To ensure the log is covered and the filesystem idles correctly, we
    need to ensure that dummy transactions hit the disk and do not stay
    pinned in memory. If the superblock is pinned in memory, it can't
    be flushed so the log covering cannot make progress. The result is
    dependent on timing - more often than not we continue to issue a
    log covering transaction every 36s rather than idling after ~90s.

    Fix this by making the log covering transaction synchronous. To
    avoid additional log force from xfssyncd, make the log covering
    transaction take the place of the existing log force in the xfssyncd
    background sync process.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Alex Elder

    Dave Chinner