11 Aug, 2010

3 commits

  • The cgroup device whitelist code gets confused when trying to grant
    permission to a disk partition that is not currently open. Part of
    blkdev_open() includes __blkdev_get() on the whole disk.

    Basically, the only ways to reliably allow a cgroup access to a partition
    on a block device when using the whitelist are to 1) also give it access
    to the whole block device or 2) make sure the partition is already open in
    a different context.

    The patch avoids the cgroup check for the whole disk case when opening a
    partition.

    Addresses https://bugzilla.redhat.com/show_bug.cgi?id=589662

    Signed-off-by: Chris Wright
    Acked-by: Serge E. Hallyn
    Tested-by: Serge E. Hallyn
    Reported-by: Vivek Goyal
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: "Daniel P. Berrange"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wright
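
The behaviour described in the entry above can be illustrated with a small user-space model. This is a sketch, not the kernel code: the whitelist table, `dev_allowed()`, `model_get()`, `model_open_partition()` and the `for_part` parameter are all names assumed for this example. The point it shows is that the permission check is skipped when the whole disk is opened only as a side effect of opening a partition.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device numbers: 8:0 is the whole disk, 8:1 a partition. */
struct dev { int major, minor; };

/* Hypothetical whitelist: this cgroup may only use the partition 8:1. */
static const struct dev whitelist[] = { { 8, 1 } };

static bool dev_allowed(struct dev d)
{
	for (size_t i = 0; i < sizeof(whitelist) / sizeof(whitelist[0]); i++)
		if (whitelist[i].major == d.major && whitelist[i].minor == d.minor)
			return true;
	return false;
}

/*
 * Model of one open step. for_part is true when the whole disk is being
 * opened only because one of its partitions is being opened; in that case
 * the whitelist is not consulted.
 */
static int model_get(struct dev d, bool for_part)
{
	if (!for_part && !dev_allowed(d))
		return -EPERM;
	return 0;
}

/* Opening a partition also opens the whole disk (minor 0 in this model). */
static int model_open_partition(struct dev part)
{
	struct dev whole = { part.major, 0 };
	int ret;

	assert(part.minor != 0);	/* this helper is for partitions only */
	ret = model_get(part, false);
	if (ret)
		return ret;
	return model_get(whole, true);
}
```

In this model a cgroup that is whitelisted only for 8:1 can open the partition, while a direct open of 8:0 is still refused.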
     
  • * 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
    block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
    xen-blkfront: fix missing out label
    blkdev: fix blkdev_issue_zeroout return value
    block: update request stacking methods to support discards
    block: fix missing export of blk_types.h
    writeback: fix bad _bh spinlock nesting
    drbd: revert "delay probes", feature is being re-implemented differently
    drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
    drbd: Disable delay probes for the upcomming release
    writeback: cleanup bdi_register
    writeback: add new tracepoints
    writeback: remove unnecessary init_timer call
    writeback: optimize periodic bdi thread wakeups
    writeback: prevent unnecessary bdi threads wakeups
    writeback: move bdi threads exiting logic to the forker thread
    writeback: restructure bdi forker loop a little
    writeback: move last_active to bdi
    writeback: do not remove bdi from bdi_list
    writeback: simplify bdi code a little
    writeback: do not lose wake-ups in bdi threads
    ...

    Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
    drivers/scsi/scsi_error.c as per Jens.

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits)
    no need for list_for_each_entry_safe()/resetting with superblock list
    Fix sget() race with failing mount
    vfs: don't hold s_umount over close_bdev_exclusive() call
    sysv: do not mark superblock dirty on remount
    sysv: do not mark superblock dirty on mount
    btrfs: remove junk sb_dirt change
    BFS: clean up the superblock usage
    AFFS: wait for sb synchronization when needed
    AFFS: clean up dirty flag usage
    cifs: truncate fallout
    mbcache: fix shrinker function return value
    mbcache: Remove unused features
    add f_flags to struct statfs(64)
    pass a struct path to vfs_statfs
    update VFS documentation for method changes.
    All filesystems that need invalidate_inode_buffers() are doing that explicitly
    convert remaining ->clear_inode() to ->evict_inode()
    Make ->drop_inode() just return whether inode needs to be dropped
    fs/inode.c:clear_inode() is gone
    fs/inode.c:evict() doesn't care about delete vs. non-delete paths now
    ...

    Fix up trivial conflicts in fs/nilfs2/super.c

    Linus Torvalds
     

10 Aug, 2010

3 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • Move the call to vmtruncate to get rid of excessive blocks to the callers
    in preparation for the new truncate sequence and rename the non-truncating
    version to block_write_begin.

    While we're at it also remove several unused arguments to block_write_begin.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Move the call to vmtruncate to get rid of excessive blocks to the callers
    in preparation for the new truncate calling sequence. This was only done
    for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant
    was not needed anyway. Get rid of blockdev_direct_IO_no_locking and
    its _newtrunc variant while at it, as just open-coding the two additional
    parameters is shorter than the name suffix.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

08 Aug, 2010

1 commit

  • The open and release block_device_operations are currently
    called with the BKL held. In order to change that, we must
    first make sure that all drivers that currently rely
    on this have no regressions.

    This blindly pushes the BKL into all .open and .release
    operations for all block drivers to prepare for the
    next step. The drivers can subsequently replace the BKL
    with their own locks or remove it completely when it can
    be shown that it is not needed.

    The functions blkdev_get and blkdev_put are the only
    remaining users of the big kernel lock in the block
    layer, besides a few uses in the ioctl code, none
    of which need to serialize with blkdev_{get,put}.

    Most of these two functions is also under the protection
    of bdev->bd_mutex, including the actual calls to
    ->open and ->release, and the common code does not
    access any global data structures that need the BKL.

    Signed-off-by: Arnd Bergmann
    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Arnd Bergmann
     

05 Aug, 2010

1 commit

  • bd_prepare_to_claim() incorrectly allowed multiple attempts for
    exclusive open to progress in parallel if the attempting holders are
    identical. This triggered BUG_ON() as reported in the following bug.

    https://bugzilla.kernel.org/show_bug.cgi?id=16393

    __bd_abort_claiming() is used to finish claiming blocks and doesn't
    work if multiple openers are inside a claiming block. Allowing
    multiple parallel open attempts to continue doesn't gain anything as
    those are serialized down in the call chain anyway. Fix it by always
    allowing only a single open attempt in a claiming block.

    This problem can easily be reproduced by adding a delay after
    bd_prepare_to_claim() and attempting to mount two partitions of a
    disk.

    stable: only applicable to v2.6.35

    Signed-off-by: Tejun Heo
    Reported-by: Markus Trippelsdorf
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

11 Jun, 2010

3 commits


28 May, 2010

2 commits


22 May, 2010

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits)
    fix handling of offsets in cris eeprom.c, get rid of fake on-stack files
    get rid of home-grown mutex in cris eeprom.c
    switch ecryptfs_write() to struct inode *, kill on-stack fake files
    switch ecryptfs_get_locked_page() to struct inode *
    simplify access to ecryptfs inodes in ->readpage() and friends
    AFS: Don't put struct file on the stack
    Ban ecryptfs over ecryptfs
    logfs: replace inode uid,gid,mode initialization with helper function
    ufs: replace inode uid,gid,mode initialization with helper function
    udf: replace inode uid,gid,mode init with helper
    ubifs: replace inode uid,gid,mode initialization with helper function
    sysv: replace inode uid,gid,mode initialization with helper function
    reiserfs: replace inode uid,gid,mode initialization with helper function
    ramfs: replace inode uid,gid,mode initialization with helper function
    omfs: replace inode uid,gid,mode initialization with helper function
    bfs: replace inode uid,gid,mode initialization with helper function
    ocfs2: replace inode uid,gid,mode initialization with helper function
    nilfs2: replace inode uid,gid,mode initialization with helper function
    minix: replace inode uid,gid,mode init with helper
    ext4: replace inode uid,gid,mode init with helper
    ...

    Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)

    Linus Torvalds
     
  • Currently the way we do freezing is by passing sb->s_bdev to freeze_bdev and
    then letting it do all the work. But freezing is more of an fs thing and
    doesn't really have much to do with the bdev at all; all the work gets done
    with the super. In btrfs we do not populate s_bdev, since we can have
    multiple bdevs for one fs and setting s_bdev makes removing devices from a
    pool kind of tricky. This means that freezing a btrfs filesystem fails,
    which causes corruption with things like tux-on-ice which use the fsfreeze
    mechanism. So instead of populating sb->s_bdev with a random bdev in our
    pool, I've broken the actual fs freezing stuff into freeze_super and
    thaw_super. These just take the super_block that we're freezing and do the
    appropriate work. It's basically just copied and pasted from freeze_bdev.
    I've then converted freeze_bdev over to use the new super helpers. I've
    tested this with ext4 and btrfs and verified everything continues to work
    the same as before.

    The only new gotcha is that multiple calls to the fsfreeze ioctl will return EBUSY if
    the fs is already frozen. I thought this was a better solution than adding a
    freeze counter to the super_block, but if everybody hates this idea I'm open to
    suggestions. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
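
The freeze/thaw behaviour described in the entry above, including the EBUSY on a second freeze, can be sketched in user space as follows. This is a toy model under stated assumptions: `struct model_super`, `model_freeze_super()` and `model_thaw_super()` are illustrative names, the `syncs` counter merely stands in for writing out dirty data, and the -EINVAL returned when thawing an unfrozen filesystem is an assumption of this sketch rather than something the entry states.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy superblock: only the state this sketch needs. */
struct model_super {
	bool frozen;
	int syncs;	/* counts how often dirty data was written out */
};

/* Freeze the filesystem; a second freeze reports -EBUSY. */
static int model_freeze_super(struct model_super *sb)
{
	if (sb->frozen)
		return -EBUSY;
	sb->syncs++;		/* stand-in for syncing the filesystem */
	sb->frozen = true;
	return 0;
}

/* Thaw the filesystem; thawing an unfrozen one is an error (assumed). */
static int model_thaw_super(struct model_super *sb)
{
	if (!sb->frozen)
		return -EINVAL;
	assert(sb->syncs > 0);	/* a frozen fs was synced at least once */
	sb->frozen = false;
	return 0;
}
```

Because the helpers take only the superblock, nothing in the model needs a block device, which is the property that matters for a multi-device filesystem.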
     
  • Signed-off-by: Al Viro

    Al Viro
     

29 Apr, 2010

2 commits


27 Apr, 2010

2 commits

  • Currently, device claiming for exclusive open is done after low level
    open - disk->fops->open() - has completed successfully. This means
    that exclusive open attempts while a device is already exclusively
    open will fail only after disk->fops->open() is called.

    The cdrom driver issues commands during open(), which means that an
    O_EXCL open attempt can unintentionally inject commands into an
    in-progress command stream for burning, thus disturbing the burning
    process. In most cases, this doesn't cause problems because the first
    command to be issued is TUR (TEST UNIT READY), which most devices can
    process in the middle of burning. However, depending on how a device
    replies to TUR during burning, the cdrom driver may end up issuing
    further commands.

    This can't be resolved trivially by moving bd_claim() before doing
    actual open() because that means an open attempt which will end up
    failing could interfere with other legitimate O_EXCL open attempts,
    i.e. unconfirmed open attempts can fail others.

    This patch resolves the problem by introducing a claiming block, which
    is started by bd_start_claiming() and terminated either by bd_claim() or
    bd_abort_claiming(). bd_claim() from inside a claiming block is
    guaranteed to succeed, and once a claiming block is started, other
    bd_start_claiming() or bd_claim() attempts block until the current
    claiming block is terminated.

    bd_claim() can still be used standalone although now it always
    synchronizes against claiming blocks, so the existing users will keep
    working without any change.

    blkdev_open() and open_bdev_exclusive() are converted to use claiming
    blocks so that exclusive open attempts from these functions don't
    interfere with the existing exclusive open.

    This problem was discovered while investigating bko#15403.

    https://bugzilla.kernel.org/show_bug.cgi?id=15403

    The burning problem itself can be resolved by updating userspace
    probing tools to always open w/ O_EXCL.

    Signed-off-by: Tejun Heo
    Reported-by: Matthias-Christian Ott
    Cc: Kay Sievers
    Signed-off-by: Jens Axboe

    Tejun Heo
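
The claiming-block protocol described in the entry above can be modelled as a small state machine. This is a single-threaded sketch, not the kernel implementation: `struct model_bdev` and the `model_*` functions are hypothetical, and where the kernel would put a second opener to sleep the model returns -EAGAIN instead. It also reflects the later fix from 05 Aug, 2010 in that only one opener may be inside a claiming block, even if the holders are identical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a block device's claim state; not the kernel struct. */
struct model_bdev {
	const void *holder;	/* current exclusive holder, NULL if none */
	const void *claiming;	/* opener inside a claiming block, NULL if none */
};

/*
 * Begin a claiming block. The kernel would sleep while another claiming
 * block is open; this single-threaded model reports -EAGAIN instead.
 */
static int model_start_claiming(struct model_bdev *b, const void *who)
{
	if (b->claiming)
		return -EAGAIN;		/* would block until the block ends */
	if (b->holder && b->holder != who)
		return -EBUSY;		/* exclusively held by someone else */
	b->claiming = who;
	return 0;
}

/* Finish a claiming block successfully; this step cannot fail. */
static void model_claim(struct model_bdev *b, const void *who)
{
	assert(b->claiming == who);
	b->holder = who;
	b->claiming = NULL;
}

/* Give up a claiming block, e.g. because the low-level open failed. */
static void model_abort_claiming(struct model_bdev *b, const void *who)
{
	assert(b->claiming == who);
	b->claiming = NULL;
}
```

An opener would call `model_start_claiming()` before the low-level open, then `model_claim()` if the open succeeded or `model_abort_claiming()` if it failed, so a failing open never disturbs an existing exclusive holder.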
     
  • Factor out bd_may_claim() from bd_claim(), add comments and apply a
    couple of cosmetic edits. This is to prepare for further updates to
    claim path.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

25 Apr, 2010

1 commit

  • We are seeing a large regression in database performance on recent
    kernels. The database opens a block device with O_DIRECT|O_SYNC and a
    number of threads write to different regions of the file at the same time.

    A simple test case is below. I haven't defined DEVICE since getting it
    wrong will destroy your data :) On a 3 disk LVM with a 64k chunk size we
    see about 17MB/sec and only a few threads in IO wait:

    procs -----io---- -system-- -----cpu------
     r  b    bi    bo   in   cs us sy id wa st
     0  3     0 16170  656 2259  0  0 86 14  0
     0  2     0 16704  695 2408  0  0 92  8  0
     0  2     0 17308  744 2653  0  0 86 14  0
     0  2     0 17933  759 2777  0  0 89 10  0

    Most threads are blocking in vfs_fsync_range, which has:

    mutex_lock(&mapping->host->i_mutex);
    err = fop->fsync(file, dentry, datasync);
    if (!ret)
            ret = err;
    mutex_unlock(&mapping->host->i_mutex);

    commit 148f948ba877f4d3cdef036b1ff6d9f68986706a (vfs: Introduce new
    helpers for syncing after writing to O_SYNC file or IS_SYNC inode) offers
    some explanation of what is going on:

    Use these new helpers for syncing from generic VFS functions. This makes
    O_SYNC writes to block devices acquire i_mutex for syncing. If we really
    care about this, we can make block_fsync() drop the i_mutex and reacquire
    it before it returns.

    Thanks Jan for such a good commit message! As well as dropping i_mutex,
    Christoph suggests we should remove the call to sync_blockdev():

    > sync_blockdev is an overcomplicated alias for filemap_write_and_wait on
    > the block device inode, which is exactly what we did just before calling
    > into ->fsync

    The patch below incorporates both suggestions. With it the testcase improves
    from 17MB/sec to 68MB/sec:

    procs -----io---- -system-- -----cpu------
     r  b    bi    bo   in   cs us sy id wa st
     0  7     0 65536 1000 3878  0  0 70 30  0
     0 34     0 69632 1016 3921  0  1 46 53  0
     0 57     0 69632 1000 3921  0  0 55 45  0
     0 53     0 69640  754 4111  0  0 81 19  0

    Testcase:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <pthread.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    #define NR_THREADS 64
    #define BUFSIZE (64 * 1024)

    #define DEVICE "/dev/mapper/XXXXXX"

    #define ALIGN(VAL, SIZE) (((VAL)+(SIZE)-1) & ~((SIZE)-1))

    static int fd;

    /* Each thread rewrites its own BUFSIZE region of the device forever. */
    static void *doit(void *arg)
    {
        unsigned long offset = (unsigned long)arg;
        char *b, *buf;

        /* O_DIRECT needs an aligned buffer, so over-allocate and round up. */
        b = malloc(BUFSIZE + 1024);
        if (!b) {
            perror("malloc");
            exit(1);
        }
        buf = (char *)ALIGN((unsigned long)b, 1024);
        memset(buf, 0, BUFSIZE);

        while (1)
            pwrite(fd, buf, BUFSIZE, offset);
    }

    int main(int argc, char *argv[])
    {
        int flags = O_RDWR|O_DIRECT;
        int i;
        unsigned long offset = 0;

        /* Pass "O_SYNC" as the first argument to open with O_SYNC as well. */
        if (argc > 1 && !strcmp(argv[1], "O_SYNC"))
            flags |= O_SYNC;

        fd = open(DEVICE, flags);
        if (fd == -1) {
            perror("open");
            exit(1);
        }

        /* NR_THREADS-1 worker threads plus the main thread itself. */
        for (i = 0; i < NR_THREADS-1; i++) {
            pthread_t tid;
            pthread_create(&tid, NULL, doit, (void *)offset);
            offset += BUFSIZE;
        }
        doit((void *)offset);

        return 0;
    }

    Signed-off-by: Anton Blanchard
    Acked-by: Jan Kara
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
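
The locking change described in the entry above, dropping i_mutex around the flush and retaking it before returning, can be sketched in user space with a pthread mutex. This is only a model under assumptions: `struct model_inode`, `model_issue_flush()` and `model_block_fsync()` are hypothetical names, and the atomic counter stands in for the real cache flush.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Toy block device inode: a lock plus a counter of issued cache flushes. */
struct model_inode {
	pthread_mutex_t i_mutex;
	atomic_int flushes;
};

/* Stand-in for the slow cache flush; it does not need i_mutex. */
static int model_issue_flush(struct model_inode *inode)
{
	atomic_fetch_add(&inode->flushes, 1);
	return 0;
}

/*
 * Called with i_mutex held, as the generic sync path does. The lock is
 * dropped around the flush so that concurrent writers do not serialize on
 * it, then reacquired because the caller expects to unlock it afterwards.
 */
static int model_block_fsync(struct model_inode *inode)
{
	int rc, err;

	rc = pthread_mutex_unlock(&inode->i_mutex);
	assert(rc == 0);	/* the caller must hold the lock on entry */
	(void)rc;

	err = model_issue_flush(inode);

	pthread_mutex_lock(&inode->i_mutex);
	return err;
}
```

The design choice is that the lock state seen by the caller is unchanged, so the generic code that takes and releases the mutex needs no modification.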
     

07 Apr, 2010

2 commits

  • Requested by hch, for consistency now it is exported.

    Cc: Alexander Viro
    Cc: Anton Blanchard
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: Jeff Moyer
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Commit 148f948ba877f4d3cdef036b1ff6d9f68986706a (vfs: Introduce new
    helpers for syncing after writing to O_SYNC file or IS_SYNC inode) broke
    the raw driver.

    We now call through generic_file_aio_write -> generic_write_sync ->
    vfs_fsync_range. vfs_fsync_range has:

    if (!fop || !fop->fsync) {
            ret = -EINVAL;
            goto out;
    }

    But drivers/char/raw.c doesn't set an fsync method.

    We have two options: fix it or remove the raw driver completely. I'm
    happy to do either, the fact this has been broken for so long suggests it
    is rarely used.

    The patch below adds an fsync method to the raw driver. My knowledge of
    the block layer is pretty sketchy so this could do with a once over.

    If we instead decide to remove the raw driver, this patch might still be
    useful as a backport to 2.6.33 and 2.6.32.

    Signed-off-by: Anton Blanchard
    Reviewed-by: Jan Kara
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: Jens Axboe
    Reviewed-by: Jeff Moyer
    Tested-by: Jeff Moyer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     

07 Feb, 2010

1 commit

  • Thanks Thomas and Christoph for testing and review.
    I removed 'smp_wmb()' before up_write from the previous patch,
    since up_write() should have necessary ordering constraints.
    (I.e. the change of s_frozen is visible to others after up_write)
    I'm quite sure the change is harmless but if you are uncomfortable
    with Tested-by/Reviewed-by on the modified patch, please remove them.

    If MS_RDONLY, freeze_bdev should just up_write(s_umount) instead of
    deactivate_locked_super().
    Also, keep sb->s_frozen consistent so that remount can check the frozen state.

    Otherwise a crash reported here can happen:
    http://lkml.org/lkml/2010/1/16/37
    http://lkml.org/lkml/2010/1/28/53

    This patch should be applied for 2.6.32 stable series, too.

    Reviewed-by: Christoph Hellwig
    Tested-by: Thomas Backlund
    Signed-off-by: Jun'ichi Nomura
    Cc: stable@kernel.org
    Signed-off-by: Al Viro

    Jun'ichi Nomura
     

04 Nov, 2009

1 commit


29 Oct, 2009

1 commit

  • Currently there is no barrier support in the block device code. That
    means we cannot guarantee any sort of data integrity when using the
    block device node with disk write caches enabled. Using the raw block
    device node is a typical use case for virtualization (and I assume
    databases, too). This patch changes block_fsync to issue a cache flush
    and thus make fsync on block device nodes actually useful.

    Note that in mainline we would also need to add such code to the
    ->aio_write method for O_SYNC handling, but assuming that Jan's patch
    series for the O_SYNC rewrite goes in it will also call into ->fsync
    for 2.6.32.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

26 Oct, 2009

1 commit

  • commit 0762b8bde9729f10f8e6249809660ff2ec3ad735
    (from 14 months ago) introduced a use-after-free bug which has just
    recently started manifesting in my md testing.
    I tried git bisect to find out what caused the bug to start
    manifesting, and it could have been the recent change to
    blk_unregister_queue (48c0d4d4c04) but the results were inconclusive.

    This patch certainly fixes my symptoms and looks correct as the two
    calls are now in the same order as elsewhere in that function.

    Signed-off-by: NeilBrown
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Neil Brown
     

24 Sep, 2009

2 commits

  • Currently we hold s_umount while a filesystem is frozen, even though we
    might return to userspace and unlock it from a different process. Instead,
    grab an active reference to keep the file system busy and add an explicit
    check for frozen filesystems in remount and reject the remount instead
    of blocking on s_umount.

    Add a new get_active_super helper to super.c for use by freeze_bdev that
    grabs an active reference to a superblock from a given block device.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Now that we have the freeze count there is not much reason for bd_mount_sem
    anymore. The actual freeze/thaw operations are serialized using the
    bd_fsfreeze_mutex, and the only other place we take bd_mount_sem is
    get_sb_bdev which tries to prevent mounting a filesystem while the block
    device is frozen. Instead, add a check for bd_fsfreeze_count and
    return -EBUSY if a filesystem is frozen. While that is a change in
    user-visible behaviour, a failing mount is much better in this case
    than having the mount process stuck uninterruptible for a long time.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

22 Sep, 2009

1 commit


16 Sep, 2009

1 commit

  • It has been unused since it was introduced in:

    commit 520808bf20e90fdbdb320264ba7dd5cf9d47dcac
    Author: Andrew Morton
    Date: Fri May 21 00:46:17 2004 -0700

    [PATCH] block device layer: separate backing_dev_info infrastructure

    So let's just kill it.

    Acked-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jens Axboe
     

14 Sep, 2009

1 commit

  • generic_file_aio_write_nolock() is now used only by block devices and the
    raw character device. Filesystems should use __generic_file_aio_write() in
    case generic_file_aio_write() doesn't suit them. So rename the function to
    blkdev_aio_write() and move it to fs/block_dev.c.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Christoph Hellwig
     

30 Jul, 2009

1 commit

  • Create bdgrab(). This function copies an existing reference to a
    block_device. It is safe to call from any context.

    Hibernation code wishes to copy a reference to the active swap device.
    Right now it calls bdget() under a spinlock, but this is wrong because
    bdget() can sleep. It doesn't need a full bdget() because we already
    hold a reference to active swap devices (and the spinlock protects
    against swapoff).

    Fixes http://bugzilla.kernel.org/show_bug.cgi?id=13827

    Signed-off-by: Alan Jenkins
    Signed-off-by: Rafael J. Wysocki

    Alan Jenkins
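
The distinction drawn in the entry above, copying a reference the caller already holds versus performing a full lookup that may sleep, can be illustrated with a plain atomic reference count. This is a sketch under assumptions: `struct model_ref`, `model_grab()` and `model_put()` are hypothetical names and do not reproduce the kernel's data structures.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy reference-counted object standing in for a block device. */
struct model_ref {
	atomic_int count;
};

/*
 * Copy a reference the caller already owns. This is only an atomic
 * increment: no lookup, no allocation, nothing that could sleep.
 */
static struct model_ref *model_grab(struct model_ref *obj)
{
	int old = atomic_fetch_add(&obj->count, 1);

	assert(old > 0);	/* the caller must already hold a reference */
	(void)old;
	return obj;
}

/* Drop one reference; returns 1 when the last reference went away. */
static int model_put(struct model_ref *obj)
{
	return atomic_fetch_sub(&obj->count, 1) == 1;
}
```

Because the grab is a single atomic operation, it is safe in contexts where sleeping is not allowed, such as under a spinlock.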
     

12 Jun, 2009

5 commits

  • Rename the function so that it better describes what it really does. Also
    remove the unnecessary include of buffer_head.h.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • It is unnecessarily fragile to have two places (fsync_super() and do_sync())
    doing data integrity sync of the filesystem. Alter __fsync_super() to
    accommodate needs of both callers and use it. So after this patch
    __fsync_super() is the only place where we gather all the calls needed to
    properly send all data on a filesystem to disk.

    A nice bonus is that we get complete livelock avoidance, and write_supers()
    is now only used for periodic writeback of superblocks.

    sync_blockdevs() introduced a couple of patches ago is gone now.

    [build fixes folded]

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • __fsync_super() does the same thing as fsync_super(). So change the only
    caller to use fsync_super() and make __fsync_super() static. This removes
    an unnecessarily duplicated call to sync_blockdev() and prepares the
    ground for the changes to __fsync_super() in the following patches.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • * 'for-linus' of git://linux-arm.org/linux-2.6:
    kmemleak: Add the corresponding MAINTAINERS entry
    kmemleak: Simple testing module for kmemleak
    kmemleak: Enable the building of the memory leak detector
    kmemleak: Remove some of the kmemleak false positives
    kmemleak: Add modules support
    kmemleak: Add kmemleak_alloc callback from alloc_large_system_hash
    kmemleak: Add the vmalloc memory allocation/freeing hooks
    kmemleak: Add the slub memory allocation/freeing hooks
    kmemleak: Add the slob memory allocation/freeing hooks
    kmemleak: Add the slab memory allocation/freeing hooks
    kmemleak: Add documentation on the memory leak detector
    kmemleak: Add the base support

    Manual conflict resolution (with the slab/earlyboot changes) in:
    drivers/char/vt.c
    init/main.c
    mm/slab.c

    Linus Torvalds
     
  • There are allocations for which the main pointer cannot be found but
    they are not memory leaks. This patch fixes some of them. For more
    information on false positives, see Documentation/kmemleak.txt.

    Signed-off-by: Catalin Marinas

    Catalin Marinas
     

05 Jun, 2009

1 commit

  • This reverts commit db2dbb12dc47a50c7a4c5678f526014063e486f6.

    It apparently causes problems with partition table read-ahead
    on archs with large page sizes. Until that problem is diagnosed
    further, just drop the readpages support on block devices.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

23 May, 2009

1 commit

  • Until now we have had a 1:1 mapping between storage device physical
    block size and the logical block size used when addressing the device.
    With SATA 4KB drives coming out that will no longer be the case. The
    sector size will be 4KB but the logical block size will remain
    512 bytes. Hence we need to distinguish between the physical block size
    and the logical block size.

    This patch renames hardsect_size to logical_block_size.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
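
The arithmetic behind the physical/logical distinction in the entry above can be worked through with a short example. This is a sketch with assumed names: `struct model_limits` and both helpers are hypothetical and only illustrate how a 4KB physical block relates to 512-byte logical blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limits for a 4KB-sector drive addressed in 512-byte units. */
struct model_limits {
	uint32_t logical_block_size;	/* smallest addressable unit, bytes */
	uint32_t physical_block_size;	/* size of one on-disk sector, bytes */
};

/* Number of logical blocks that make up one physical block. */
static uint32_t logical_per_physical(const struct model_limits *l)
{
	assert(l->logical_block_size != 0);
	assert(l->physical_block_size % l->logical_block_size == 0);
	return l->physical_block_size / l->logical_block_size;
}

/* True if a request starting at logical block lba covers whole physical blocks. */
static bool io_is_physically_aligned(const struct model_limits *l,
				     uint64_t lba, uint32_t nr_blocks)
{
	uint32_t ratio = logical_per_physical(l);

	return lba % ratio == 0 && nr_blocks % ratio == 0;
}
```

With 512-byte logical blocks on a 4KB physical block the ratio is 8, so a request starting at logical block 63 (a traditional partition start) does not line up with the physical sectors.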