29 Oct, 2011

5 commits

  • * 'for-linus' of git://ceph.newdream.net/git/ceph-client:
    libceph: fix double-free of page vector
    ceph: fix 32-bit ino numbers
    libceph: force resend of osd requests if we skip an osdmap
    ceph: use kernel DNS resolver
    ceph: fix ceph_monc_init memory leak
    ceph: let the set_layout ioctl set single traits
    Revert "ceph: don't truncate dirty pages in invalidate work thread"
    ceph: replace leading spaces with tabs
    libceph: warn on msg allocation failures
    libceph: don't complain on msgpool alloc failures
    libceph: always preallocate mon connection
    libceph: create messenger with client
    ceph: document ioctls
    ceph: implement (optional) max read size
    ceph: rename rsize -> rasize
    ceph: make readpages fully async

    Linus Torvalds
     
  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue: (21 commits)
    leases: fix write-open/read-lease race
    nfs: drop unnecessary locking in llseek
    ext4: replace cut'n'pasted llseek code with generic_file_llseek_size
    vfs: add generic_file_llseek_size
    vfs: do (nearly) lockless generic_file_llseek
    direct-io: merge direct_io_walker into __blockdev_direct_IO
    direct-io: inline the complete submission path
    direct-io: separate map_bh from dio
    direct-io: use a slab cache for struct dio
    direct-io: rearrange fields in dio/dio_submit to avoid holes
    direct-io: fix a wrong comment
    direct-io: separate fields only used in the submission path from struct dio
    vfs: fix spinning prevention in prune_icache_sb
    vfs: add a comment to inode_permission()
    vfs: pass all mask flags check_acl and posix_acl_permission
    vfs: add hex format for MAY_* flag values
    vfs: indicate that the permission functions take all the MAY_* flags
    compat: sync compat_stats with statfs.
    vfs: add "device" tag to /proc/self/mountstats
    cleanup: vfs: small comment fix for block_invalidatepage
    ...

    Fix up trivial conflict in fs/gfs2/file.c (llseek changes)

    Linus Torvalds
     
  • * http://sucs.org/~rohan/git/gfs2-3.0-nmw: (24 commits)
    GFS2: Move readahead of metadata during deallocation into its own function
    GFS2: Remove two unused variables
    GFS2: Misc fixes
    GFS2: rewrite fallocate code to write blocks directly
    GFS2: speed up delete/unlink performance for large files
    GFS2: Fix off-by-one in gfs2_blk2rgrpd
    GFS2: Clean up ->page_mkwrite
    GFS2: Correctly set goal block after allocation
    GFS2: Fix AIL flush issue during fsync
    GFS2: Use cached rgrp in gfs2_rlist_add()
    GFS2: Call do_strip() directly from recursive_scan()
    GFS2: Remove obsolete assert
    GFS2: Cache the most recently used resource group in the inode
    GFS2: Make resource groups "append only" during life of fs
    GFS2: Use rbtree for resource groups and clean up bitmap buffer ref count scheme
    GFS2: Fix lseek after SEEK_DATA, SEEK_HOLE have been added
    GFS2: Clean up gfs2_create
    GFS2: Use ->dirty_inode()
    GFS2: Fix bug trap and journaled data fsync
    GFS2: Fix inode allocation error path
    ...

    Linus Torvalds
     
  • * '3.2-without-smb2' of git://git.samba.org/sfrench/cifs-2.6: (52 commits)
    Fix build break when freezer not configured
    Add definition for share encryption
    CIFS: Make cifs_push_locks send as many locks at once as possible
    CIFS: Send as many mandatory unlock ranges at once as possible
    CIFS: Implement caching mechanism for posix brlocks
    CIFS: Implement caching mechanism for mandatory brlocks
    CIFS: Fix DFS handling in cifs_get_file_info
    CIFS: Fix error handling in cifs_readv_complete
    [CIFS] Fixup trivial checkpatch warning
    [CIFS] Show nostrictsync and noperm mount options in /proc/mounts
    cifs, freezer: add wait_event_freezekillable and have cifs use it
    cifs: allow cifs_max_pending to be readable under /sys/module/cifs/parameters
    cifs: tune bdi.ra_pages in accordance with the rsize
    cifs: allow for larger rsize= options and change defaults
    cifs: convert cifs_readpages to use async reads
    cifs: add cifs_async_readv
    cifs: fix protocol definition for READ_RSP
    cifs: add a callback function to receive the rest of the frame
    cifs: break out 3rd receive phase into separate function
    cifs: find mid earlier in receive codepath
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://oss.sgi.com/xfs/xfs: (69 commits)
    xfs: add AIL pushing tracepoints
    xfs: put in missed fix for merge problem
    xfs: do not flush data workqueues in xfs_flush_buftarg
    xfs: remove XFS_bflush
    xfs: remove xfs_buf_target_name
    xfs: use xfs_ioerror_alert in xfs_buf_iodone_callbacks
    xfs: clean up xfs_ioerror_alert
    xfs: clean up buffer allocation
    xfs: remove buffers from the delwri list in xfs_buf_stale
    xfs: remove XFS_BUF_STALE and XFS_BUF_SUPER_STALE
    xfs: remove XFS_BUF_SET_VTYPE and XFS_BUF_SET_VTYPE_REF
    xfs: remove XFS_BUF_FINISH_IOWAIT
    xfs: remove xfs_get_buftarg_list
    xfs: fix buffer flushing during unmount
    xfs: optimize fsync on directories
    xfs: reduce the number of log forces from tail pushing
    xfs: Don't allocate new buffers on every call to _xfs_buf_find
    xfs: simplify xfs_trans_ijoin* again
    xfs: unlock the inode before log force in xfs_change_file_space
    xfs: unlock the inode before log force in xfs_fs_nfs_commit_metadata
    ...

    Linus Torvalds
     

28 Oct, 2011

20 commits

  • In setlease, we use i_writecount to decide whether we can give out a
    read lease.

    In open, we break leases before incrementing i_writecount.

    There is therefore a window between the break lease and the i_writecount
    increment when setlease could add a new read lease.

    This would leave us with a simultaneous write open and read lease, which
    shouldn't happen.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Christoph Hellwig

    J. Bruce Fields
     
  • This makes NFS follow the standard generic_file_llseek locking scheme.

    Cc: Trond.Myklebust@netapp.com
    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • This gives ext4 the benefits of unlocked llseek.

    Cc: tytso@mit.edu
    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • Add a generic_file_llseek variant to the VFS that allows passing in
    the maximum file size of the file system, instead of always
    using maxbytes from the superblock.

    This can be used to eliminate some cut'n'paste seek code in ext4.

    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • The i_mutex lock use of generic_file_llseek hurts. Independent processes
    accessing the same file synchronize over a single lock, even though
    they have no need for synchronization at all.

    Under high utilization this can cause llseek to scale very poorly on larger
    systems.

    This patch does some rethinking of the llseek locking model:

    First the 64bit f_pos is not necessarily atomic without locks
    on 32bit systems. This can already cause races with read() today.
    This was discussed on linux-kernel in the past and deemed acceptable.
    The patch does not change that.

    Let's look at the different seek variants:

    SEEK_SET: Doesn't really need any locking.
    If there's a race one writer wins, the other loses.

    For 32bit the non atomic update races against read()
    stay the same. Without a lock they can also happen
    against write() now. The read() race was deemed
    acceptable in past discussions, and I think if it's
    ok for read it's ok for write too.

    => Don't need a lock.

    SEEK_END: This behaves like SEEK_SET plus it reads
    the maximum size too. Reading the maximum size would have the
    32bit atomic problem. But luckily we already have a way to read
    the maximum size without locking (i_size_read), so we
    can just use that instead.

    Without i_mutex there is no synchronization with write() anymore,
    however since the write() update is atomic on 64bit it just behaves
    like another racy SEEK_SET. On non atomic 32bit it's the same
    as SEEK_SET.

    => Don't need a lock, but need to use i_size_read()

    SEEK_CUR: This has a read-modify-write race window
    on the same file. One could argue that any application
    doing unsynchronized seeks on the same file is already broken.
    But for the sake of not adding a regression here I'm
    using the file->f_lock to synchronize this. Using this
    lock is much better than the inode mutex because it doesn't
    synchronize between processes.

    => Still need a lock, but can use f_lock.

    This patch implements this new scheme in generic_file_llseek.
    I dropped generic_file_llseek_unlocked and changed all callers.

    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
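
The three cases described above can be sketched in userspace C. This is a minimal illustration, not the kernel implementation: `toy_inode`, `toy_file`, and `toy_llseek` are hypothetical names, a pthread mutex stands in for f_lock, and a C11 atomic load stands in for i_size_read().

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical stand-ins for struct inode and struct file. */
struct toy_inode {
	_Atomic int64_t size;	/* readable without a lock */
};

struct toy_file {
	struct toy_inode *inode;
	int64_t pos;		/* stands in for f_pos */
	pthread_mutex_t lock;	/* per-open-file lock, stands in for f_lock */
};

/* Returns the new offset, or -1 if the request is invalid. */
static int64_t toy_llseek(struct toy_file *f, int64_t offset, int whence)
{
	int64_t newpos;

	assert(f && f->inode);
	switch (whence) {
	case SEEK_SET:
		/* No lock: if two seeks race, one of them simply wins. */
		newpos = offset;
		break;
	case SEEK_END:
		/* No lock: only the size read has to be safe. */
		newpos = atomic_load(&f->inode->size) + offset;
		break;
	case SEEK_CUR:
		/* Read-modify-write of pos: serialize per file, not per inode. */
		pthread_mutex_lock(&f->lock);
		newpos = f->pos + offset;
		if (newpos >= 0)
			f->pos = newpos;
		pthread_mutex_unlock(&f->lock);
		return newpos >= 0 ? newpos : -1;
	default:
		return -1;
	}
	if (newpos < 0)
		return -1;
	f->pos = newpos;	/* deliberately unlocked, as argued above */
	return newpos;
}
```

Because the mutex lives in the per-open-file structure, two processes that opened the same file separately never contend on it, which is the scaling benefit the message describes.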
     
  • This doesn't change anything for the compiler, but hch thought it would
    make the code clearer.

    I moved the reference counting into its own little inline.

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • Add inlines to all the submission path functions. While this increases
    code size it also gives gcc a lot of optimization opportunities
    in this critical hotpath.

    In particular -- together with some other changes -- this
    allows gcc to get rid of the unnecessary clearing of
    sdio at the beginning and optimize the messy parameter passing.
    Leaving any function that takes an sdio parameter out of line
    would break this optimization, because it cannot be done once the
    address of the structure is taken.

    Note that benefits are only seen with CONFIG_OPTIMIZE_INLINING
    and CONFIG_CC_OPTIMIZE_FOR_SIZE both set to off.

    This gives about 2.2% improvement on a large database benchmark
    with a high IOPS rate.

    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • Only a single b_private field in the map_bh buffer head is needed after
    the submission path. Move map_bh separately to avoid storing
    this information in the long term slab.

    This avoids the weird 104 byte hole in struct dio_submit, which also had
    to be cleared with memset early.

    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • A direct slab call is slightly faster than kmalloc and can be better cached
    per CPU. It also avoids rounding to the next kmalloc slab.

    In addition this enforces cache line alignment for struct dio to avoid
    any false sharing.

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Signed-off-by: Christoph Hellwig

    Andi Kleen
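
The rounding effect mentioned above can be shown with a little arithmetic. This sketch assumes, as a simplification, that a general-purpose allocator rounds requests up to power-of-two size classes, that a dedicated cache only rounds up to its alignment, and that a cache line is 64 bytes; the 400-byte object size used in the note below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Size served by an allocator with power-of-two size classes (assumption). */
static size_t round_up_pow2(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Size served by a dedicated cache that only pads to its alignment. */
static size_t round_up_to(size_t n, size_t align)
{
	assert(align != 0);
	return (n + align - 1) / align * align;
}
```

For a 400-byte object, the power-of-two class would hand out 512 bytes, while a 64-byte-aligned dedicated cache would use 448.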
     
  • Fix most problems reported by pahole.

    There is still a weird 104 byte hole after map_bh. I'm not sure what
    causes this.

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Signed-off-by: Christoph Hellwig

    Andi Kleen
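
The kind of hole that pahole reports comes from alignment padding between fields. A small sketch, using two hypothetical structs that are unrelated to the real struct dio layout:

```c
#include <assert.h>
#include <stddef.h>

/* Small fields interleaved with larger ones leave padding holes. */
struct with_holes {
	char flag;
	long count;
	char mode;
	void *ptr;
};

/* Same fields, largest first: the small ones pack together at the end. */
struct reordered {
	void *ptr;
	long count;
	char flag;
	char mode;
};

/* Reordering must never make the struct larger. */
static_assert(sizeof(struct reordered) <= sizeof(struct with_holes),
	      "reordering must not grow the struct");

/* Bytes of padding directly after the first field of with_holes. */
static size_t hole_after_flag(void)
{
	return offsetof(struct with_holes, count) - sizeof(char);
}
```

On a typical LP64 target the first layout takes 32 bytes and the second 24, but the exact numbers depend on the ABI.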
     
  • There's nothing on the stack, even before my changes.

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • This large, but largely mechanical, patch moves all fields in struct dio
    that are only used in the submission path into a separate on stack
    data structure. This has the advantage that the memory is very likely
    cache hot, which is not guaranteed for memory fresh out of kmalloc.

    This also gives gcc more optimization potential because it can more easily
    determine that there are no external aliases for these variables.

    The sdio initialization is now an initializer instead of a memset.
    This allows gcc to break sdio into individual fields and optimize
    away unnecessary zeroing (after all the functions are inlined).

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Signed-off-by: Christoph Hellwig

    Andi Kleen
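
The difference between the two initialization styles can be sketched as follows. `toy_submit` and its fields are hypothetical and much smaller than the real submission state; both functions compute the same result, and the point is only how the on-stack struct is set up.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical submission-only state kept on the stack. */
struct toy_submit {
	unsigned int blkbits;
	unsigned long block_in_file;
	int pages_in_io;
	int boundary;
};

static unsigned long first_byte(const struct toy_submit *s)
{
	assert(s->blkbits < 8 * sizeof(unsigned long));
	return s->block_in_file << s->blkbits;
}

/* Old style: zero the whole struct through its address, then fill it in. */
static unsigned long start_with_memset(unsigned int blkbits, unsigned long block)
{
	struct toy_submit s;

	memset(&s, 0, sizeof(s));
	s.blkbits = blkbits;
	s.block_in_file = block;
	return first_byte(&s);
}

/* New style: an initializer; fields not named are zeroed implicitly. */
static unsigned long start_with_initializer(unsigned int blkbits, unsigned long block)
{
	struct toy_submit s = { .blkbits = blkbits, .block_in_file = block };

	return first_byte(&s);
}
```

With the initializer and full inlining, the compiler is free to keep each field in a register and drop the zeroing of fields that are never read.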
     
  • We need to move the inode to the end of the list to actually make the
    spinning prevention explained in the comment above it work. With a
    plain list_move it will simply stay in place as we're always reclaiming
    from the head of the list.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
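
Why a plain list_move leaves the entry in place can be seen with a minimal userspace re-implementation of a circular doubly linked list. The helper names mirror the kernel's list API, but this is a sketch written for illustration, not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

static void list_insert(struct list_head *e, struct list_head *prev,
			struct list_head *next)
{
	next->prev = e;
	e->next = next;
	e->prev = prev;
	prev->next = e;
}

static void list_del_entry(struct list_head *e)
{
	assert(e->next != e);	/* never unlink an empty list head */
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

static void list_add_tail(struct list_head *e, struct list_head *head)
{
	list_insert(e, head->prev, head);
}

/* Re-inserts e right after head, i.e. at the front of the list. */
static void list_move(struct list_head *e, struct list_head *head)
{
	list_del_entry(e);
	list_insert(e, head, head->next);
}

/* Re-inserts e right before head, i.e. at the end of the list. */
static void list_move_tail(struct list_head *e, struct list_head *head)
{
	list_del_entry(e);
	list_insert(e, head->prev, head);
}
```

If the scanner always takes `head->next`, calling `list_move` on that entry puts it straight back at the front, so the next iteration sees the same entry again; `list_move_tail` sends it behind everything else.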
     
  • Acked-by: J. Bruce Fields
    Acked-by: David Howells
    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Christoph Hellwig

    Andreas Gruenbacher
     
  • Acked-by: J. Bruce Fields
    Acked-by: David Howells
    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Christoph Hellwig

    Andreas Gruenbacher
     
  • Acked-by: J. Bruce Fields
    Acked-by: David Howells
    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Christoph Hellwig

    Andreas Gruenbacher
     
  • This was found by inspection while tracking a similar
    bug in compat_statfs64 that has been fixed in mainline
    since December.

    - This fixes a bug where not all of the f_spare fields
    were cleared on mips and s390.
    - Add the f_flags field to struct compat_statfs
    - Copy f_flags to userspace in case someone cares.
    - Use __clear_user to clear the f_spare field in userspace
    to ensure that all of the elements of f_spare are cleared.
    On some architectures f_spare has 5 ints and on others
    it only has 4, which makes the previous technique of
    clearing each int individually broken.

    I don't expect anyone actually uses the old statfs system
    call anymore but if they do let them benefit from having
    the compat and the native version working the same.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Christoph Hellwig

    Eric W. Biederman
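
The f_spare problem can be reproduced in miniature. In this sketch `toy_statfs` is a hypothetical struct with a five-element spare array, and plain `memset` stands in for `__clear_user`; the fragile version hard-codes four elements the way per-int clearing would.

```c
#include <assert.h>
#include <string.h>

#define TOY_SPARE_INTS 5	/* hypothetical; another ABI might use 4 */

struct toy_statfs {
	int f_flags;
	int f_spare[TOY_SPARE_INTS];
};

/* Fragile: written for a four-element array, misses the fifth element. */
static void clear_spare_by_index(struct toy_statfs *s)
{
	assert(s != NULL);
	s->f_spare[0] = 0;
	s->f_spare[1] = 0;
	s->f_spare[2] = 0;
	s->f_spare[3] = 0;
}

/* Robust: clears however many elements the array has on this ABI. */
static void clear_spare_by_size(struct toy_statfs *s)
{
	assert(s != NULL);
	memset(s->f_spare, 0, sizeof(s->f_spare));
}

/* Returns 1 if every spare element is zero, 0 otherwise. */
static int spare_is_clear(const struct toy_statfs *s)
{
	for (size_t i = 0; i < TOY_SPARE_INTS; i++)
		if (s->f_spare[i] != 0)
			return 0;
	return 1;
}
```

Sizing the clear from the array itself keeps the code correct on both the 4-int and 5-int layouts without per-architecture special cases.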
     
  • nfsiostat was failing to find mounted filesystems on kernels after
    2.6.38 because of changes to show_vfsstat() by commit
    c7f404b40a3665d9f4e9a927cc5c1ee0479ed8f9. This patch adds back the
    "device" tag before the nfs server entry so scripts can parse the
    mountstats file correctly.

    Signed-off-by: Bryan Schumaker
    CC: stable@kernel.org [>=2.6.39]
    Signed-off-by: Christoph Hellwig

    Bryan Schumaker
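
A script-side parser that depends on the "device" tag might look like the sketch below. The exact line layout ("device <dev> mounted on <dir> with fstype <type>") is an assumption about the mountstats format, and `parse_mountstats_line` and `is_nfs_mount` are hypothetical helpers, not part of nfsiostat.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parses one mountstats header line. Returns 1 and fills the buffers
 * on success, 0 if the line does not start with the "device" tag.
 */
static int parse_mountstats_line(const char *line, char dev[256],
				 char dir[256], char type[64])
{
	assert(line != NULL);
	return sscanf(line, "device %255s mounted on %255s with fstype %63s",
		      dev, dir, type) == 3;
}

/* Returns 1 if the line describes an NFS mount (fstype starts with "nfs"). */
static int is_nfs_mount(const char *line)
{
	char dev[256], dir[256], type[64];

	if (!parse_mountstats_line(line, dev, dir, type))
		return 0;
	return strncmp(type, "nfs", 3) == 0;
}
```

A line that lacks the leading tag fails to match at the first token, which is consistent with the tool finding no mounted file systems at all.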
     
  • The patch is against 3.1-rc3.

    Signed-off-by: Wang Sheng-Hui
    Signed-off-by: Christoph Hellwig

    Wang Sheng-Hui
     
  • Samba supports a setfs info level to negotiate encrypted
    shares. This patch adds the defines so we recognize
    this info level. Later patches will add the enablement
    for it.

    Acked-by: Jeremy Allison
    Signed-off-by: Steve French

    Steve French
     

27 Oct, 2011

2 commits

  • In my last patch I made a stupid mistake and broke the exofs
    compilation completely. Fix it ASAP.

    Instead of obj-y I did obj-$(y)

    Really Really sorry. Me totally blushing :-{|

    Signed-off-by: Boaz Harrosh
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     
  • * 'for-linus' of git://git.open-osd.org/linux-open-osd: (21 commits)
    ore: Enable RAID5 mounts
    exofs: Support for RAID5 read-4-write interface.
    ore: RAID5 Write
    ore: RAID5 read
    fs/Makefile: Always inspect exofs/
    ore: Make ore_calc_stripe_info EXPORT_SYMBOL
    ore/exofs: Change ore_check_io API
    ore/exofs: Define new ore_verify_layout
    ore: Support for partial component table
    ore: Support for short read/writes
    exofs: Support for short read/writes
    ore: Remove check for ios->kern_buff in _prepare_for_striping to later
    ore: cleanup: Embed an ore_striping_info inside ore_io_state
    ore: Only IO one group at a time (API change)
    ore/exofs: Change the type of the devices array (API change)
    ore: Make ore_striping_info and ore_calc_stripe_info public
    exofs: Remove unused data_map member from exofs_sb_info
    exofs: Rename struct ore_components comps => oc
    exofs/super.c: local functions should be static
    exofs/ore.c: local functions should be static
    ...

    Linus Torvalds
     

26 Oct, 2011

13 commits

  • * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
    time, s390: Get rid of compile warning
    dw_apb_timer: constify clocksource name
    time: Cleanup old CONFIG_GENERIC_TIME references that snuck in
    time: Change jiffies_to_clock_t() argument type to unsigned long
    alarmtimers: Fix error handling
    clocksource: Make watchdog reset lockless
    posix-cpu-timers: Cure SMP accounting oddities
    s390: Use direct ktime path for s390 clockevent device
    clockevents: Add direct ktime programming function
    clockevents: Make minimum delay adjustments configurable
    nohz: Remove "Switched to NOHz mode" debugging messages
    proc: Consider NO_HZ when printing idle and iowait times
    nohz: Make idle/iowait counter update conditional
    nohz: Fix update_ts_time_stat idle accounting
    cputime: Clean up cputime_to_usecs and usecs_to_cputime macros
    alarmtimers: Rework RTC device selection using class interface
    alarmtimers: Add try_to_cancel functionality
    alarmtimers: Add more refined alarm state tracking
    alarmtimers: Remove period from alarm structure
    alarmtimers: Remove interval cap limit hack
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://github.com/ericvh/linux:
    9p: fix 9p.txt to advertise msize instead of maxdata
    net/9p: Convert net/9p protocol dumps to tracepoints
    fs/9p: change an int to unsigned int
    fs/9p: Cleanup option parsing in 9p
    9p: move dereference after NULL check
    fs/9p: inode file operation is properly initialized init_special_inode
    fs/9p: Update zero-copy implementation in 9p

    Linus Torvalds
     
  • ceph_release_page_vector() kfrees the vector; we shouldn't do it here too.

    Reported-by: Jeff Wu
    Signed-off-by: Sage Weil

    Sage Weil
     
  • Fix 32-bit ino generation to not always be 1.

    Signed-off-by: Amon Ott

    Amon Ott
     
  • Previously we were validating the passed-in stripe unit, object size,
    and stripe count against each other (and not testing most other stuff).
    Instead, make sure that the composed previous layout and new values are valid,
    and only send the new values to the MDS. This lets users change the
    pool without setting the whole layout, for instance.

    Signed-off-by: Greg Farnum

    Greg Farnum
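
The compose-then-validate approach described above can be sketched like this. Everything here is an assumption made for illustration: `toy_layout` and `toy_compose_layout` are hypothetical, a zero field in the request is taken to mean "keep the previous value", and the validity rules (non-zero fields, object size a multiple of the stripe unit) are stand-ins for whatever the real code checks.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical layout; a zero field in a request means "unchanged". */
struct toy_layout {
	uint32_t stripe_unit;
	uint32_t stripe_count;
	uint32_t object_size;
	uint32_t pool;
};

/* Assumed validity rules, applied to the composed layout only. */
static int toy_layout_valid(const struct toy_layout *l)
{
	if (!l->stripe_unit || !l->stripe_count || !l->object_size)
		return 0;
	return l->object_size % l->stripe_unit == 0;
}

/*
 * Merges the request into the previous layout and validates the result.
 * Returns 0 and fills *out on success, -EINVAL otherwise.
 */
static int toy_compose_layout(const struct toy_layout *prev,
			      const struct toy_layout *req,
			      struct toy_layout *out)
{
	struct toy_layout l = *prev;

	assert(prev && req && out);
	if (req->stripe_unit)
		l.stripe_unit = req->stripe_unit;
	if (req->stripe_count)
		l.stripe_count = req->stripe_count;
	if (req->object_size)
		l.object_size = req->object_size;
	if (req->pool)
		l.pool = req->pool;
	if (!toy_layout_valid(&l))
		return -EINVAL;
	*out = l;
	return 0;
}
```

A request that sets only the pool passes because the untouched fields are inherited from the previous layout, while a request that sets an object size incompatible with the existing stripe unit is rejected.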
     
  • This reverts commit c9af9fb68e01eb2c2165e1bc45cfeeed510c64e6.

    We need to block and truncate all pages in order to reliably invalidate
    them. Otherwise, we could:

    - have some uptodate pages in the cache
    - queue an invalidate
    - write(2) locks some pages
    - invalidate_work skips them
    - write(2) only overwrites part of the page
    - page now dirty and uptodate
    -> partial leakage of invalidated data

    It's not entirely clear why we started skipping locked pages in the first
    place. I just ran this through fsx and didn't see any problems.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • Trivial formatting fix.

    Signed-off-by: Noah Watkins
    Signed-off-by: Sage Weil

    Noah Watkins
     
  • The pool allocation failures are masked by the pool; there is no need to
    spam the console about them. (That's the whole point of having the pool
    in the first place.)

    Mark msg allocations whose failure is safely handled as such.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • This simplifies the init/shutdown paths, and makes client->msgr available
    during the rest of the setup process.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • ...after some prodding by Christoph.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • The 'rsize' mount option limits the maximum size of an individual
    read(ahead) operation that is sent off to an OSD. This is distinct from
    'rasize', which controls the size of the readahead window.

    Signed-off-by: Sage Weil

    Sage Weil
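
The relationship between the two options can be shown with simple arithmetic. This sketch assumes that an rsize of 0 means "no limit" and ignores object and stripe boundaries, which would split reads further; `toy_reads_needed` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of individual read operations needed to fill a readahead
 * window of window_bytes when each read is capped at rsize bytes.
 * An rsize of 0 is treated as "unlimited" (assumption).
 */
static size_t toy_reads_needed(size_t window_bytes, size_t rsize)
{
	if (window_bytes == 0)
		return 0;
	if (rsize == 0)
		return 1;
	assert(window_bytes <= (size_t)-1 - (rsize - 1));	/* no overflow */
	return (window_bytes + rsize - 1) / rsize;
}
```

With an 8 MB window, a 2 MB cap yields four reads and no cap yields one, so rasize sets how much is read ahead and rsize sets how that amount is chopped up.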
     
  • It controls readahead.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • When we get a ->readpages() aop, submit async reads for all page ranges
    in the provided page list. Lock the pages immediately, so that VFS/MM
    will block until the reads complete.

    Signed-off-by: Sage Weil

    Sage Weil