19 Sep, 2013

1 commit

  • Pull vfs fixes from Al Viro:
    "atomic_open-related fixes (Miklos' series, with EEXIST-related parts
    replaced with fix in fs/namei.c:atomic_open() instead of messing with
    the instances) + race fix in autofs + leak on failure exit in 9p"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    9p: don't forget to destroy inode cache if fscache registration fails
    atomic_open: take care of EEXIST in no-open case with O_CREAT|O_EXCL in fs/namei.c
    vfs: don't set FILE_CREATED before calling ->atomic_open()
    nfs: set FILE_CREATED
    gfs2: set FILE_CREATED
    cifs: fix filp leak in cifs_atomic_open()
    vfs: improve i_op->atomic_open() documentation
    autofs4: close the races around autofs4_notify_daemon()

    Linus Torvalds
     

17 Sep, 2013

1 commit

  • Fix documentation of ->atomic_open() and related functions: finish_open()
    and finish_no_open(). Also add details that seem to be unclear and a
    source of bugs (some of which are fixed in the following series).

    Cc-ing maintainers of all filesystems implementing ->atomic_open().

    Signed-off-by: Miklos Szeredi
    Cc: Eric Van Hensbergen
    Cc: Sage Weil
    Cc: Steve French
    Cc: Steven Whitehouse
    Cc: Trond Myklebust
    Signed-off-by: Al Viro

    Miklos Szeredi
     

14 Sep, 2013

1 commit


13 Sep, 2013

1 commit

  • Pull vfs pile 4 from Al Viro:
    "list_lru pile, mostly"

    This came out of Andrew's pile; Al ended up doing the merge work so that
    Andrew didn't have to.

    Additionally, a few fixes.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
    super: fix for destroy lrus
    list_lru: dynamically adjust node arrays
    shrinker: Kill old ->shrink API.
    shrinker: convert remaining shrinkers to count/scan API
    staging/lustre/libcfs: cleanup linux-mem.h
    staging/lustre/ptlrpc: convert to new shrinker API
    staging/lustre/obdclass: convert lu_object shrinker to count/scan API
    staging/lustre/ldlm: convert to shrinkers to count/scan API
    hugepage: convert huge zero page shrinker to new shrinker API
    i915: bail out earlier when shrinker cannot acquire mutex
    drivers: convert shrinkers to new count/scan API
    fs: convert fs shrinkers to new scan/count API
    xfs: fix dquot isolation hang
    xfs-convert-dquot-cache-lru-to-list_lru-fix
    xfs: convert dquot cache lru to list_lru
    xfs: rework buffer dispose list tracking
    xfs-convert-buftarg-lru-to-generic-code-fix
    xfs: convert buftarg LRU to generic code
    fs: convert inode and dentry shrinking to be node aware
    vmscan: per-node deferred work
    ...

    Linus Torvalds
     

12 Sep, 2013

3 commits

  • Pull CIFS fixes from Steve French:
    "CIFS update including case insensitive file name matching improvements
    for UTF-8 to Unicode, various small cifs fixes, SMB2/SMB3 leasing
    improvements, support for following SMB2 symlinks, SMB3 packet signing
    improvements"

    * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: (25 commits)
    CIFS: Respect epoch value from create lease context v2
    CIFS: Add create lease v2 context for SMB3
    CIFS: Move parsing lease buffer to ops struct
    CIFS: Move creating lease buffer to ops struct
    CIFS: Store lease state itself rather than a mapped oplock value
    CIFS: Replace clientCanCache* bools with an integer
    [CIFS] quiet sparse compile warning
    cifs: Start using per session key for smb2/3 for signature generation
    cifs: Add a variable specific to NTLMSSP for key exchange.
    cifs: Process post session setup code in respective dialect functions.
    CIFS: convert to use le32_add_cpu()
    CIFS: Fix missing lease break
    CIFS: Fix a memory leak when a lease break comes
    cifs: add winucase_convert.pl to Documentation/ directory
    cifs: convert case-insensitive dentry ops to use new case conversion routines
    cifs: add new case-insensitive conversion routines that are based on wchar_t's
    [CIFS] Add Scott to list of cifs contributors
    cifs: Move and expand MAX_SERVER_SIZE definition
    cifs: Expand max share name length to 256
    cifs: Move string length definitions to uapi
    ...

    Linus Torvalds
     
  • Command line option rootfstype=ramfs to obtain old initramfs behavior, and
    use ramfs instead of tmpfs for stub when root= defined (for cosmetic
    reasons).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Rob Landley
    Cc: Jeff Layton
    Cc: Jens Axboe
    Cc: Stephen Warren
    Cc: Rusty Russell
    Cc: Jim Cromie
    Cc: Sam Ravnborg
    Cc: Greg Kroah-Hartman
    Cc: "Eric W. Biederman"
    Cc: Alexander Viro
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Landley
     
  • Fix mistake in the description of Committed_AS in kernel documentation.

    Signed-off-by: Minto Joseph
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minto Joseph
     

11 Sep, 2013

1 commit

  • For a long time no filesystem has been using vfs_follow_link, and as seen
    by recent filesystem submissions any new use is accidental as well.

    Remove vfs_follow_link, document the replacement in
    Documentation/filesystems/porting and also rename __vfs_follow_link
    to match its only caller better.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

10 Sep, 2013

1 commit

  • Pull ceph updates from Sage Weil:
    "This includes both the first pile of Ceph patches (which I sent to
    torvalds@vger, sigh) and a few new patches that add support for
    fscache for Ceph. That includes a few fscache core fixes that David
    Howells asked go through the Ceph tree. (Thanks go to Milosz Tanski
    for putting this feature together)

    This first batch of patches (included here) had (has) several
    important RBD bug fixes, hole punch support, several different
    cleanups in the page cache interactions, improvements in the truncate
    code (new truncate mutex to avoid shenanigans with i_mutex), and a
    series of fixes in the synchronous striping read/write code.

    On top of that is a random collection of small fixes all across the
    tree (error code checks and error path cleanup, obsolete wq flags,
    etc)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (43 commits)
    ceph: use d_invalidate() to invalidate aliases
    ceph: remove ceph_lookup_inode()
    ceph: trivial buildbot warnings fix
    ceph: Do not do invalidate if the filesystem is mounted nofsc
    ceph: page still marked private_2
    ceph: ceph_readpage_to_fscache didn't check if marked
    ceph: clean PgPrivate2 on returning from readpages
    ceph: use fscache as a local presisent cache
    fscache: Netfs function for cleanup post readpages
    FS-Cache: Fix heading in documentation
    CacheFiles: Implement interface to check cache consistency
    FS-Cache: Add interface to check consistency of a cached object
    rbd: fix null dereference in dout
    rbd: fix buffer size for writes to images with snapshots
    libceph: use pg_num_mask instead of pgp_num_mask for pg.seed calc
    rbd: fix I/O error propagation for reads
    ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem
    ceph: allow sync_read/write return partial successed size of read/write.
    ceph: fix bugs about handling short-read for sync read mode.
    ceph: remove useless variable revoked_rdcache
    ...

    Linus Torvalds
     

09 Sep, 2013

3 commits


07 Sep, 2013

3 commits

  • Pull trivial tree from Jiri Kosina:
    "The usual trivial updates all over the tree -- mostly typo fixes and
    documentation updates"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
    doc: Documentation/cputopology.txt fix typo
    treewide: Convert retrun typos to return
    Fix comment typo for init_cma_reserved_pageblock
    Documentation/trace: Correcting and extending tracepoint documentation
    mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
    power: Documentation: Update s2ram link
    doc: fix a typo in Documentation/00-INDEX
    Documentation/printk-formats.txt: No casts needed for u64/s64
    doc: Fix typo "is is" in Documentations
    treewide: Fix printks with 0x%#
    zram: doc fixes
    Documentation/kmemcheck: update kmemcheck documentation
    doc: documentation/hwspinlock.txt fix typo
    PM / Hibernate: add section for resume options
    doc: filesystems : Fix typo in Documentations/filesystems
    scsi/megaraid fixed several typos in comments
    ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
    treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
    page_isolation: Fix a comment typo in test_pages_isolated()
    doc: fix a typo about irq affinity
    ...

    Linus Torvalds
     
  • Pull ext3, reiserfs, udf & isofs fixes from Jan Kara:
    "This contains a bunch of ext3 cleanups and minor improvements, major
    reiserfs locking changes which should hopefully fix deadlocks
    introduced by BKL removal, and udf/isofs changes to refuse mounting fs
    rw instead of mounting it ro automatically which makes eject button
    work as expected for all media (see the changelog for why userspace
    should be ok with this change)"

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    jbd: use a single printk for jbd_debug()
    reiserfs: locking, release lock around quota operations
    reiserfs: locking, handle nested locks properly
    reiserfs: locking, push write lock out of xattr code
    jbd: relocate assert after state lock in journal_commit_transaction()
    udf: Refuse RW mount of the filesystem instead of making it RO
    udf: Standardize return values in mount sequence
    isofs: Refuse RW mount of the filesystem instead of making it RO
    ext3: allow specifying external journal by pathname mount option
    jbd: remove unneeded semicolon

    Linus Torvalds
     
  • Pull f2fs updates from Jaegeuk Kim:
    "This patch-set includes the following major enhancement patches:
    - support inline xattrs
    - add sysfs support to control GCs explicitly
    - add proc entry to show the current segment usage information
    - improve the GC/SSR performance

    The other bug fixes are as follows:
    - avoid the overflow on status calculation
    - fix some error handling routines
    - fix inconsistent xattr states after power-off-recovery
    - fix incorrect xattr node offset definition
    - fix deadlock condition in fsync
    - fix the fdatasync routine for power-off-recovery"

    * tag 'for-f2fs-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (40 commits)
    f2fs: optimize gc for better performance
    f2fs: merge more bios of node block writes
    f2fs: avoid an overflow during utilization calculation
    f2fs: trigger GC when there are prefree segments
    f2fs: use strncasecmp() simplify the string comparison
    f2fs: fix omitting to update inode page
    f2fs: support the inline xattrs
    f2fs: add the truncate_xattr_node function
    f2fs: introduce __find_xattr for readability
    f2fs: reserve the xattr space dynamically
    f2fs: add flags for inline xattrs
    f2fs: fix error return code in init_f2fs_fs()
    f2fs: fix wrong BUG_ON condition
    f2fs: fix memory leak when init f2fs filesystem fail
    f2fs: fix a compound statement label error
    f2fs: avoid writing inode redundantly when creating a file
    f2fs: alloc_page() doesn't return an ERR_PTR
    f2fs: should cover i_xattr_nid with its xattr node page lock
    f2fs: check the free space first in new_node_page
    f2fs: clean up the needless end 'return' of void function
    ...

    Linus Torvalds
     

06 Sep, 2013

3 commits

  • Currently the fscache code expects the netfs to call fscache_readpages_or_alloc
    inside the aops readpages callback. It marks all the pages in the list
    provided by readahead with PG_private_2. In the cases that the netfs fails to
    read all the pages (which is legal) it ends up returning to the readahead and
    triggering a BUG. This happens because the page list still contains marked
    pages.

    This patch implements a simple fscache_readpages_cancel function that the netfs
    should call before returning from readpages. It will revoke the pages from the
    underlying cache backend and unmark them.

    The problem was originally worked out in the Ceph devel tree, but it also
    occurs in CIFS. It appears that NFS, AFS and 9P are okay as read_cache_pages()
    will clean up the unprocessed pages in the case of an error.

    This can be used to address the following oops:

    [12410647.597278] BUG: Bad page state in process petabucket pfn:3d504e
    [12410647.597292] page:ffffea000f541380 count:0 mapcount:0 mapping: (null) index:0x0
    [12410647.597298] page flags: 0x200000000001000(private_2)

    ...

    [12410647.597334] Call Trace:
    [12410647.597345] [] dump_stack+0x19/0x1b
    [12410647.597356] [] bad_page+0xc7/0x120
    [12410647.597359] [] free_pages_prepare+0x10e/0x120
    [12410647.597361] [] free_hot_cold_page+0x40/0x170
    [12410647.597363] [] __put_single_page+0x27/0x30
    [12410647.597365] [] put_page+0x25/0x40
    [12410647.597376] [] ceph_readpages+0x2e9/0x6e0 [ceph]
    [12410647.597379] [] __do_page_cache_readahead+0x1af/0x260
    [12410647.597382] [] ra_submit+0x21/0x30
    [12410647.597384] [] filemap_fault+0x254/0x490
    [12410647.597387] [] __do_fault+0x6f/0x4e0
    [12410647.597391] [] ? __switch_to+0x16d/0x4a0
    [12410647.597395] [] ? finish_task_switch+0x5a/0xc0
    [12410647.597398] [] handle_pte_fault+0xf6/0x930
    [12410647.597401] [] ? pte_mfn_to_pfn+0x93/0x110
    [12410647.597403] [] ? xen_pmd_val+0xe/0x10
    [12410647.597405] [] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
    [12410647.597407] [] handle_mm_fault+0x251/0x370
    [12410647.597411] [] ? call_rwsem_down_read_failed+0x14/0x30
    [12410647.597414] [] __do_page_fault+0x1aa/0x550
    [12410647.597418] [] ? up_write+0x1d/0x20
    [12410647.597422] [] ? vm_mmap_pgoff+0xbc/0xe0
    [12410647.597425] [] ? SyS_mmap_pgoff+0xd8/0x240
    [12410647.597427] [] do_page_fault+0xe/0x10
    [12410647.597431] [] page_fault+0x28/0x30

    Signed-off-by: Milosz Tanski
    Signed-off-by: David Howells

    Milosz Tanski
     
  • Fix a heading in the documentation to make it consistent with the contents
    list.

    Signed-off-by: David Howells

    David Howells
     
  • Extend the fscache netfs API so that the netfs can ask as to whether a cache
    object is up to date with respect to its corresponding netfs object:

    int fscache_check_consistency(struct fscache_cookie *cookie)

    This will call back to the netfs to check whether the auxiliary data associated
    with a cookie is correct. It returns 0 if it is and -ESTALE if it isn't; it
    may also return -ENOMEM and -ERESTARTSYS.

    The backends now have to implement a mandatory operation pointer:

    int (*check_consistency)(struct fscache_object *object)

    that corresponds to the above API call. FS-Cache takes care of pinning the
    object and the cookie in memory and managing this call with respect to the
    object state.

    Original-author: Hongyi Jia
    Signed-off-by: David Howells
    cc: Hongyi Jia
    cc: Milosz Tanski

    David Howells
     

29 Aug, 2013

1 commit

  • It's always been a hassle that if an external journal's
    device number changes, the filesystem won't mount.
    And since boot-time enumeration can change, device number
    changes aren't unusual.

    The current mechanism to update the journal location is by
    passing in a mount option w/ a new devnum, but that's a hassle;
    it's a manual approach, fixing things after the fact.

    Adding a mount option, "-o journal_path=/dev/$DEVICE" would
    help, since then we can do e.g.

    # mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ...

    and it'll mount even if the devnum has changed, as shown here:

    # losetup /dev/loop0 journalfile
    # mke2fs -L mylabel-journal -O journal_dev /dev/loop0
    # mkfs.ext4 -L mylabel -J device=/dev/loop0 /dev/sdb1

    Change the journal device number:

    # losetup -d /dev/loop0
    # losetup /dev/loop1 journalfile

    And today it will fail:

    # mount /dev/sdb1 /mnt/test
    mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
    missing codepage or helper program, or other error
    In some cases useful info is found in syslog - try
    dmesg | tail or so

    # dmesg | tail -n 1
    [17343.240702] EXT4-fs (sdb1): error: couldn't read superblock of external journal

    But with this new mount option, we can specify the new path:

    # mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test
    #

    (which does update the encoded device number, incidentally):

    # umount /dev/sdb1
    # dumpe2fs -h /dev/sdb1 | grep "Journal device"
    dumpe2fs 1.41.12 (17-May-2010)
    Journal device: 0x0701

    But best of all we can just always mount by journal-path, and
    it'll always work:

    # mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test
    #

    So the journal_path option can be specified in fstab, and as long as
    the disk is available somewhere, and findable by label (or by UUID),
    we can mount.

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara
    Reviewed-by: Carlos Maiolino

    Eric Sandeen
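
The "Journal device: 0x0701" value printed by dumpe2fs in the commit message above can be read as a major/minor pair. The following is a minimal sketch, assuming the field uses the same bit layout as the kernel's new_encode_dev() helper (major in bits 8-19, minor split around it); the function names are illustrative and not part of e2fsprogs or the kernel.

```python
from typing import Tuple


def decode_journal_dev(encoded: int) -> Tuple[int, int]:
    """Split an encoded device number into (major, minor).

    Assumed layout (as in the kernel's new_encode_dev()):
    bits 0-7 hold the low minor bits, bits 8-19 the major,
    and bits 20 and up the remaining minor bits.
    """
    major = (encoded & 0xFFF00) >> 8
    minor = (encoded & 0xFF) | ((encoded >> 12) & 0xFFF00)
    return major, minor


def encode_journal_dev(major: int, minor: int) -> int:
    """Inverse of decode_journal_dev() under the same assumed layout."""
    return (minor & 0xFF) | (major << 8) | ((minor & ~0xFF) << 12)


if __name__ == "__main__":
    major, minor = decode_journal_dev(0x0701)
    # Loop devices use block major 7, so this pair corresponds to /dev/loop1.
    print(f"major={major} minor={minor}")
```

Under that assumption, 0x0701 decodes to major 7, minor 1, which matches the journal having moved from /dev/loop0 to /dev/loop1 in the example.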
     

20 Aug, 2013

1 commit


06 Aug, 2013

2 commits

  • Add sysfs entry gc_idle to control the gc policy: gc_idle = 1
    selects the cost-benefit approach, while gc_idle = 2 selects the
    greedy approach to garbage collection. The two are mutually
    exclusive; only one approach is in effect at any point. If
    gc_idle = 0, this option is disabled.

    Cc: Gu Zheng
    Signed-off-by: Namjae Jeon
    Signed-off-by: Pankaj Kumar
    Reviewed-by: Gu Zheng
    [Jaegeuk Kim: change the select_gc_type() flow slightly]
    Signed-off-by: Jaegeuk Kim

    Namjae Jeon
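
The gc_idle values described in the commit above can be summarised in a small helper. This is only a sketch: the function names are hypothetical, and the sysfs path is passed in by the caller because the commit message does not state where the entry lives (a path such as /sys/fs/f2fs/<device>/gc_idle is an assumption here).

```python
from pathlib import Path

# Value-to-policy mapping as described in the commit message.
GC_IDLE_POLICIES = {
    0: "disabled",
    1: "cost-benefit",
    2: "greedy",
}


def gc_idle_policy(value: int) -> str:
    """Return the policy name for a gc_idle value, rejecting unknown ones."""
    if value not in GC_IDLE_POLICIES:
        raise ValueError(f"gc_idle must be one of {sorted(GC_IDLE_POLICIES)}")
    return GC_IDLE_POLICIES[value]


def set_gc_idle(sysfs_entry: str, value: int) -> str:
    """Validate the value, write it to the given sysfs entry, return the policy.

    sysfs_entry is supplied by the caller; writing usually requires root.
    """
    policy = gc_idle_policy(value)
    Path(sysfs_entry).write_text(f"{value}\n")
    return policy


if __name__ == "__main__":
    for value in sorted(GC_IDLE_POLICIES):
        print(value, gc_idle_policy(value))
```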
     
  • Add sysfs entries to control the timing parameters for
    f2fs gc thread.

    Various Sysfs options introduced are:
    gc_min_sleep_time: Min Sleep time for GC in ms
    gc_max_sleep_time: Max Sleep time for GC in ms
    gc_no_gc_sleep_time: Default Sleep time for GC in ms

    Cc: Gu Zheng
    Signed-off-by: Namjae Jeon
    Signed-off-by: Pankaj Kumar
    Reviewed-by: Gu Zheng
    [Jaegeuk Kim: fix an umount bug and some minor changes]
    Signed-off-by: Jaegeuk Kim

    Namjae Jeon
     

01 Aug, 2013

1 commit

  • It's always been a hassle that if an external journal's
    device number changes, the filesystem won't mount.
    And since boot-time enumeration can change, device number
    changes aren't unusual.

    The current mechanism to update the journal location is by
    passing in a mount option w/ a new devnum, but that's a hassle;
    it's a manual approach, fixing things after the fact.

    Adding a mount option, "-o journal_path=/dev/$DEVICE" would
    help, since then we can do e.g.

    # mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ...

    and it'll mount even if the devnum has changed, as shown here:

    # losetup /dev/loop0 journalfile
    # mke2fs -L mylabel-journal -O journal_dev /dev/loop0
    # mkfs.ext3 -L mylabel -J device=/dev/loop0 /dev/sdb1

    Change the journal device number:

    # losetup -d /dev/loop0
    # losetup /dev/loop1 journalfile

    And today it will fail:

    # mount /dev/sdb1 /mnt/test
    mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
    missing codepage or helper program, or other error
    In some cases useful info is found in syslog - try
    dmesg | tail or so

    # dmesg | tail -n 1
    [17343.240702] EXT3-fs (sdb1): error: couldn't read superblock of external journal

    But with this new mount option, we can specify the new path:

    # mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test
    #

    (which does update the encoded device number, incidentally):

    # umount /dev/sdb1
    # dumpe2fs -h /dev/sdb1 | grep "Journal device"
    dumpe2fs 1.41.12 (17-May-2010)
    Journal device: 0x0701

    But best of all we can just always mount by journal-path, and
    it'll always work:

    # mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test
    #

    So the journal_path option can be specified in fstab, and as long as
    the disk is available somewhere, and findable by label (or by UUID),
    we can mount.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Jan Kara

    Eric Sandeen
     

30 Jul, 2013

1 commit


25 Jul, 2013

1 commit


14 Jul, 2013

1 commit

  • Pull more xfs updates from Ben Myers:
    "Here are a fix for xfs_fsr, a cleanup in bulkstat, a cleanup in
    xfs_open_by_handle, updated mount options documentation, a cleanup in
    xfs_bmapi_write, a fix for the size of dquot log reservations, a fix
    for sgid inheritance when acls are in use, a fix for cleaning up
    quotainfo structures, and some more of the work which allows group and
    project quotas to be used together.

    We had a few more in this last quota category that we might have liked
    to get in, but it looks like there are still a few items that need to be
    addressed.

    - fix for xfs_fsr returning -EINVAL
    - cleanup in xfs_bulkstat
    - cleanup in xfs_open_by_handle
    - update mount options documentation
    - clean up local format handling in xfs_bmapi_write
    - fix dquot log reservations which were too small
    - fix sgid inheritance for subdirectories when default acls are in use
    - add project quota fields to various structures
    - fix teardown of quotainfo structures when quotas are turned off"

    * tag 'for-linus-v3.11-rc1-2' of git://oss.sgi.com/xfs/xfs:
    xfs: Fix the logic check for all quotas being turned off
    xfs: Add pquota fields where gquota is used.
    xfs: fix sgid inheritance for subdirectories inheriting default acls [V3]
    xfs: dquot log reservations are too small
    xfs: remove local fork format handling from xfs_bmapi_write()
    xfs: update mount options documentation
    xfs: use get_unused_fd_flags(0) instead of get_unused_fd()
    xfs: clean up unused codes at xfs_bulkstat()
    xfs: use XFS_BMAP_BMDR_SPACE vs. XFS_BROOT_SIZE_ADJ

    Linus Torvalds
     

10 Jul, 2013

1 commit


05 Jul, 2013

1 commit

  • Pull trivial tree updates from Jiri Kosina:
    "The usual stuff from trivial tree"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    treewide: relase -> release
    Documentation/cgroups/memory.txt: fix stat file documentation
    sysctl/net.txt: delete reference to obsolete 2.4.x kernel
    spinlock_api_smp.h: fix preprocessor comments
    treewide: Fix typo in printk
    doc: device tree: clarify stuff in usage-model.txt.
    open firmware: "/aliasas" -> "/aliases"
    md: bcache: Fixed a typo with the word 'arithmetic'
    irq/generic-chip: fix a few kernel-doc entries
    frv: Convert use of typedef ctl_table to struct ctl_table
    sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
    doc: clk: Fix incorrect wording
    Documentation/arm/IXP4xx fix a typo
    Documentation/networking/ieee802154 fix a typo
    Documentation/DocBook/media/v4l fix a typo
    Documentation/video4linux/si476x.txt fix a typo
    Documentation/virtual/kvm/api.txt fix a typo
    Documentation/early-userspace/README fix a typo
    Documentation/video4linux/soc-camera.txt fix a typo
    lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
    ...

    Linus Torvalds
     

04 Jul, 2013

4 commits

  • Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The documentation for address_space_operations is partially out of date.

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The soft-dirty is a bit on a PTE which helps to track which pages a task
    writes to. In order to do this tracking one should

    1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs")
    2. Wait some time.
    3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)

    To do this tracking, the writable bit is cleared from PTEs when the
    soft-dirty bit is. Thus, after this, when the task tries to modify a
    page at some virtual address the #PF occurs and the kernel sets the
    soft-dirty bit on the respective PTE.

    Note that although the task's entire address space is marked as r/o after
    the soft-dirty bits are cleared, the #PF-s that occur after that are processed
    quickly. This is because the pages are still mapped to physical memory,
    so all the kernel does is find this out and put the writable, dirty and
    soft-dirty bits back on the PTE.

    Another thing to note, is that when mremap moves PTEs they are marked
    with soft-dirty as well, since from the user perspective mremap modifies
    the virtual memory at mremap's new address.

    Signed-off-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Glauber Costa
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
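
The three tracking steps in the commit above (clear, wait, read bit 55) can be sketched from userspace as follows. The helper names are illustrative. The sketch assumes each pagemap entry is one native-endian 64-bit word per virtual page, and it defaults to /proc/PID/pagemap rather than the pagemap2 file named in the commit message, since the former is the interface documented in mainline kernels; pass pagemap_name to override. It also needs a kernel built with soft-dirty support.

```python
import os
import struct

SOFT_DIRTY_BIT = 55      # bit position named in the commit message
PAGEMAP_ENTRY_SIZE = 8   # assumed: one 64-bit word per virtual page


def pagemap_offset(vaddr: int, page_size: int) -> int:
    """Byte offset of the pagemap entry that covers vaddr."""
    return (vaddr // page_size) * PAGEMAP_ENTRY_SIZE


def is_soft_dirty(entry: int) -> bool:
    """True if the soft-dirty bit is set in a 64-bit pagemap entry."""
    return bool((entry >> SOFT_DIRTY_BIT) & 1)


def clear_soft_dirty(pid: int) -> None:
    """Step 1: clear soft-dirty bits (echo 4 > /proc/PID/clear_refs)."""
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("4")


def read_soft_dirty(pid: int, vaddr: int, pagemap_name: str = "pagemap") -> bool:
    """Step 3: read the entry for vaddr and test its soft-dirty bit."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open(f"/proc/{pid}/{pagemap_name}", "rb") as f:
        f.seek(pagemap_offset(vaddr, page_size))
        raw = f.read(PAGEMAP_ENTRY_SIZE)
    if len(raw) != PAGEMAP_ENTRY_SIZE:
        raise OSError("short read from pagemap")
    (entry,) = struct.unpack("=Q", raw)  # native byte order, 8 bytes
    return is_soft_dirty(entry)
```

A tracker would call clear_soft_dirty(pid), let the task run for a while, and then call read_soft_dirty(pid, vaddr) for each page of interest.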
     
  • Pull second set of VFS changes from Al Viro:
    "Assorted f_pos race fixes, making do_splice_direct() safe to call with
    i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
    ->d_hash/->d_compare calling conventions changes from Linus, misc
    stuff all over the place."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    Document ->tmpfile()
    ext4: ->tmpfile() support
    vfs: export lseek_execute() to modules
    lseek_execute() doesn't need an inode passed to it
    block_dev: switch to fixed_size_llseek()
    cpqphp_sysfs: switch to fixed_size_llseek()
    tile-srom: switch to fixed_size_llseek()
    proc_powerpc: switch to fixed_size_llseek()
    ubi/cdev: switch to fixed_size_llseek()
    pci/proc: switch to fixed_size_llseek()
    isapnp: switch to fixed_size_llseek()
    lpfc: switch to fixed_size_llseek()
    locks: give the blocked_hash its own spinlock
    locks: add a new "lm_owner_key" lock operation
    locks: turn the blocked_list into a hashtable
    locks: convert fl_link to a hlist_node
    locks: avoid taking global lock if possible when waking up blocked waiters
    locks: protect most of the file_lock handling with i_lock
    locks: encapsulate the fl_link list handling
    locks: make "added" in __posix_lock_file a bool
    ...

    Linus Torvalds
     

03 Jul, 2013

3 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • Pull f2fs updates from Jaegeuk Kim:
    "This patch-set includes the following major enhancement patches:
    - remount_fs callback function
    - restore parent inode number to enhance the fsync performance
    - xattr security labels
    - reduce the number of redundant lock/unlock data pages
    - avoid frequent write_inode calls

    The other minor bug fixes are as follows.
    - endian conversion bugs
    - various bugs in the roll-forward recovery routine"

    * tag 'for-f2fs-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (56 commits)
    f2fs: fix to recover i_size from roll-forward
    f2fs: remove the unused argument "sbi" of func destroy_fsync_dnodes()
    f2fs: remove reusing any prefree segments
    f2fs: code cleanup and simplify in func {find/add}_gc_inode
    f2fs: optimize the init_dirty_segmap function
    f2fs: fix an endian conversion bug detected by sparse
    f2fs: fix crc endian conversion
    f2fs: add remount_fs callback support
    f2fs: recover wrong pino after checkpoint during fsync
    f2fs: optimize do_write_data_page()
    f2fs: make locate_dirty_segment() as static
    f2fs: remove unnecessary parameter "offset" from __add_sum_entry()
    f2fs: avoid freqeunt write_inode calls
    f2fs: optimise the truncate_data_blocks_range() range
    f2fs: use the F2FS specific flags in f2fs_ioctl()
    f2fs: sync dir->i_size with its block allocation
    f2fs: fix i_blocks translation on various types of files
    f2fs: set sb->s_fs_info before calling parse_options()
    f2fs: support xattr security labels
    f2fs: fix iget/iput of dir during recovery
    ...

    Linus Torvalds
     
  • Pull ext4 update from Ted Ts'o:
    "Lots of bug fixes, cleanups and optimizations. In the bug fixes
    category, of note is a fix for on-line resizing file systems where the
    block size is smaller than the page size (i.e., file systems with 1k blocks
    on x86, or more interestingly file systems with 4k blocks on Power or
    ia64 systems.)

    In the cleanup category, the ext4's punch hole implementation was
    significantly improved by Lukas Czerner, and now supports bigalloc
    file systems. In addition, Jan Kara significantly cleaned up the
    write submission code path. We also improved error checking and added
    a few sanity checks.

    In the optimizations category, two major optimizations deserve
    mention. The first is that ext4_writepages() is now used for
    nodelalloc and ext3 compatibility mode. This allows writes to be
    submitted much more efficiently as a single bio request, instead of
    being sent as individual 4k writes into the block layer (which then
    relied on the elevator code to coalesce the requests in the block
    queue). Secondly, the extent cache shrink mechanism, which was
    introduced in 3.9, no longer has a scalability bottleneck caused by the
    i_es_lru spinlock. Other optimizations include some changes to reduce
    CPU usage and to avoid issuing empty commits unnecessarily."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
    ext4: optimize starting extent in ext4_ext_rm_leaf()
    jbd2: invalidate handle if jbd2_journal_restart() fails
    ext4: translate flag bits to strings in tracepoints
    ext4: fix up error handling for mpage_map_and_submit_extent()
    jbd2: fix theoretical race in jbd2__journal_restart
    ext4: only zero partial blocks in ext4_zero_partial_blocks()
    ext4: check error return from ext4_write_inline_data_end()
    ext4: delete unnecessary C statements
    ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
    jbd2: move superblock checksum calculation to jbd2_write_superblock()
    ext4: pass inode pointer instead of file pointer to punch hole
    ext4: improve free space calculation for inline_data
    ext4: reduce object size when !CONFIG_PRINTK
    ext4: improve extent cache shrink mechanism to avoid to burn CPU time
    ext4: implement error handling of ext4_mb_new_preallocation()
    ext4: fix corruption when online resizing a fs with 1K block size
    ext4: delete unused variables
    ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
    jbd2: remove debug dependency on debug_fs and update Kconfig help text
    jbd2: use a single printk for jbd_debug()
    ...

    Linus Torvalds
     

29 Jun, 2013

5 commits

  • There's no reason we have to protect the blocked_hash and file_lock_list
    with the same spinlock. With the tests I have, breaking it in two gives
    a barely measurable performance benefit, but it seems reasonable to make
    this locking as granular as possible.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     
  • Currently, the hashing that the locking code uses to add these values
    to the blocked_hash is simply calculated using the fl_owner field. That's
    valid in most cases except for server-side lockd, which validates the
    owner of a lock based on fl_owner and fl_pid.

    In the case where you have a small number of NFS clients doing a lot
    of locking between different processes, you could end up with all
    the blocked requests sitting in a very small number of hash buckets.

    Add a new lm_owner_key operation to the lock_manager_operations that
    will generate an unsigned long to use as the key in the hashtable.
    That function is only implemented for server-side lockd, and simply
    XORs the fl_owner and fl_pid.

    Signed-off-by: Jeff Layton
    Acked-by: J. Bruce Fields
    Signed-off-by: Al Viro

    Jeff Layton
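
The bucket-clustering problem described in the commit above can be illustrated with a toy model. This is a sketch, not the kernel's code: the modulo bucket selection stands in for whatever hash the kernel applies to the key, the bucket count is arbitrary, and the owner value and PID range are made up for the example.

```python
def owner_only_key(fl_owner: int, fl_pid: int) -> int:
    """Key derived from fl_owner alone (the behaviour before the change)."""
    return fl_owner


def lockd_owner_key(fl_owner: int, fl_pid: int) -> int:
    """Key that XORs fl_owner with fl_pid, as described for server-side lockd."""
    return fl_owner ^ fl_pid


def buckets_used(requests, key_fn, nbuckets: int = 128) -> int:
    """Count distinct buckets hit; modulo is a stand-in for the real hash."""
    return len({key_fn(owner, pid) % nbuckets for owner, pid in requests})


# Hypothetical workload: one NFS client (a single fl_owner value) with
# 64 different processes taking locks.
EXAMPLE_REQUESTS = [(0xFFFF880012345000, pid) for pid in range(1, 65)]

if __name__ == "__main__":
    print("owner only:", buckets_used(EXAMPLE_REQUESTS, owner_only_key))
    print("owner ^ pid:", buckets_used(EXAMPLE_REQUESTS, lockd_owner_key))
```

In this model every request from the single client lands in one bucket when only fl_owner is used, while mixing in fl_pid spreads the same requests across many buckets.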
     
  • Having a global lock that protects all of this code is a clear
    scalability problem. Instead of doing that, move most of the code to be
    protected by the i_lock instead. The exceptions are the global lists
    that the ->fl_link sits on, and the ->fl_block list.

    ->fl_link is what connects these structures to the
    global lists, so we must ensure that we hold those locks when iterating
    over or updating these lists.

    Furthermore, sound deadlock detection requires that we hold the
    blocked_list state steady while checking for loops. We also must ensure
    that the search and update to the list are atomic.

    For the checking and insertion side of the blocked_list, push the
    acquisition of the global lock into __posix_lock_file and ensure that
    checking and update of the blocked_list is done without dropping the
    lock in between.

    On the removal side, when waking up blocked lock waiters, take the
    global lock before walking the blocked list and dequeue the waiters from
    the global list prior to removal from the fl_block list.

    With this, deadlock detection should be race free while we minimize
    excessive file_lock_lock thrashing.

    Finally, in order to avoid a lock inversion problem when handling
    /proc/locks output we must ensure that manipulations of the fl_block
    list are also protected by the file_lock_lock.

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
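
The requirement in the commit above, that the deadlock check and the insertion into the blocked list happen without dropping the lock in between, can be modelled in a few lines. This is an illustrative sketch only: the class and function names are hypothetical, lock owners are plain strings, and each waiter is assumed to wait on a single blocker.

```python
import threading
from typing import Dict, Hashable


def would_deadlock(waits_for: Dict[Hashable, Hashable],
                   waiter: Hashable, blocker: Hashable) -> bool:
    """Follow the chain of blockers; a loop back to waiter means deadlock."""
    seen = set()
    current = blocker
    while current is not None and current not in seen:
        if current == waiter:
            return True
        seen.add(current)
        current = waits_for.get(current)
    return False


class BlockedTable:
    """Toy stand-in for the global blocked list and the lock guarding it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._waits_for: Dict[Hashable, Hashable] = {}

    def try_block(self, waiter: Hashable, blocker: Hashable) -> bool:
        # Check and insert under one lock hold, so no other thread can
        # change the graph between the loop check and the update.
        with self._lock:
            if would_deadlock(self._waits_for, waiter, blocker):
                return False
            self._waits_for[waiter] = blocker
            return True

    def wake(self, waiter: Hashable) -> None:
        # Removal takes the same lock before touching the table.
        with self._lock:
            self._waits_for.pop(waiter, None)
```

If the lock were released between would_deadlock() and the insertion, two waiters could each pass the check and then both insert, creating an undetected loop.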
     
  • Instances either don't look at it at all (the majority of cases) or
    only want it to find the superblock (which can be had as dentry->d_sb).
    A few cases that want more are actually safe with dentry->d_inode -
    the only precaution needed is the check that it hadn't been replaced with
    NULL by rmdir() or by overwriting rename(), in which case it should simply
    be treated as a cache miss.

    Signed-off-by: Linus Torvalds
    Signed-off-by: Al Viro

    Linus Torvalds
     
  • everything's converted to ->iterate()

    Signed-off-by: Al Viro

    Al Viro