26 Sep, 2014

12 commits

  • commit 2457aec63745e235bcafb7ef312b182d8682f0fc upstream.

    aops->write_begin may allocate a new page and make it visible only to have
    mark_page_accessed called almost immediately after. Once the page is
    visible, atomic operations become necessary, which is noticeable overhead
    when writing to an in-memory filesystem like tmpfs but should also be
    noticeable with fast storage. The objective of the patch is to initialise
    the accessed information with non-atomic operations before the page is
    visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may
    be called before the page is visible and can be done non-atomically.

    The primary APIs of concern in this case are the following and are used
    by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail, so the patch creates a core
    helper, pagecache_get_page(), which takes a flags parameter that affects
    its behaviour, such as whether the page should be marked accessed or not.
    The old APIs are preserved but become thin wrappers around this core
    function.
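
    As an illustration, a minimal sketch of the wrapper pattern, modeled on
    the upstream commit (the FGP_* flag names and the exact
    pagecache_get_page() signature come from mainline and may differ in this
    tree):

    static inline struct page *find_or_create_page(struct address_space *mapping,
                                                   pgoff_t offset, gfp_t gfp_mask)
    {
            /* Thin wrapper: lock the page, mark it accessed (via
             * init_page_accessed() if it was freshly allocated and not yet
             * visible) and create it if it is missing. */
            return pagecache_get_page(mapping, offset,
                                      FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
                                      gfp_mask, gfp_mask);
    }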

    Each of the filesystems is then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of the
    mark_page_accessed() call has now changed, so in rare cases it's possible
    a page gets to the end of the LRU as PageReferenced whereas previously it
    might have been repromoted. This is expected to be rare but it's worth
    filesystem people thinking about in case they see a problem with the
    timing change. It is also the case that some filesystems may now be
    marking pages accessed that previously did not, but it makes sense that
    filesystems have consistent behaviour in this regard.

    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iteration. The size of the
    file is 1/10th of physical memory to avoid dirty page balancing. In the
    async case it is possible that the workload completes without even
    hitting the disk; the results will be variable but highlight the impact
    of mark_page_accessed for async IO. The sync results are expected to be
    more stable. The exception is tmpfs, where the normal case is for the
    "IO" to never hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or NUMA
    artifacts. Throughput and wall times are presented for sync IO; only wall
    times are shown for async, as the granularity reported by dd and the
    variability make it unsuitable for comparison. As async results were
    variable due to writeback timings, I'm only reporting the maximum figures.
    The sync results were stable enough to make the mean and stddev
    uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd
                                3.15.0-rc3            3.15.0-rc3
                                   vanilla           accessed-v2
    ext3   Max elapsed    13.9900 (  0.00%)    11.5900 ( 17.16%)
    tmpfs  Max elapsed     0.5100 (  0.00%)     0.4900 (  3.92%)
    btrfs  Max elapsed    12.8100 (  0.00%)    12.7800 (  0.23%)
    ext4   Max elapsed    18.6000 (  0.00%)    13.3400 ( 28.28%)
    xfs    Max elapsed    12.5600 (  0.00%)     2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

           samples  percentage
    ext3     86107      0.9783  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    ext3     23833      0.2710  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    ext3      5036      0.0573  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    ext4     64566      0.8961  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    ext4      5322      0.0713  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    ext4      2869      0.0384  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    xfs      62126      1.7675  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    xfs       1904      0.0554  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    xfs        103      0.0030  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    btrfs    10655      0.1338  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    btrfs     2020      0.0273  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    btrfs      587      0.0079  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed
    tmpfs    59562      3.2628  vmlinux-3.15.0-rc4-vanilla         mark_page_accessed
    tmpfs     1210      0.0696  vmlinux-3.15.0-rc4-accessed-v3r25  init_page_accessed
    tmpfs       94      0.0054  vmlinux-3.15.0-rc4-accessed-v3r25  mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit e7470ee89f003634a88e7b5e5a7b65b3025987de upstream.

    Discarding buffers clears a series of buffer flags with one atomic
    operation each, for no reason I can think of. Use a cmpxchg loop to
    clear all the necessary flags at once. In most (all?) cases this will be
    a single atomic operation.
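
    A sketch of the loop described above, close to the upstream commit (the
    flag mask shown is illustrative; see the akpm note below about where
    BUFFER_FLAGS_DISCARD lives):

    /* Bits to discard in one go rather than one atomic op each. */
    #define BUFFER_FLAGS_DISCARD \
            (1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \
             1 << BH_Delay | 1 << BH_Unwritten)

    static void discard_buffer(struct buffer_head *bh)
    {
            unsigned long b_state, b_state_old;

            lock_buffer(bh);
            clear_buffer_dirty(bh);
            bh->b_bdev = NULL;

            b_state = bh->b_state;
            for (;;) {
                    /* Clear all discard flags in (usually) one cmpxchg. */
                    b_state_old = cmpxchg(&bh->b_state, b_state,
                                          b_state & ~BUFFER_FLAGS_DISCARD);
                    if (b_state_old == b_state)
                            break;
                    b_state = b_state_old;
            }
            unlock_buffer(bh);
    }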

    [akpm@linux-foundation.org: move BUFFER_FLAGS_DISCARD into the .c file]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit b745bc85f21ea707e4ea1a91948055fa3e72c77b upstream.

    cold is a bool, so make it one. Make the likely case the "if" part of the
    block instead of the else, as according to the optimisation manual this is
    preferred.
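
    An illustrative sketch of the pattern (a hypothetical helper, not the
    exact hunk from the commit):

    static void free_page_to_list(struct page *page, struct list_head *list,
                                  bool cold)
    {
            if (!cold)                      /* likely case first */
                    list_add(&page->lru, list);
            else
                    list_add_tail(&page->lru, list);
    }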

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit d23da150a37c9fe3cc83dbaf71b3e37fd434ed52 upstream.

    We remove the call to grab_super_passive in the call to super_cache_count.
    It becomes a scalability bottleneck as multiple threads try to do
    memory reclamation, e.g. when we are doing a large amount of file reads
    and the page cache is under pressure. The cached objects quickly get
    reclaimed down to 0 and we abort the cache_scan() reclaim. But the
    counting creates a logjam acquiring the sb_lock.

    We are holding the shrinker_rwsem, which ensures the safety of the call to
    list_lru_count_node() and s_op->nr_cached_objects. The shrinker is
    unregistered now before ->kill_sb(), so the operation is safe when we are
    doing unmount.
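
    A condensed sketch of the resulting counting path, based on the upstream
    commit (fs/super.c; error handling trimmed):

    static unsigned long super_cache_count(struct shrinker *shrink,
                                           struct shrink_control *sc)
    {
            struct super_block *sb;
            long total_objects = 0;

            sb = container_of(shrink, struct super_block, s_shrink);

            /*
             * No grab_super_passive() here: the caller holds
             * shrinker_rwsem, which excludes unregister_shrinker() and
             * hence ->kill_sb(), so the sb cannot go away under us.
             */
            if (sb->s_op && sb->s_op->nr_cached_objects)
                    total_objects = sb->s_op->nr_cached_objects(sb, sc->nid);

            total_objects += list_lru_count_node(&sb->s_dentry_lru, sc->nid);
            total_objects += list_lru_count_node(&sb->s_inode_lru, sc->nid);

            return vfs_pressure_ratio(total_objects);
    }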

    The impact will depend heavily on the machine and the workload but for a
    small machine using postmark tuned to use 4xRAM size the results were

                                    3.15.0-rc5            3.15.0-rc5
                                       vanilla         shrinker-v1r1
    Ops/sec Transactions       21.00 (  0.00%)      24.00 ( 14.29%)
    Ops/sec FilesCreate        39.00 (  0.00%)      44.00 ( 12.82%)
    Ops/sec CreateTransact     10.00 (  0.00%)      12.00 ( 20.00%)
    Ops/sec FilesDeleted     6202.00 (  0.00%)    6202.00 (  0.00%)
    Ops/sec DeleteTransact     11.00 (  0.00%)      12.00 (  9.09%)
    Ops/sec DataRead/MB        25.97 (  0.00%)      29.10 ( 12.05%)
    Ops/sec DataWrite/MB       49.99 (  0.00%)      56.02 ( 12.06%)

    ffsb running in a configuration that is meant to simulate a mail server showed

                                    3.15.0-rc5            3.15.0-rc5
                                       vanilla         shrinker-v1r1
    Ops/sec readall          9402.63 (  0.00%)    9567.97 (  1.76%)
    Ops/sec create           4695.45 (  0.00%)    4735.00 (  0.84%)
    Ops/sec delete            173.72 (  0.00%)     179.83 (  3.52%)
    Ops/sec Transactions    14271.80 (  0.00%)   14482.81 (  1.48%)
    Ops/sec Read               37.00 (  0.00%)      37.60 (  1.62%)
    Ops/sec Write              18.20 (  0.00%)      18.30 (  0.55%)

    Signed-off-by: Tim Chen
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Acked-by: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Tim Chen
     
  • commit 28f2cd4f6da24a1aa06c226618ed5ad69e13df64 upstream.

    This series is aimed at regressions noticed during reclaim activity. The
    first two patches are shrinker patches that were posted ages ago but never
    merged for reasons that are unclear to me. I'm posting them again to see
    if there was a reason they were dropped or if they just got lost. Dave?
    Tim? The last patch adjusts proportional reclaim. Yuanhan Liu, can you
    retest the vm scalability test cases on a larger machine? Hugh, does this
    work for you on the memcg test cases?

    Based on ext4, I get the following results but unfortunately my larger
    test machines are all unavailable so this is based on a relatively small
    machine.

    postmark
                                    3.15.0-rc5            3.15.0-rc5
                                       vanilla       proportion-v1r4
    Ops/sec Transactions       21.00 (  0.00%)      25.00 ( 19.05%)
    Ops/sec FilesCreate        39.00 (  0.00%)      45.00 ( 15.38%)
    Ops/sec CreateTransact     10.00 (  0.00%)      12.00 ( 20.00%)
    Ops/sec FilesDeleted     6202.00 (  0.00%)    6202.00 (  0.00%)
    Ops/sec DeleteTransact     11.00 (  0.00%)      12.00 (  9.09%)
    Ops/sec DataRead/MB        25.97 (  0.00%)      30.02 ( 15.59%)
    Ops/sec DataWrite/MB       49.99 (  0.00%)      57.78 ( 15.58%)

    ffsb (mail server simulator)
                                    3.15.0-rc5            3.15.0-rc5
                                       vanilla       proportion-v1r4
    Ops/sec readall          9402.63 (  0.00%)    9805.74 (  4.29%)
    Ops/sec create           4695.45 (  0.00%)    4781.39 (  1.83%)
    Ops/sec delete            173.72 (  0.00%)     177.23 (  2.02%)
    Ops/sec Transactions    14271.80 (  0.00%)   14764.37 (  3.45%)
    Ops/sec Read               37.00 (  0.00%)      38.50 (  4.05%)
    Ops/sec Write              18.20 (  0.00%)      18.50 (  1.65%)

    dd of a large file
                                    3.15.0-rc5            3.15.0-rc5
                                       vanilla       proportion-v1r4
    WallTime DownloadTar       75.00 (  0.00%)      61.00 ( 18.67%)
    WallTime DD               423.00 (  0.00%)     401.00 (  5.20%)
    WallTime Delete             2.00 (  0.00%)       5.00 (-150.00%)

    stutter (times mmap latency during large amounts of IO)

                                    3.15.0-rc5            3.15.0-rc5
                                       vanilla       proportion-v1r4
    Unit >5ms Delays     80252.0000 (  0.00%)  81523.0000 ( -1.58%)
    Unit Mmap min            8.2118 (  0.00%)      8.3206 ( -1.33%)
    Unit Mmap mean          17.4614 (  0.00%)     17.2868 (  1.00%)
    Unit Mmap stddev        24.9059 (  0.00%)     34.6771 (-39.23%)
    Unit Mmap max         2811.6433 (  0.00%)   2645.1398 (  5.92%)
    Unit Mmap 90%           20.5098 (  0.00%)     18.3105 ( 10.72%)
    Unit Mmap 93%           22.9180 (  0.00%)     20.1751 ( 11.97%)
    Unit Mmap 95%           25.2114 (  0.00%)     22.4988 ( 10.76%)
    Unit Mmap 99%           46.1430 (  0.00%)     43.5952 (  5.52%)
    Unit Ideal Tput         85.2623 (  0.00%)     78.8906 (  7.47%)
    Unit Tput min           44.0666 (  0.00%)     43.9609 (  0.24%)
    Unit Tput mean          45.5646 (  0.00%)     45.2009 (  0.80%)
    Unit Tput stddev         0.9318 (  0.00%)      1.1084 (-18.95%)
    Unit Tput max           46.7375 (  0.00%)     46.7539 ( -0.04%)

    This patch (of 3):

    We would like to unregister the sb shrinker before ->kill_sb(). This will
    allow cached objects to be counted without a call to grab_super_passive()
    to update the ref count on the sb. We want to avoid locking during memory
    reclamation, especially when we are skipping the reclaim because we are
    out of cached objects.

    This is safe because grab_super_passive does a try-lock on the
    sb->s_umount now, and so if we are in the unmount process, it won't ever
    block. That means the deadlock and races we used to avoid by using
    grab_super_passive() now look like this:

    shrinker                            umount

    down_read(shrinker_rwsem)
                                        down_write(sb->s_umount)
                                        shrinker_unregister
                                          down_write(shrinker_rwsem)
                                          <blocked>

    grab_super_passive(sb)
      down_read_trylock(sb->s_umount)
        <fails, shrinker backs off>

    ....

    up_read(shrinker_rwsem)
                                          <unblocked>
                                        up_write(shrinker_rwsem)
                                        ->kill_sb()
                                        ....

    So it is safe to deregister the shrinker before ->kill_sb().
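
    A trimmed sketch of the reordering (deactivate_locked_super() in
    fs/super.c; surrounding code omitted):

            if (atomic_dec_and_test(&s->s_active)) {
                    /* Moved before ->kill_sb(): once the shrinker is
                     * gone, counting needs no grab_super_passive(). */
                    unregister_shrinker(&s->s_shrink);
                    fs->kill_sb(s);
                    ....
            }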

    Signed-off-by: Tim Chen
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Chinner
    Tested-by: Yuanhan Liu
    Cc: Bob Liu
    Cc: Jan Kara
    Acked-by: Rik van Riel
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Dave Chinner
     
  • commit 9e8c2af96e0d2d5fe298dd796fb6bc16e888a48d upstream.

    ... it does that itself (via kmap_atomic())

    Signed-off-by: Al Viro
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Al Viro
     
  • commit 67f9fd91f93c582b7de2ab9325b6e179db77e4d5 upstream.

    This patch removes read_cache_page_async() which wasn't really needed
    anywhere and simplifies the code around it a bit.

    read_cache_page_async() is useful when we want to read a page into the
    cache without waiting for the read to complete. That would happen when
    the 'filler' callback doesn't complete its read operation before
    releasing the page lock, and instead queues a different completion
    routine to do that. This never actually happened anywhere in the code.

    read_cache_page_async() had 3 different callers:

    - read_cache_page() which is the sync version, it would just wait for
    the requested read to complete using wait_on_page_read().

    - JFFS2 would call it from jffs2_gc_fetch_page(), but the filler
    function it supplied doesn't do any async reads, and would complete
    before the filler function returns - making it actually a sync read.

    - CRAMFS would call it using the read_mapping_page_async() wrapper, with
    a similar story to JFFS2 - the filler function doesn't do anything that
    resembles an async read and would always complete before the filler
    function returns.

    To sum it up, the code in mm/filemap.c never took advantage of having
    read_cache_page_async(). While there are filler callbacks that do async
    reads (such as the block one), we always called them via
    read_cache_page().

    This patch adds a mandatory wait for read to complete when adding a new
    page to the cache, and removes read_cache_page_async() and its wrappers.
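
    A sketch of the now-mandatory wait, as in the upstream commit
    (mm/filemap.c):

    static struct page *wait_on_page_read(struct page *page)
    {
            if (!IS_ERR(page)) {
                    /* The filler unlocks the page when the read completes. */
                    wait_on_page_locked(page);
                    if (!PageUptodate(page)) {
                            page_cache_release(page);
                            page = ERR_PTR(-EIO);
                    }
            }
            return page;
    }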

    Signed-off-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Sasha Levin
     
  • commit 0cd6144aadd2afd19d1aca880153530c52957604 upstream.

    shmem mappings already contain exceptional entries where swap slot
    information is remembered.

    To be able to store eviction information for regular page cache, prepare
    every site dealing with the radix trees directly to handle entries other
    than pages.

    The common lookup functions will filter out non-page entries and return
    NULL for page cache holes, just as before. But provide a raw version of
    the API which returns non-page entries as well, and switch shmem over to
    use it.
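
    A brief illustration of the distinction, using the mainline helpers
    (treat the exact names as illustrative for this tree):

            void *entry = radix_tree_lookup(&mapping->page_tree, index);

            if (radix_tree_exceptional_entry(entry)) {
                    /* Not a page: shmem keeps a swap slot here, and this
                     * series prepares for eviction entries as well. */
                    swp_entry_t swap = radix_to_swp_entry(entry);
                    /* ... handle the exceptional entry ... */
            } else {
                    /* A page, or NULL for a genuine page cache hole. */
                    struct page *page = entry;
            }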

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Johannes Weiner
     
  • commit e7b563bb2a6f4d974208da46200784b9c5b5a47e upstream.

    The radix tree hole searching code is only used for page cache, for
    example by the readahead code trying to get a picture of the area
    surrounding a fault.

    It sufficed to rely on the radix tree definition of holes, which is
    "empty tree slot". But this is about to change, as shadow page
    descriptors will be stored in the page cache after the actual pages get
    evicted from memory.

    Move the functions over to mm/filemap.c and make them native page cache
    operations, where they can later be adapted to handle the new definition
    of "page cache hole".

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Johannes Weiner
     
  • commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream.

    This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random,
    so further comparison with other approaches was needed. There are
    two things to consider when dealing with this, the cache hit rate and
    the latency of find_vma(). Improving the hit rate does not necessarily
    translate into finding the vma any faster, as the overhead of any fancy
    caching scheme can be too high to consider.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question. Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme does improve the ~50% hit rate by just adding a few more slots to
    the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   50.61% |            19.90 |
    | patched        |   73.45% |            13.58 |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   75.28% |            11.03 |
    | patched        |   88.09% |             9.31 |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   70.66% |            17.14 |
    | patched        |   91.15% |            12.57 |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while the baseline is
    just about non-existent. The cycle counts can fluctuate anywhere from
    ~60 to ~116 billion for the baseline scheme, but this approach reduces
    them considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |    1.06% |            91.54 |
    | patched        |   99.97% |            14.18 |
    +----------------+----------+------------------+
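
    A minimal sketch of the invalidation and replacement scheme described
    above, modeled on the mainline mm/vmacache.c (names and cache size may
    differ here):

    void vmacache_invalidate(struct mm_struct *mm)
    {
            /* Bumping the sequence number lazily invalidates every
             * thread's cache that shares this mm. */
            mm->vmacache_seqnum++;

            /* The rare 32-bit overflow forces a flush of all caches
             * sharing this address space. */
            if (unlikely(mm->vmacache_seqnum == 0))
                    vmacache_flush_all(mm);
    }

    void vmacache_update(unsigned long addr, struct vm_area_struct *newvma)
    {
            /* Replacement policy: slot indexed by the page number that
             * contains the virtual address in question. */
            int idx = VMACACHE_HASH(addr);

            current->vmacache[idx] = newvma;
    }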

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Davidlohr Bueso
     
  • commit 457c1b27ed56ec472d202731b12417bff023594a upstream.

    Currently, I am seeing the following when I `mount -t hugetlbfs /none
    /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's
    related to the fact that hugetlbfs is not correctly setting itself up
    in this state:

    Unable to handle kernel paging request for data at address 0x00000031
    Faulting instruction address: 0xc000000000245710
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA pSeries
    ....

    In KVM guests on Power, in a guest not backed by hugepages, we see the
    following:

    AnonHugePages:         0 kB
    HugePages_Total:       0
    HugePages_Free:        0
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:         64 kB

    HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
    are not supported at boot-time, but this is only checked in
    hugetlb_init(). Extract the check to a helper function, and use it in a
    few relevant places.
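
    The helper is essentially a boot-time feature test; a sketch along the
    lines of the upstream commit (placement, and the build guard the akpm
    note below mentions, may differ here):

    static inline int hugepages_supported(void)
    {
            /* HPAGE_SHIFT == 0 means the architecture found no hugepage
             * support at boot time. */
            return HPAGE_SHIFT != 0;
    }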

    This does make hugetlbfs not supported (not registered at all) in this
    environment. I believe this is fine, as there are no valid hugepages
    and that won't change at runtime.

    [akpm@linux-foundation.org: use pr_info(), per Mel]
    [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
    Signed-off-by: Nishanth Aravamudan
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Mel Gorman
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Nishanth Aravamudan
     
  • commit 2ff396be602f10b5eab8e73b24f20348fa2de159 upstream.

    We ran into a case on ppc64 running mariadb where io_getevents would
    return zeroed out I/O events. After adding instrumentation, it became
    clear that there was some missing synchronization between reading the
    tail pointer and the events themselves. This small patch fixes the
    problem in testing.
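
    The fix is a read barrier between loading the ring tail and loading the
    events it indexes; a trimmed sketch of the upstream change to fs/aio.c:

            ring = kmap_atomic(ctx->ring_pages[0]);
            head = ring->head;
            tail = ring->tail;
            kunmap_atomic(ring);

            /*
             * Ensure that once we've read the current tail pointer, we
             * also see the events that were stored up to the tail.
             */
            smp_rmb();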

    Thanks to Zach for helping to look into this, and suggesting the fix.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Benjamin LaHaise
    Signed-off-by: Jiri Slaby

    Jeff Moyer
     

18 Sep, 2014

3 commits

  • commit a07d322059db66b84c9eb4f98959df468e88b34b upstream.

    CIFS servers process nlink counts differently for files and directories.
    In cifs_rename(), if the request fails on the existing target, we
    try to remove it through cifs_unlink(), but this is not what we want
    to do for directories. As a result, the following sequence of commands

    mkdir {1,2}; mv -T 1 2; rmdir {1,2}; mkdir {1,2}; echo foo > 2/bar

    and the XFS test generic/023 fail with an -ENOENT error. That's because
    the second mkdir reuses the existing inode (the target inode of the
    mv -T command), which still carries the S_DEAD flag.

    Fix this by checking whether the target is a directory or not and
    calling cifs_rmdir() rather than cifs_unlink() for directories.
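
    A trimmed sketch of the fix, as in the upstream change to
    fs/cifs/inode.c:

            if (S_ISDIR(target_dentry->d_inode->i_mode))
                    tmprc = cifs_rmdir(target_dir, target_dentry);
            else
                    tmprc = cifs_unlink(target_dir, target_dentry);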

    Cc:
    Signed-off-by: Pavel Shilovsky
    Signed-off-by: Steve French
    Signed-off-by: Jiri Slaby

    Pavel Shilovsky
     
  • commit f736906a7669a77cf8cabdcbcf1dc8cb694e12ef upstream.

    The existing code calls server->ops->close(), which is not
    right. This causes XFS test generic/310 to fail. Fix this
    by using the server->ops->closedir() function instead.

    Signed-off-by: Dan Carpenter
    Signed-off-by: Pavel Shilovsky
    Signed-off-by: Steve French
    Signed-off-by: Jiri Slaby

    Pavel Shilovsky
     
  • commit 1bbe4997b13de903c421c1cc78440e544b5f9064 upstream.

    The existing code uses the old MAX_NAME constant. This causes
    XFS test generic/013 to fail. Fix it by replacing MAX_NAME with
    PATH_MAX, which SMB1 uses. Also remove the now unused MAX_NAME
    constant definition.

    Signed-off-by: Pavel Shilovsky
    Signed-off-by: Steve French
    Signed-off-by: Jiri Slaby

    Pavel Shilovsky
     

17 Sep, 2014

22 commits

  • commit b46799a8f28c43c5264ac8d8ffa28b311b557e03 upstream.

    When we request a rename, we also need to update the attributes
    of both the source and target parent directories. Not doing so
    causes the generic/309 xfstest to fail on SMB2 mounts. Fix this
    by marking these directories for forced revalidation.
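
    The usual way cifs forces revalidation is to zero the cached inode
    timestamp; a sketch of the idea (hedged - the exact hunk may differ):

            /* force revalidation of the parents' info when next needed */
            CIFS_I(source_dir)->time = 0;
            CIFS_I(target_dir)->time = 0;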

    Signed-off-by: Pavel Shilovsky
    Signed-off-by: Steve French
    Signed-off-by: Jiri Slaby

    Pavel Shilovsky
     
  • commit 18f39e7be0121317550d03e267e3ebd4dbfbb3ce upstream.

    As Raphael Geissert pointed out, tcon_error_exit can dereference tcon
    and there is one path in which tcon can be null.

    Signed-off-by: Steve French
    Reported-by: Raphael Geissert
    Signed-off-by: Jiri Slaby

    Steve French
     
  • commit 038bc961c31b070269ecd07349a7ee2e839d4fec upstream.

    If we get into read_into_pages() from cifs_readv_receive() and then
    lose the network, we issue cifs_reconnect, which moves all mids to
    a private list and issues their callbacks. The callback of the async
    read request sets the mid to retry, frees it and wakes up the process
    that waits on the rdata completion.

    After the connection is established we return from read_into_pages()
    with a short read, use the mid that was freed before and try to read
    the remaining data from the newly created socket. Neither action is
    what we want to do. In reconnect cases (-EAGAIN) we should not
    mask off the error with a short read but should return the error
    code instead.

    Acked-by: Jeff Layton
    Signed-off-by: Pavel Shilovsky
    Signed-off-by: Steve French
    Signed-off-by: Jiri Slaby

    Pavel Shilovsky
     
  • commit 21496687a79424572f46a84c690d331055f4866f upstream.

    The existing mapping causes the unlink() call to return an error after the
    delete operation. Changing the mapping to -EACCES makes the client process
    the call like the CIFS protocol does - reset the dos attributes with the
    ATTR_READONLY flag masked off and retry the operation.

    Signed-off-by: Pavel Shilovsky
    Signed-off-by: Steve French
    Signed-off-by: Jiri Slaby

    Pavel Shilovsky
     
  • commit 85e584da3212140ee80fd047f9058bbee0bc00d5 upstream.

    xfs is using truncate_pagecache_range to invalidate the page cache
    during DIO reads. This is different from the other filesystems, which
    only invalidate pages during DIO writes.

    truncate_pagecache_range is meant to be used when we are freeing the
    underlying data structs from disk, so it will zero any partial
    ranges in the page. This means a DIO read can zero out part of the
    page cache page, and it is possible the page will stay in cache.

    Buffered reads will then find an up-to-date page with zeros instead of
    the data actually on disk.

    This patch fixes things by using invalidate_inode_pages2_range
    instead. It preserves the page cache invalidation, but won't zero
    any pages.

    [dchinner: catch error and warn if it fails. Comment.]
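
    A trimmed sketch of the replacement call in the read path, as described
    above (per the dchinner note, a failure is caught and warned about):

            /*
             * Invalidate whole pages, without zeroing partial ranges.
             * This should never fail on XFS; warn if it does.
             */
            ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
                            pos >> PAGE_CACHE_SHIFT, -1);
            WARN_ON_ONCE(ret);
            ret = 0;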

    Signed-off-by: Chris Mason
    Reviewed-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Signed-off-by: Jiri Slaby

    Chris Mason
     
  • commit 834ffca6f7e345a79f6f2e2d131b0dfba8a4b67a upstream.

    Similar to direct IO reads, direct IO writes are using
    truncate_pagecache_range to invalidate the page cache. This is
    incorrect due to the sub-block zeroing in the page cache that
    truncate_pagecache_range() triggers.

    This patch fixes things by using invalidate_inode_pages2_range
    instead. It preserves the page cache invalidation, but won't zero
    any pages.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Signed-off-by: Jiri Slaby

    Dave Chinner
     
  • commit 22e757a49cf010703fcb9c9b4ef793248c39b0c2 upstream.

    generic/263 is failing fsx at this point with a page spanning
    EOF that cannot be invalidated. The operations are:

    1190 mapwrite 0x52c00 thru 0x5e569 (0xb96a bytes)
    1191 mapread 0x5c000 thru 0x5d636 (0x1637 bytes)
    1192 write 0x5b600 thru 0x771ff (0x1bc00 bytes)

    where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
    write attempts to invalidate the cached page over this range, it
    fails with -EBUSY and so any attempt to do page invalidation fails.

    The real question is this: Why can't that page be invalidated after
    it has been written to disk and cleaned?

    Well, there's data in the first two buffers in the page (1k block
    size, 4k page), but the third buffer on the page (i.e. beyond EOF)
    is failing drop_buffers because its bh->b_state == 0x3, which is
    BH_Uptodate | BH_Dirty. IOWs, there are dirty buffers beyond EOF. Say
    what?

    OK, set_buffer_dirty() is called on all buffers from
    __set_page_dirty_buffers(), regardless of whether the buffer is
    beyond EOF or not, which means that when we get to ->writepage,
    we have buffers marked dirty beyond EOF that we need to clean.
    So, we need to implement our own .set_page_dirty method that
    doesn't dirty buffers beyond EOF.

    This is messy because the buffer code is not meant to be shared
    and it has interesting locking issues on the buffer dirty bits.
    So just copy and paste it and then modify it to suit what we need.

    Note: the solutions the other filesystems and generic block code use
    of marking the buffers clean in ->writepage does not work for XFS.
    It still leaves dirty buffers beyond EOF and invalidations still
    fail. Hence rather than play whack-a-mole, this patch simply
    prevents those buffers from being dirtied in the first place.
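
    The core of the copied-and-modified helper is the EOF check; a
    condensed sketch (based on the upstream xfs_vm_set_page_dirty();
    locking and the tail of the function trimmed):

            /* Do not dirty buffers beyond EOF. */
            end_offset = min_t(unsigned long long,
                            (xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT,
                            i_size_read(inode));

            offset = (loff_t)page->index << PAGE_CACHE_SHIFT;
            bh = head = page_buffers(page);
            do {
                    if (offset < end_offset)
                            set_buffer_dirty(bh);
                    bh = bh->b_this_page;
                    offset += 1 << inode->i_blkbits;
            } while (bh != head);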

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner
    Signed-off-by: Jiri Slaby

    Dave Chinner
     
  • commit 5fd364fee81a7888af806e42ed8a91c845894f2d upstream.

    When running xfs/305, I noticed that quotacheck was flushing dquot
    buffers that did not have the xfs_dquot_buf_ops verifiers attached:

    XFS (vdb): _xfs_buf_ioapply: no ops on block 0x1dc8/0x1dc8
    ffff880052489000: 44 51 01 04 00 00 65 b8 00 00 00 00 00 00 00 00 DQ....e.........
    ffff880052489010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    ffff880052489020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    ffff880052489030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    CPU: 1 PID: 2376 Comm: mount Not tainted 3.16.0-rc2-dgc+ #306
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    ffff88006fe38000 ffff88004a0ffae8 ffffffff81cf1cca 0000000000000001
    ffff88004a0ffb88 ffffffff814d50ca 000010004a0ffc70 0000000000000000
    ffff88006be56dc4 0000000000000021 0000000000001dc8 ffff88007c773d80
    Call Trace:
    [] dump_stack+0x45/0x56
    [] _xfs_buf_ioapply+0x3ca/0x3d0
    [] ? wake_up_state+0x20/0x20
    [] ? xfs_bdstrat_cb+0x55/0xb0
    [] xfs_buf_iorequest+0x6b/0xd0
    [] xfs_bdstrat_cb+0x55/0xb0
    [] __xfs_buf_delwri_submit+0x15b/0x220
    [] ? xfs_buf_delwri_submit+0x30/0x90
    [] xfs_buf_delwri_submit+0x30/0x90
    [] xfs_qm_quotacheck+0x17d/0x3c0
    [] xfs_qm_mount_quotas+0x151/0x1e0
    [] xfs_mountfs+0x56c/0x7d0
    [] xfs_fs_fill_super+0x2c2/0x340
    [] mount_bdev+0x194/0x1d0
    [] ? xfs_finish_flags+0x170/0x170
    [] xfs_fs_mount+0x15/0x20
    [] mount_fs+0x39/0x1b0
    [] vfs_kern_mount+0x67/0x120
    [] do_mount+0x23e/0xad0
    [] ? __get_free_pages+0xe/0x50
    [] ? copy_mount_options+0x36/0x150
    [] SyS_mount+0x83/0xc0
    [] tracesys+0xdd/0xe2

    This was caused by dquot buffer readahead not attaching a verifier
    structure to the buffer when readahead was issued, resulting in the
    followup read of the buffer finding a valid buffer and so not
    attaching new verifiers to the buffer as part of the read.

    Also, when a verifier failure occurs, we then read the buffer
    without verifiers. Attach the verifiers manually after this read so
    that if the buffer is then written it will be verified that the
    corruption has been repaired.

    Further, when flushing a dquot we don't ask for a verifier when
    reading in the dquot buffer the dquot belongs to. Most of the time
    this isn't an issue because the buffer is still cached, but when it
    is not cached it will result in writing the dquot buffer without
    having the verifier attached.
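
    A sketch of the readahead part of the fix (names as in the upstream
    commit; context trimmed): pass the dquot verifier when the buffer is
    first read ahead, so the followup read finds ops already attached.

            xfs_buf_readahead(mp->m_ddev_targp, dqp->q_blkno,
                              mp->m_quotainfo->qi_dqchunklen,
                              &xfs_dquot_buf_ops);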

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Signed-off-by: Jiri Slaby

    Dave Chinner
     
  • commit 67dc288c21064b31a98a53dc64f6b9714b819fd6 upstream.

    Crash testing of CRC enabled filesystems has resulted in a number of
    reports of bad CRCs being detected after the filesystem was mounted.
    Errors such as the following were being seen:

    XFS (sdb3): Mounting V5 Filesystem
    XFS (sdb3): Starting recovery (logdev: internal)
    XFS (sdb3): Metadata CRC error detected at xfs_agf_read_verify+0x5a/0x100 [xfs], block 0x1
    XFS (sdb3): Unmount and run xfs_repair
    XFS (sdb3): First 64 bytes of corrupted metadata buffer:
    ffff880136ffd600: 58 41 47 46 00 00 00 01 00 00 00 00 00 0f aa 40 XAGF...........@
    ffff880136ffd610: 00 02 6d 53 00 02 77 f8 00 00 00 00 00 00 00 01 ..mS..w.........
    ffff880136ffd620: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 03 ................
    ffff880136ffd630: 00 00 00 04 00 08 81 d0 00 08 81 a7 00 00 00 00 ................
    XFS (sdb3): metadata I/O error: block 0x1 ("xfs_trans_read_buf_map") error 74 numblks 1

    The errors were typically being seen in AGF, AGI and their related
    btree block buffers some time after log recovery had run. Often it
    wasn't until later subsequent mounts that the problem was
    discovered. The common symptom was a buffer with the correct
    contents, but a CRC and an LSN that matched an older version of the
    contents.

    Some debug added to _xfs_buf_ioapply() indicated that buffers were
    being written without verifiers attached to them from log recovery,
    and Jan Kara isolated the cause to log recovery readahead and its
    interactions with buffers that had a more recent LSN on disk than
    the transaction being recovered. In this case, the buffer did not
    get a verifier attached, and so when the second phase of log
    recovery ran and recovered EFIs and unlinked inodes, the buffers
    were modified and written without the verifier running. Hence they
    had up-to-date contents, but stale LSNs and CRCs.

    Fix it by attaching verifiers to buffers we skip due to future LSN
    values so they don't escape into the buffer cache without the
    correct verifier attached.
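
    A trimmed sketch of the fix in buffer recovery (based on the upstream
    xlog_recover_buffer_pass2(); names may differ here):

            lsn = xlog_recover_get_buf_lsn(mp, bp);
            if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) {
                    /* On-disk buffer is newer than the transaction: skip
                     * replay, but attach the right verifier so later
                     * writes recalculate the CRC and LSN. */
                    xlog_recover_validate_buf_type(mp, bp, buf_f);
                    goto out_release;
            }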

    This patch is based on analysis and a patch from Jan Kara.

    Reported-by: Jan Kara
    Reported-by: Fanael Linithien
    Reported-by: Grozdan
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dave Chinner
    Signed-off-by: Jiri Slaby

    Dave Chinner
     
  • commit ffbc6f0ead47fa5a1dc9642b0331cb75c20a640e upstream.

    Since March 2009 the kernel has behaved such that if no MS_..ATIME
    flags are passed then it defaults to relatime.

    Defaulting to relatime instead of the existing atime state during a
    remount is silly, and causes problems in practice for people who don't
    specify any MS_...ATIME flags and expect to get the existing filesystem
    atime setting. Those users may encounter a permission error because the
    default atime setting does not work.

    A default that does not work and causes permission problems is
    ridiculous, so preserve the existing value to have a default
    atime setting that is always guaranteed to work.

    Using the default atime setting in this way is particularly
    interesting for applications built to run in restricted userspace
    environments without /proc mounted, as the existing atime mount
    options of a filesystem can not be read from /proc/mounts.

    In practice this fixes user space programs that use the default atime
    setting on remount and that are broken by the permission checks
    keeping less privileged users from changing more privileged users'
    atime settings.
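
    A sketch of the remount default, along the lines of the upstream change
    to do_mount() in fs/namespace.c (treat the mask name as illustrative):

            /* The default atime for remount is preservation */
            if ((flags & MS_REMOUNT) &&
                ((flags & (MS_NOATIME | MS_NODIRATIME | MS_RELATIME |
                           MS_STRICTATIME)) == 0)) {
                    mnt_flags &= ~MNT_ATIME_MASK;
                    mnt_flags |= path.mnt->mnt_flags & MNT_ATIME_MASK;
            }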

    Acked-by: Serge E. Hallyn
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Jiri Slaby

    Eric W. Biederman
     
  • commit 7d8b6c63751cfbbe5eef81a48c22978b3407a3ad upstream.

    This is effectively a revert of 7b9a7ec565505699f503b4fcf61500dceb36e744
    plus fixing it a different way...

    We found, when trying to run an application from an application which
    had dropped privs, that the kernel does security checks on undefined
    capability bits. This was ESPECIALLY difficult to debug as those
    undefined bits are hidden from /proc/$PID/status.

    Consider a root application which drops all capabilities from ALL 4
    capability sets. We assume, since the application is going to set
    eff/perm/inh from an array, that it will clear not only the defined caps
    less than CAP_LAST_CAP, but also the higher 28ish bits which are
    undefined future capabilities.

    The BSET gets cleared differently. Instead it is cleared one bit at a
    time. The problem here is that in security/commoncap.c::cap_task_prctl()
    we actually check the validity of a capability being read. So any task
    which attempts to 'read all things set in bset' followed by 'unset all
    things set in bset' will not even attempt to unset the undefined bits
    higher than CAP_LAST_CAP.

    So the 'parent' will look something like:
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000
    CapBnd: ffffffc000000000

    All of this 'should' be fine. Given that these are undefined bits that
    aren't supposed to have anything to do with permissions. But they do...

    So let's now consider a task which cleared the eff/perm/inh completely
    and cleared all of the valid caps in the bset (but not the invalid caps
    it couldn't read out of the kernel). We know that this is exactly what
    the libcap-ng library does and what the go capabilities library does.
    They both leave you in that above situation if you try to clear all of
    your capabilities from all 4 sets. If that root task calls execve()
    the child task will pick up all caps not blocked by the bset. The bset
    however does not block bits higher than CAP_LAST_CAP. So now the child
    task has bits in eff which are not in the parent. These are
    'meaningless' undefined bits, but still bits which the parent doesn't
    have.

    The problem is now in cred_cap_issubset() (or any operation which does a
    subset test) as the child, while a subset for valid cap bits, is not a
    subset for invalid cap bits! So now we set, during commit creds, that
    the child is not dumpable, given it is 'more priv' than its parent. It
    also means the parent cannot ptrace the child, and other stupidity.

    The solution here:
    1) stop hiding capability bits in status
    This makes debugging easier!

    2) stop giving any task undefined capability bits. It's simple: if you
    don't put those invalid bits in CAP_FULL_SET you won't get them in init
    and you won't get them in any other task either.
    This fixes the cap_issubset() tests and resulting fallout (which
    made the init task in a docker container untraceable among other
    things)

    3) mask out undefined bits when sys_capset() is called, as it might use
    ~0, ~0 to denote 'all capabilities' for backward/forward compatibility
    (see the sketch after this list).
    This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.

    4) mask out undefined bit when we read a file capability off of disk as
    again likely all bits are set in the xattr for forward/backward
    compatibility.
    This lets 'setcap all+pe /bin/bash; /bin/bash' run
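
    A minimal illustration of points (3) and (4), assuming the existing
    cap_intersect() helper and the now bounded CAP_FULL_SET (a sketch
    rather than the exact hunks):

            /* Drop bits above CAP_LAST_CAP before they enter the kernel. */
            effective   = cap_intersect(effective, CAP_FULL_SET);
            permitted   = cap_intersect(permitted, CAP_FULL_SET);
            inheritable = cap_intersect(inheritable, CAP_FULL_SET);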

    Signed-off-by: Eric Paris
    Reviewed-by: Kees Cook
    Cc: Andrew Vagin
    Cc: Andrew G. Morgan
    Cc: Serge E. Hallyn
    Cc: Kees Cook
    Cc: Steve Grubb
    Cc: Dan Walsh
    Signed-off-by: James Morris
    Signed-off-by: Jiri Slaby

    Eric Paris
     
  • commit aee7af356e151494d5014f57b33460b162f181b5 upstream.

    In the presence of delegations, we can no longer assume that the
    state->n_rdwr, state->n_rdonly, state->n_wronly reflect the open
    stateid share mode, and so we need to calculate the initial value
    for calldata->arg.fmode using the state->flags.
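
    A sketch of the calculation, assuming the NFS_O_*_STATE flag bits from
    fs/nfs (context trimmed; the exact hunk may differ):

            calldata->arg.fmode = 0;
            if (test_bit(NFS_O_RDONLY_STATE, &state->flags))
                    calldata->arg.fmode |= FMODE_READ;
            if (test_bit(NFS_O_WRONLY_STATE, &state->flags))
                    calldata->arg.fmode |= FMODE_WRITE;
            if (test_bit(NFS_O_RDWR_STATE, &state->flags))
                    calldata->arg.fmode |= FMODE_READ | FMODE_WRITE;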

    Reported-by: James Drews
    Fixes: 88069f77e1ac5 (NFSv41: Fix a potential state leakage when...)
    Signed-off-by: Trond Myklebust
    Signed-off-by: Jiri Slaby

    Trond Myklebust
     
  • commit 3c45ddf823d679a820adddd53b52c6699c9a05ac upstream.

    The current code always selects XPRT_TRANSPORT_BC_TCP for the back
    channel, even when the forward channel was not TCP (eg, RDMA). When
    a 4.1 mount is attempted with RDMA, the server panics in the TCP BC
    code when trying to send CB_NULL.

    Instead, construct the transport protocol number from the forward
    channel transport or'd with XPRT_TRANSPORT_BC. Transports that do
    not support bi-directional RPC will not have registered a "BC"
    transport, causing create_backchannel_client() to fail immediately.
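
    A sketch of the construction described above (the xcl_ident field comes
    from this upstream series; treat the names as illustrative):

            /* Derive the bc protocol from the forward channel's class
             * instead of hardcoding XPRT_TRANSPORT_BC_TCP. */
            args.protocol = conn->cb_xprt->xpt_class->xcl_ident |
                            XPRT_TRANSPORT_BC;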

    Fixes: https://bugzilla.linux-nfs.org/show_bug.cgi?id=265
    Signed-off-by: Chuck Lever
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Jiri Slaby

    Chuck Lever
     
  • commit d9499a95716db0d4bc9b67e88fd162133e7d6b08 upstream.

    A memory allocation failure could cause nfsd_startup_generic to fail, in
    which case nfsd_users would be incorrectly left elevated.

    After nfsd restarts, nfsd_startup_generic will then succeed without doing
    anything--the first consequence is likely nfs4_start_net finding a bad
    laundry_wq and crashing.

    Signed-off-by: Kinglong Mee
    Fixes: 4539f14981ce "nfsd: replace boolean nfsd_up flag by users counter"
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Jiri Slaby

    Kinglong Mee
     
  • commit db9ee220361de03ee86388f9ea5e529eaad5323c upstream.

    It turns out that there are some serious problems with the on-disk
    format of journal checksum v2. The foremost is that the function to
    calculate descriptor tag size returns sizes that are too big. This
    causes alignment issues on some architectures and is compounded by the
    fact that some parts of jbd2 use the structure size (incorrectly) to
    determine the presence of a 64bit journal instead of checking the
    feature flags.

    Therefore, introduce journal checksum v3, which enlarges the
    descriptor block tag format to allow for full 32-bit checksums of
    journal blocks, fix the journal tag function to return the correct
    sizes, and fix the jbd2 recovery code to use feature flags to
    determine 64bitness.

    Add a few function helpers so we don't have to open-code quite so
    many pieces.

    Switching to the 16-byte tag size was found to increase journal size
    overhead by a maximum of 0.1% when converting a 32-bit journal with no
    checksumming to a 32-bit journal with checksum v3 enabled.
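
    For reference, the enlarged v3 descriptor tag as defined by the
    upstream commit (include/linux/jbd2.h; shown as a sketch):

    typedef struct journal_block_tag3_s
    {
            __be32          t_blocknr;      /* The on-disk block number */
            __be32          t_flags;        /* See below */
            __be32          t_blocknr_high; /* most-significant high 32bits of blocknr */
            __be32          t_checksum;     /* crc32c(uuid+seq+block) */
    } journal_block_tag3_t;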

    Signed-off-by: Darrick J. Wong
    Reported-by: TR Reardon
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Jiri Slaby

    Darrick J. Wong
     
  • commit 022eaa7517017efe4f6538750c2b59a804dc7df7 upstream.

    When recovering the journal, don't fall into an infinite loop if we
    encounter a corrupt journal block. Instead, just skip the block and
    return an error, which fails the mount and thus forces the user to run
    a full filesystem fsck.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Jiri Slaby

    Darrick J. Wong
     
  • commit 6603120e96eae9a5d6228681ae55c7fdc998d1bb upstream.

    In the delalloc case, i_disksize may be less than i_size, so we
    have to update i_disksize each time we allocate and submit some
    blocks beyond i_disksize. We weren't doing this on the error paths,
    so fix this.

    testcase: xfstest generic/019

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Jiri Slaby

    Dmitry Monakhov
     
  • commit 38c1c2e44bacb37efd68b90b3f70386a8ee370ee upstream.

    The crash is

    ------------[ cut here ]------------
    kernel BUG at fs/btrfs/extent_io.c:2124!
    [...]
    Workqueue: btrfs-endio normal_work_helper [btrfs]
    RIP: 0010:[] [] end_bio_extent_readpage+0xb45/0xcd0 [btrfs]

    This is in fact a regression.

    It is because we forgot to increase @offset properly when reading a
    corrupted block, so that @offset remains stale, and this leads to
    checksum errors while reading the remaining blocks queued up in the same
    bio, and then ends up hitting the above BUG_ON.

    Reported-by: Chris Murphy
    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason
    Signed-off-by: Jiri Slaby

    Liu Bo
     
  • commit ce62003f690dff38d3164a632ec69efa15c32cbf upstream.

    When failing to allocate space for the whole compressed extent, we'll
    fall back to uncompressed IO, but we've forgotten to redirty the pages
    which belong to this compressed extent, and these 'clean' pages will
    simply skip the 'submit' part and go to endio directly; in the end we
    get data corruption as we write nothing.

    Signed-off-by: Liu Bo
    Tested-By: Martin Steigerwald
    Signed-off-by: Chris Mason
    Signed-off-by: Jiri Slaby

    Liu Bo
     
  • commit 6f7ff6d7832c6be13e8c95598884dbc40ad69fb7 upstream.

    Before processing the extent buffer, acquire a read lock on it, so
    that we're safe against concurrent updates on the extent buffer.
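
    A sketch of the locking pattern (helper names from btrfs; the exact
    call site is trimmed):

            btrfs_tree_read_lock(eb);
            btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
            /* ... walk the extent buffer's items safely ... */
            btrfs_tree_read_unlock_blocking(eb);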

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason
    Signed-off-by: Jiri Slaby

    Filipe Manana
     
  • commit 27b9a8122ff71a8cadfbffb9c4f0694300464f3b upstream.

    Under rare circumstances we can end up leaving 2 versions of a checksum
    for the same file extent range.

    The reason for this is that after calling btrfs_next_leaf we process
    slot 0 of the leaf it returns, instead of processing the slot set in
    path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
    btrfs_next_leaf() releases the path and before it searches for the next
    leaf, another task might cause a split of the next leaf, which migrates
    some of its keys to the leaf we were processing before calling
    btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
    same leaf but with path->slots[0] having a slot number corresponding
    to the first new key it got, that is, a slot number that didn't exist
    before calling btrfs_next_leaf(), as the leaf now has more keys than
    it had before. So we must really process the returned leaf starting at
    path->slots[0] always, as it isn't always 0, and the key at slot 0 can
    have an offset much lower than our search offset/bytenr.

    For example, consider the following scenario, where we have:

    sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
    four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472

    Leaf N:

    slot = 0                                 slot = btrfs_header_nritems() - 1
    |-------------------------------------------------------------------|
    | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
    |-------------------------------------------------------------------|

    Leaf N + 1:

    slot = 0                                 slot = btrfs_header_nritems() - 1
    |--------------------------------------------------------------------|
    | [(CSUM CSUM 40161280), size 32] ... [(CSUM CSUM 40615936), size 8] |
    |--------------------------------------------------------------------|

    Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
    find the next highest key, which releases the current path and then searches
    for that next key. However after releasing the path and before finding that
    next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
    to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
    btrfs_next_leaf() will return us a path again with leaf N but with the slot
    pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
    is then:

    slot = 0                          slot = btrfs_header_nritems() - 2   slot = btrfs_header_nritems() - 1
    |----------------------------------------------------------------------------------------------------|
    | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4]   [(CSUM CSUM 40161280), size 32] |
    |----------------------------------------------------------------------------------------------------|

    And incorrectly using slot 0 makes us set next_offset to 39239680 and we
    jump into the "insert:" label, which will set tmp to:

    tmp = min((sums->len - total_bytes) >> blocksize_bits,
              (next_offset - file_key.offset) >> blocksize_bits)
        = min((16384 - 0) >> 12, (39239680 - 40157184) >> 12)
        = min(4, ((u64)-917504 == 18446744073708634112) >> 12)
        = 4

    and

    ins_size = csum_size * tmp = 4 * 4 = 16 bytes.

    In other words, we insert a new csum item in the tree with key
    (CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
    for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
    because the item with key (CSUM CSUM 40161280) (the one that was moved from
    leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
    bytes of our data and won't get those old checksums removed.

    So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
    and breaks the logical rule:

    Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover

    An obvious bad effect of this is that a subsequent csum tree lookup to get
    the checksum of any of the blocks with logical offset of 40161280, 40165376
    or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.
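
    A trimmed sketch of the fix (btrfs_csum_file_blocks() in
    fs/btrfs/file-item.c; surrounding logic omitted): read the key from the
    slot btrfs_next_leaf() handed back, not from slot 0.

            ret = btrfs_next_leaf(root, path);
            ....
            leaf = path->nodes[0];
            /* Was: btrfs_item_key_to_cpu(leaf, &found_key, 0); */
            btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);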

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason
    Signed-off-by: Jiri Slaby

    Filipe Manana
     
  • commit 4eb1f66dce6c4dc28dd90a7ffbe6b2b1cb08aa4e upstream.

    We've got bug reports that btrfs crashes when quota is enabled on
    32bit kernel, typically with the Oops like below:
    BUG: unable to handle kernel NULL pointer dereference at 00000004
    IP: [] find_parent_nodes+0x360/0x1380 [btrfs]
    *pde = 00000000
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S W 3.15.2-1.gd43d97e-default #1
    Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs]
    task: f1478130 ti: f147c000 task.ti: f147c000
    EIP: 0060:[] EFLAGS: 00010213 CPU: 0
    EIP is at find_parent_nodes+0x360/0x1380 [btrfs]
    EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000
    ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38
    DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690
    Stack:
    00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050
    00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000
    00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000
    Call Trace:
    [] __btrfs_find_all_roots+0x9d/0xf0 [btrfs]
    [] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs]
    [] normal_work_helper+0xc8/0x270 [btrfs]
    [] process_one_work+0x11b/0x390
    [] worker_thread+0x101/0x340
    [] kthread+0x9b/0xb0
    [] ret_from_kernel_thread+0x21/0x30
    [] kthread_create_on_node+0x110/0x110

    This indicates a NULL corruption in the prefs_delayed list. Further
    investigation and bisection pointed to the call of ulist_add_merge()
    as the source of the corruption.

    ulist_add_merge() takes a u64 as aux and writes a 64bit value into
    old_aux. The callers of this function in backref.c, however, pass a
    pointer of a pointer to old_aux. That is, the function overwrites a
    64bit value through a 32bit pointer. This caused a NULL in the adjacent
    variable, in this case, prefs_delayed.

    Here is a quick attempt to band-aid over this: a new function,
    ulist_add_merge_ptr(), is introduced to pass/store a pointer
    value properly instead of a u64. There are still ugly void ** casts
    remaining in the callers because void ** cannot be taken implicitly.
    But it's safer than an explicit cast to u64, anyway.
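
    A sketch of the band-aid (fs/btrfs/ulist.h in the upstream commit; the
    64-bit variant is a plain cast):

    #if BITS_PER_LONG == 32
    static inline int ulist_add_merge_ptr(struct ulist *ulist, u64 val,
                                          void *aux, void **old_aux,
                                          gfp_t gfp_mask)
    {
            u64 old64 = (uintptr_t)*old_aux;
            int ret = ulist_add_merge(ulist, val, (uintptr_t)aux,
                                      &old64, gfp_mask);
            /* Store back at pointer width, not as a raw u64. */
            *old_aux = (void *)(uintptr_t)old64;
            return ret;
    }
    #endif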

    Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046
    Signed-off-by: Takashi Iwai
    Signed-off-by: Chris Mason
    Signed-off-by: Jiri Slaby

    Takashi Iwai
     

16 Sep, 2014

2 commits

  • commit 99d263d4c5b2f541dfacb5391e22e8c91ea982a6 upstream.

    Josef Bacik found a performance regression between 3.2 and 3.10 and
    narrowed it down to commit bfcfaa77bdf0 ("vfs: use 'unsigned long'
    accesses for dcache name comparison and hashing"). He reports:

    "The test case is essentially

    for (i = 0; i < 1000000; i++)
            mkdir("a$i");

    On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
    dir/sec with 3.10. This is because we spend waaaaay more time in
    __d_lookup on 3.10 than in 3.2.

    The new hashing function for strings is suboptimal for <
    sizeof(unsigned long) string names (and hell even > sizeof(unsigned
    long) string names that I've tested). I broke out the old hashing
    function and the new one into a userspace helper to get real numbers
    and this is what I'm getting:

    Old hash table had 1000000 entries, 0 dupes, 0 max dupes
    New hash table had 12628 entries, 987372 dupes, 900 max dupes
    We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash

    My test does the hash, and then does the d_hash into an integer pointer
    array the same size as the dentry hash table on my system, and then
    just increments the value at the address we got to see how many
    entries we overlap with.

    As you can see the old hash function ended up with all 1 million
    entries in their own bucket, whereas the new one they are only
    distributed among ~12.5k buckets, which is why we're using so much
    more CPU in __d_lookup".

    The reason for this hash regression is two-fold:

    - On 64-bit architectures the down-mixing of the original 64-bit
    word-at-a-time hash into the final 32-bit hash value is very
    simplistic and suboptimal, and just adds the two 32-bit parts
    together.

    In particular, because there is no bit shuffling and the mixing
    boundary is also a byte boundary, similar character patterns in the
    low and high word easily end up just canceling each other out.

    - the old byte-at-a-time hash mixed each byte into the final hash as it
    hashed the path component name, resulting in the low bits of the hash
    generally being a good source of hash data. That is not true for the
    word-at-a-time case, and the hash data is distributed among all the
    bits.

    The fix is the same in both cases: do a better job of mixing the bits up
    and using as much of the hash data as possible. We already have the
    "hash_32|64()" functions to do that.

    Reported-by: Josef Bacik
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Chris Mason
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Linus Torvalds
     
  • commit 482db9066199813d6b999b65a3171afdbec040b6 upstream.

    D_HASH{MASK,BITS} are used once each, both in the same function (d_hash()).
    At this point they are actively misleading - they imply that the values are
    compile-time constants, which is no longer true.

    Signed-off-by: Al Viro
    Signed-off-by: Jiri Slaby

    Al Viro
     

15 Sep, 2014

1 commit

  • commit c99d1e6e83b06744c75d9f5e491ed495a7086b7b upstream.

    If we suffer a block allocation failure (for example due to a memory
    allocation failure), it's possible that we will call
    ext4_discard_allocated_blocks() before we've actually allocated any
    blocks. In that case, fe_len and fe_start in ac->ac_f_ex will still
    be zero, and this will result in mb_free_blocks(inode, e4b, 0, 0)
    triggering the BUG_ON on mb_free_blocks():

    BUG_ON(last >= (sb->s_blocksize << 3));

    Fix this by bailing out of ext4_discard_allocated_blocks() if fe_len
    is zero.

    Also fix a missing ext4_mb_unload_buddy() call in
    ext4_discard_allocated_blocks().
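
    A trimmed sketch of the bail-out (ext4_discard_allocated_blocks() in
    fs/ext4/mballoc.c; locking and error handling omitted):

            if (pa == NULL) {
                    if (ac->ac_f_ex.fe_len == 0)
                            return;         /* nothing was ever allocated */
                    err = ext4_mb_load_buddy(ac->ac_sb,
                                             ac->ac_f_ex.fe_group, &e4b);
                    ....
                    mb_free_blocks(ac->ac_inode, &e4b,
                                   ac->ac_f_ex.fe_start, ac->ac_f_ex.fe_len);
                    ext4_mb_unload_buddy(&e4b);     /* the missing unload */
                    return;
            }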

    Google-Bug-Id: 16844242

    Fixes: 86f0afd463215fc3e58020493482faa4ac3a4d69
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org
    Signed-off-by: Jiri Slaby

    Theodore Ts'o