01 Sep, 2017

1 commit

  • We are doing a last second memory allocation attempt before calling
    out_of_memory(). But since slab shrinker functions might indirectly
    wait for other threads' __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory
    allocations via sleeping locks, calling slab shrinker functions from
    node_reclaim() from get_page_from_freelist() with oom_lock held can
    deadlock. Therefore, make sure that the last second memory
    allocation attempt does not call slab shrinker functions.

    Link: http://lkml.kernel.org/r/1503577106-9196-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

31 Aug, 2017

5 commits

  • Pull libnvdimm fix from Dan Williams:
    "A single patch removing some structure definitions from a uapi header
    file. These payloads are never processed directly by the kernel; they
    are simply passed through an ioctl as opaque blobs to the ACPI _DSM
    (Device Specific Method) interface.

    Userspace should not be depending on the kernel to define these
    payloads. We will instead provide these definitions via the existing
    libndctl (https://github.com/pmem/ndctl) project that has NVDIMM
    command helpers and other definitions"

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm: clean up command definitions

    Linus Torvalds
     
  • Pull drm fixes from Dave Airlie:
    "Two fixes (a vmwgfx and core drm fix) in the queue for 4.13 final,
    hopefully that is it"

    * tag 'drm-fixes-for-v4.13-rc8' of git://people.freedesktop.org/~airlied/linux:
    drm/vmwgfx: Fix F26 Wayland screen update issue
    drm/bridge/sii8620: Fix memory corruption

    Linus Torvalds
     
  • Pull SCSI fixes from James Bottomley:
    "Three minor fixes: a NULL deref in qedf, an off by one in sg and a fix
    to IPR to prevent an error on initialisation"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    scsi: qedf: Fix a potential NULL pointer dereference
    scsi: sg: off by one in sg_ioctl()
    scsi: ipr: Set no_report_opcodes for RAID arrays

    Linus Torvalds
     
  • Pull UML fix from Richard Weinberger:
    "This contains a single fix for a regression which was introduced during
    the merge window"

    * 'for-linus-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
    um: Fix check for _xstate for older hosts

    Linus Torvalds
     
  • Pull alpha update from Matt Turner:
    "A few fixes and wires up some additional syscalls."

    [ Some of this is technically not really rc7 material, but it's alpha,
    and it all looks safe anyway. Matt explains: "My alpha has been
    offline, hence the very late-in-cycle pull request" and hasn't caused
    problems before, so he gets to slide. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha:
    alpha: uapi: Add support for __SANE_USERSPACE_TYPES__
    alpha: Define ioremap_wc
    alpha: Fix section mismatches
    alpha: support R_ALPHA_REFLONG relocations for module loading
    alpha: Fix typo in ev6-copy_user.S
    alpha: Package string routines together
    alpha: Update for new syscalls
    alpha: Fix build error without CONFIG_VGA_HOSE.

    Linus Torvalds
     

30 Aug, 2017

14 commits

  • Single vmwgfx fix.

    * 'drm-vmwgfx-fixes' of git://people.freedesktop.org/~syeh/repos_linux:
    drm/vmwgfx: Fix F26 Wayland screen update issue

    Dave Airlie
     
  • vmwgfx currently cannot support non-blocking commit because when
    vmw_*_crtc_page_flip is called, drm_atomic_nonblocking_commit()
    schedules the update on a thread. This means vmw_*_crtc_page_flip
    cannot rely on the new surface being bound before the subsequent
    dirty and flush operations happen.

    Cc: # 4.12.x

    Signed-off-by: Sinclair Yeh
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Charmaine Lee

    Sinclair Yeh
     
  • Driver Changes:
    - bridge/sii8620: Fix out-of-bounds write to incorrect register

    Cc: Maciej Purski
    Cc: Andrzej Hajda

    * tag 'drm-misc-fixes-2017-08-28' of git://anongit.freedesktop.org/git/drm-misc:
    drm/bridge/sii8620: Fix memory corruption

    Dave Airlie
     
  • This fixes compiler errors in perf such as:

    tests/attr.c: In function 'store_event':
    tests/attr.c:66:27: error: format '%llu' expects argument of type 'long long unsigned int', but argument 6 has type '__u64 {aka long unsigned int}' [-Werror=format=]
    snprintf(path, PATH_MAX, "%s/event-%d-%llu-%d", dir,
    ^

    Signed-off-by: Ben Hutchings
    Tested-by: Michael Cree
    Cc: stable@vger.kernel.org
    Signed-off-by: Matt Turner

    Ben Hutchings
     
  • Commit 3cc2dac5be3f ("drivers/video/fbdev/atyfb: Replace MTRR UC hole
    with strong UC") introduces calls to ioremap_wc and ioremap_uc. This
    causes build failures with alpha:allmodconfig. Map the missing functions
    to ioremap_nocache.

    Fixes: 3cc2dac5be3f ("drivers/video/fbdev/atyfb: Replace MTRR UC hole with strong UC")
    Cc: Paul Gortmaker
    Cc: Luis R. Rodriguez
    Signed-off-by: Guenter Roeck
    Signed-off-by: Matt Turner

    Guenter Roeck
     
  • Signed-off-by: Matt Turner

    Matt Turner
     
  • Since commit 71810db27c1c853b33 (modversions: treat symbol CRCs
    as 32 bit quantities) R_ALPHA_REFLONG relocations can be required
    to load modules. This implements it.

    Tested-by: Bob Tracy
    Reviewed-by: Richard Henderson
    Signed-off-by: Michael Cree
    Signed-off-by: Matt Turner

    Michael Cree
     
  • Patch 8525023121de4848b5f0a7d867ffeadbc477774d introduced a typo.

    That said, the identity AND insns added by that patch are more
    clearly written as MOV. At the same time, re-schedule the ev6
    version so that the first dispatch can execute in parallel.

    Signed-off-by: Richard Henderson
    Signed-off-by: Matt Turner

    Richard Henderson
     
  • There are direct branches between {str*cpy,str*cat} and stx*cpy.
    Ensure the branches are within range by merging these objects.

    Signed-off-by: Richard Henderson
    Signed-off-by: Matt Turner

    Richard Henderson
     
  • Signed-off-by: Richard Henderson
    Signed-off-by: Matt Turner

    Richard Henderson
     
  • pci_vga_hose is #defined to 0 in include/asm/vga.h if CONFIG_VGA_HOSE is
    not set.

    Signed-off-by: Matt Turner

    Matt Turner
     
  • Pull cgroup fix from Tejun Heo:
    "A late but obvious fix for cgroup.

    I broke the 'cpuset.memory_pressure' file a long time ago (v4.4) by
    accidentally deleting its file index, which made it a duplicate of the
    'cpuset.memory_migrate' file. Spotted and fixed by Waiman"

    * 'for-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: Fix incorrect memory_pressure control file mapping

    Linus Torvalds
     
  • Pull libata fixes from Tejun Heo:
    "Late fixes for libata. There's a minor platform driver fix but the
    important one is READ LOG PAGE.

    This is a new ATA command which is used to test some optional features
    but it broke probing of some devices - they locked up instead of
    failing the unknown command.

    Christoph tried blacklisting, but, after finding out there are
    multiple devices which fail this way, backed off to testing feature
    bit in IDENTIFY data first, which is a bit lossy (we can miss features
    on some devices) but should be a lot safer"

    * 'for-4.13-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
    Revert "libata: quirk read log on no-name M.2 SSD"
    libata: check for trusted computing in IDENTIFY DEVICE data
    libata: quirk read log on no-name M.2 SSD
    sata: ahci-da850: Fix some error handling paths in 'ahci_da850_probe()'

    Linus Torvalds
     
  • This reverts commit aac2fea94f7a3df8ad1eeb477eb2643f81fd5393.

    It turns out that that patch was complete and utter garbage, and broke
    KVM, resulting in odd oopses.

    Quoting Andrea Arcangeli:
    "The aforementioned commit has 3 bugs.

    1) mmu_notifier_invalidate_range cannot be used in replacement of
    mmu_notifier_invalidate_range_start/end.

    For KVM mmu_notifier_invalidate_range is a noop and rightfully so.

    A MMU notifier implementation has to implement either
    ->invalidate_range method or the invalidate_range_start/end
    methods, not both. And if you implement invalidate_range_start/end
    like KVM is forced to do, calling mmu_notifier_invalidate_range in
    common code is a noop for KVM.

    For those MMU notifiers that can get away only implementing
    ->invalidate_range, the ->invalidate_range is implicitly called by
    mmu_notifier_invalidate_range_end(). And only those secondary MMUs
    that share the same pagetable with the primary MMU (like AMD
    iommuv2) can get away only implementing ->invalidate_range.

    So all cases (THP on/off) are broken right now.

    To fix this it is enough to replace mmu_notifier_invalidate_range with
    mmu_notifier_invalidate_range_start;mmu_notifier_invalidate_range_end.
    Either that or call multiple mmu_notifier_invalidate_page like
    before.

    2) address + (1UL << compound_order(page)) is buggy, it should be
    PAGE_SIZE << compound_order(page), it's bytes not pages, 2M not
    512.

    3) The whole invalidate_range thing was an attempt to call a single
    invalidate while walking multiple 4k ptes that maps the same THP
    (after a pmd virtual split without physical compound page THP
    split).

    It's unclear if the rmap_walk will always provide an address that
    is 2M aligned as parameter to try_to_unmap_one, in presence of THP.
    I think it needs also an address &= (PAGE_SIZE <<
    compound_order(page)) - 1 to be safe"

    In general, we should stop making excuses for horrible MMU notifier
    users. It's much more important that the core VM is sane and safe, than
    letting MMU notifiers sleep.

    So if some MMU notifier is sleeping under a spinlock, we need to fix the
    notifier, not try to make excuses for that garbage in the core VM.

    Reported-and-tested-by: Bernhard Held
    Reported-and-tested-by: Adam Borowski
    Cc: Andrea Arcangeli
    Cc: Radim Krčmář
    Cc: Wanpeng Li
    Cc: Paolo Bonzini
    Cc: Takashi Iwai
    Cc: Nadav Amit
    Cc: Mike Galbraith
    Cc: Kirill A. Shutemov
    Cc: Jérôme Glisse
    Cc: axie
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
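
The unit mix-up in point 2 of the quoted analysis can be checked with a small userspace sketch. The helper names are hypothetical and 4 KiB base pages are assumed. The align-down helper uses the conventional `~(size - 1)` mask, which is an assumption about what point 3 intends, not code taken from the patch.

```c
#include <assert.h>

/* Assumed for illustration: 4 KiB base pages. */
#define PAGE_SHIFT 12UL
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Buggy form: adds a page count (512 for order 9) to a byte address. */
static unsigned long range_end_buggy(unsigned long address, unsigned int order)
{
    return address + (1UL << order);
}

/* Byte-correct form: adds the size of the compound page in bytes. */
static unsigned long range_end_bytes(unsigned long address, unsigned int order)
{
    return address + (PAGE_SIZE << order);
}

/* Hypothetical helper: round an address down to the compound page boundary. */
static unsigned long range_start_aligned(unsigned long address, unsigned int order)
{
    return address & ~((PAGE_SIZE << order) - 1);
}
```

For an order-9 compound page (a 2M THP with 4 KiB base pages), the buggy form covers only 512 bytes while the byte-correct form covers the full 2M.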
     

29 Aug, 2017

8 commits

  • This reverts commit 35f0b6a779b8b7a98faefd7c1c660b4dac9a5c26.

    We now conditionalize issuing of READ LOG PAGE on the TRUSTED
    COMPUTING SUPPORTED bit in the identity data and this shouldn't be
    necessary.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • ATA-8 and later mirrors the TRUSTED COMPUTING SUPPORTED bit in word 48 of
    the IDENTIFY DEVICE data. Check this before issuing a READ LOG PAGE
    command to avoid issues with buggy devices. The only downside is that
    we can't support Security Send / Receive for a device with an older
    revision due to the conflicting use of this field in earlier
    specifications.

    tj: The reason we need this is because some devices which don't
    support READ LOG PAGE lock up after getting issued that command.

    Signed-off-by: Christoph Hellwig
    Tested-by: David Ahern
    Signed-off-by: Tejun Heo

    Christoph Hellwig
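
A minimal sketch of the check described above, operating on a raw IDENTIFY DEVICE word array. The function and macro names are hypothetical, and two details are assumptions rather than facts stated in the commit message: that the major version bitmap lives in word 80, and that the supported flag is bit 0 of word 48.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ID_WORD_TRUSTED   48  /* word named in the commit message */
#define ID_WORD_MAJOR_VER 80  /* assumption: major version bitmap word */

/* Highest ATA major version advertised, or 0 if the word is not reported. */
static unsigned int id_major_version(const uint16_t *id)
{
    unsigned int mver;

    if (id[ID_WORD_MAJOR_VER] == 0xFFFF)
        return 0;
    for (mver = 14; mver >= 1; mver--)
        if (id[ID_WORD_MAJOR_VER] & (1u << mver))
            break;
    return mver;
}

/* Only trust word 48 on ATA-8 and later; older revisions used it differently. */
static bool id_has_trusted_computing(const uint16_t *id)
{
    if (id_major_version(id) <= 7)
        return false;
    return (id[ID_WORD_TRUSTED] & 1u) != 0;  /* assumption: bit 0 = supported */
}
```

A caller would issue READ LOG PAGE only when `id_has_trusted_computing()` returns true, which is the lossy but safer behaviour the message describes.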
     
  • Commit 3510ca20ece0 ("Minor page waitqueue cleanups") made the page
    queue code always add new waiters to the back of the queue, which helps
    upcoming patches to batch the wakeups for some horrid loads where the
    wait queues grow to thousands of entries.

    However, I forgot about the nasty add_page_wait_queue() special case
    code that is only used by the cachefiles code. That one still continued
    to add the new wait queue entries at the beginning of the list.

    Fix it, because any sane batched wakeup will require that we don't
    suddenly start getting new entries at the beginning of the list that we
    already handled in a previous batch.

    [ The current code always does the whole list while holding the lock, so
    wait queue ordering doesn't matter for correctness, but even then it's
    better to add later entries at the end from a fairness standpoint ]

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • When !NUMA, cpumask_of_node(@node) equals cpu_online_mask regardless of
    @node. The assumption seems to be that if !NUMA, there shouldn't be more than
    one node and thus reporting cpu_online_mask regardless of @node is
    correct. However, that assumption was broken years ago to support
    DISCONTIGMEM and whether a system has multiple nodes or not is
    separately controlled by NEED_MULTIPLE_NODES.

    This means that, on a system with !NUMA && NEED_MULTIPLE_NODES,
    cpumask_of_node() will report cpu_online_mask for all possible nodes,
    indicating that the CPUs are associated with multiple nodes which is an
    impossible configuration.

    This bug has been around forever but doesn't look like it has caused any
    noticeable symptoms. However, it triggers a WARN recently added to
    workqueue to verify NUMA affinity configuration.

    Fix it by reporting empty cpumask on non-zero nodes if !NUMA.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Geert Uytterhoeven
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Tejun Heo
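
The effect of the fix can be illustrated with a toy userspace model. The one-bit-per-CPU mask type and the choice of four online CPUs are assumptions made for the sketch; only the behaviour (online mask for node 0, empty mask for every other node) comes from the message above.

```c
#include <assert.h>
#include <stdint.h>

/* Toy cpumask: one bit per CPU. Not the kernel's real type. */
typedef uint64_t toy_cpumask_t;

static const toy_cpumask_t toy_cpu_online_mask = 0xf;  /* assume CPUs 0-3 online */
static const toy_cpumask_t toy_cpu_none_mask   = 0x0;

/* Old !NUMA behaviour: every node claims all online CPUs. */
static toy_cpumask_t cpumask_of_node_old(int node)
{
    (void)node;
    return toy_cpu_online_mask;
}

/* Fixed behaviour: only node 0 owns CPUs, other nodes report an empty mask. */
static toy_cpumask_t cpumask_of_node_fixed(int node)
{
    return node == 0 ? toy_cpu_online_mask : toy_cpu_none_mask;
}
```

With the old behaviour the masks of node 0 and node 1 overlap, which is the impossible configuration that trips the workqueue WARN. With the fixed behaviour they are disjoint.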
     
  • Recent commit a8ec3ee861b6 "arc: Mask individual IRQ lines during core
    INTC init" breaks interrupt handling on ARCv2 SMP systems.

    That commit masked all interrupts at onset, as some controllers on some
    boards (customer as well as internal), would assert interrupts early
    before any handlers were installed. For SMP systems, the masking was
    done at each cpu's core-intc. Later, when the IRQ was actually
    requested, it was unmasked, but only on the requesting cpu.

    For "common" interrupts, which were wired up from the 2nd level IDU
    intc, this was an issue as they needed to be enabled on ALL the cpus
    (given that IDU IRQs are by default served Round Robin across cpus)

    So fix that by NOT masking "common" interrupts at core-intc, but instead
    at the 2nd level IDU intc (latter already being done in idu_of_init())

    Fixes: a8ec3ee861b6 ("arc: Mask individual IRQ lines during core INTC init")
    Signed-off-by: Alexey Brodkin
    [vgupta: reworked changelog, removed the extraneous idu_irq_mask_raw()]
    Signed-off-by: Vineet Gupta
    Signed-off-by: Linus Torvalds

    Alexey Brodkin
     
  • Commit 464d62421cb8 ("select: switch compat_{get,put}_fd_set() to
    compat_{get,put}_bitmap()") changed the calculation on how many bytes
    need to be zeroed when userspace handed over a NULL pointer for a fdset
    array in the select syscall.

    The calculation was changed in compat_get_fd_set() wrongly from
    memset(fdset, 0, ((nr + 1) & ~1)*sizeof(compat_ulong_t));
    to
    memset(fdset, 0, ALIGN(nr, BITS_PER_LONG));

    The ALIGN(nr, BITS_PER_LONG) calculates the number of _bits_ which need
    to be zeroed in the target fdset array (rounded up to the next full bits
    for an unsigned long).

    But the memset() call expects the number of _bytes_ to be zeroed.

    This leads to clearing more memory than wanted (on the stack area or
    even at kmalloc()ed memory areas) and to random kernel crashes as we
    have seen them on the parisc platform.

    The correct change should have been

    memset(fdset, 0, (ALIGN(nr, BITS_PER_LONG) / BITS_PER_LONG) * BYTES_PER_LONG);

    which is the same as can be achieved with a call to

    zero_fd_set(nr, fdset).

    Fixes: 464d62421cb8 ("select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()")
    Acked-by: Al Viro
    Signed-off-by: Helge Deller
    Signed-off-by: Linus Torvalds

    Helge Deller
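
The bits-versus-bytes mistake can be reproduced in userspace. This is a sketch with stand-in macros and hypothetical helper names, not the kernel code: it only compares the length the buggy memset() received with the length it should have received.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel macros, for illustration only. */
#define BITS_PER_LONG  (sizeof(unsigned long) * CHAR_BIT)
#define BYTES_PER_LONG (sizeof(unsigned long))
#define ALIGN_UP(x, a) ((((x) + (a) - 1) / (a)) * (a))

/* Length passed by the buggy code: a bit count used as a byte count. */
static size_t buggy_clear_len(size_t nr)
{
    return ALIGN_UP(nr, BITS_PER_LONG);
}

/* Length that covers nr bits: whole longs, expressed in bytes. */
static size_t correct_clear_len(size_t nr)
{
    return (ALIGN_UP(nr, BITS_PER_LONG) / BITS_PER_LONG) * BYTES_PER_LONG;
}
```

For any nr the buggy length is CHAR_BIT times the correct one, so with 8-bit bytes the memset() cleared eight times as much memory as intended.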
     
  • Pull c6x tweaks from Mark Salter.

    * tag 'for-linus' of git://linux-c6x.org/git/projects/linux-c6x-upstreaming:
    c6x: Convert to using %pOF instead of full_name
    c6x: defconfig: Cleanup from old Kconfig options

    Linus Torvalds
     
  • Ido reported that reading the log page on his systems fails,
    so quirk it as it won't support ZBC or security protocols.

    Signed-off-by: Christoph Hellwig
    Reported-by: Ido Schimmel
    Tested-by: Ido Schimmel
    Signed-off-by: Tejun Heo

    Christoph Hellwig
     

28 Aug, 2017

9 commits

  • Remove the command payloads that do not have an associated libnvdimm
    ioctl. I.e. remove the payloads that would only ever be carried in the
    ND_CMD_CALL envelope. This prevents userspace from growing unnecessary
    dependencies on this kernel header when userspace already has everything
    it needs to craft and send these commands.

    Cc: Jerry Hoemann
    Reported-by: Yasunori Goto
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Linus Torvalds
     
  • Pull IOMMU fix from Joerg Roedel:
    "Another fix, this time in common IOMMU sysfs code.

    In the conversion from the old iommu sysfs-code to the
    iommu_device_register interface, I forgot to update the release path
    for the struct device associated with an IOMMU. It freed the 'struct
    device', which was a pointer before, but is now embedded in another
    struct.

    Freeing from the middle of allocated memory had all kinds of nasty
    side effects when an IOMMU was unplugged. Unfortunately nobody
    unplugged an IOMMU until now, so this was not discovered earlier. The
    fix is to make the 'struct device' a pointer again"

    * tag 'iommu-fixes-v4.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
    iommu: Fix wrong freeing of iommu_device->dev

    Linus Torvalds
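
The difference between the two layouts can be sketched with toy types. All struct and function names here are hypothetical and do not match the kernel's definitions; the point is only that an embedded member sits at a non-zero offset inside its parent allocation, while a pointer member refers to an allocation that a release callback can free on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy types; not the kernel's real struct layouts. */
struct toy_device { int id; };

/* Embedded layout: &iommu->dev points into the middle of the allocation. */
struct iommu_embedded { void *ops; struct toy_device dev; };

/* Pointer layout: dev is its own allocation. */
struct iommu_pointer { void *ops; struct toy_device *dev; };

static size_t embedded_dev_offset(void)
{
    return offsetof(struct iommu_embedded, dev);
}

static struct iommu_pointer *iommu_alloc(int id)
{
    struct iommu_pointer *iommu = calloc(1, sizeof(*iommu));

    if (!iommu)
        return NULL;
    iommu->dev = calloc(1, sizeof(*iommu->dev));
    if (!iommu->dev) {
        free(iommu);
        return NULL;
    }
    iommu->dev->id = id;
    return iommu;
}

/* Release callback: valid only because dev was allocated separately. */
static void device_release(struct toy_device *dev)
{
    free(dev);
}

static void iommu_release(struct iommu_pointer *iommu)
{
    device_release(iommu->dev);
    free(iommu);
}
```

Calling a release function like `device_release()` on the address of the embedded member would hand the allocator a pointer it never returned, which is the kind of failure the pull request describes.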
     
  • Pull char/misc fix from Greg KH:
    "Here is a single misc driver fix for 4.13-rc7. It resolves a reported
    problem in the Android binder driver due to previous patches in
    4.13-rc.

    It's been in linux-next with no reported issues"

    * tag 'char-misc-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
    ANDROID: binder: fix proc->tsk check.

    Linus Torvalds
     
  • Pull staging/iio fixes from Greg KH:
    "Here are a few small staging driver fixes, and some more IIO driver
    fixes for 4.13-rc7. Nothing major, just resolutions for some reported
    problems.

    All of these have been in linux-next with no reported problems"

    * tag 'staging-4.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
    iio: magnetometer: st_magn: remove ihl property for LSM303AGR
    iio: magnetometer: st_magn: fix status register address for LSM303AGR
    iio: hid-sensor-trigger: Fix the race with user space powering up sensors
    iio: trigger: stm32-timer: fix get trigger mode
    iio: imu: adis16480: Fix acceleration scale factor for adis16480
    [PATCH] iio: Fix some documentation warnings
    staging: rtl8188eu: add RNX-N150NUB support
    Revert "staging: fsl-mc: be consistent when checking strcmp() return"
    iio: adc: stm32: fix common clock rate
    iio: adc: ina219: Avoid underflow for sleeping time
    iio: trigger: stm32-timer: add enable attribute
    iio: trigger: stm32-timer: fix get/set down count direction
    iio: trigger: stm32-timer: fix write_raw return value
    iio: trigger: stm32-timer: fix quadrature mode get routine
    iio: bmp280: properly initialize device for humidity reading

    Linus Torvalds
     
  • Pull NTB fixes from Jon Mason:
    "NTB bug fixes to address an incorrect ntb_mw_count reference in the
    NTB transport, improperly bringing down the link if SPADs are
    corrupted, and an out-of-order issue regarding link negotiation and
    data passing"

    * tag 'ntb-4.13-bugfixes' of git://github.com/jonmason/ntb:
    ntb: ntb_test: ensure the link is up before trying to configure the mws
    ntb: transport shouldn't disable link due to bogus values in SPADs
    ntb: use correct mw_count function in ntb_tool and ntb_transport

    Linus Torvalds
     
  • The "lock_page_killable()" function waits for exclusive access to the
    page lock bit using the WQ_FLAG_EXCLUSIVE bit in the waitqueue entry
    set.

    That means that if it gets woken up, other waiters may have been
    skipped.

    That, in turn, means that if it sees the page being unlocked, it *must*
    take that lock and return success, even if a lethal signal is also
    pending.

    So instead of checking for lethal signals first, we need to check for
    them after we've checked the actual bit that we were waiting for. Even
    if that might then delay the killing of the process.

    This matches the order of the old "wait_on_bit_lock()" infrastructure
    that the page locking used to use (and is still used in a few other
    areas).

    Note that if we still return an error after having unsuccessfully tried
    to acquire the page lock, that is ok: that means that some other thread
    was able to get ahead of us and lock the page, and when that other
    thread then unlocks the page, the wakeup event will be repeated. So any
    other pending waiters will now get properly woken up.

    Fixes: 62906027091f ("mm: add PageWaiters indicating tasks are waiting for a page bit")
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Cc: Jan Kara
    Cc: Davidlohr Bueso
    Cc: Andi Kleen
    Signed-off-by: Linus Torvalds

    Linus Torvalds
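
The ordering rule can be modelled as a single decision made each time an exclusive waiter wakes up. This is a simplified sketch with hypothetical names: the real code uses atomic bit operations on the page flags, while a plain bool stands in for the lock bit here.

```c
#include <assert.h>
#include <stdbool.h>

enum wait_result { WAIT_ACQUIRED, WAIT_INTERRUPTED, WAIT_AGAIN };

/*
 * Decision taken by an exclusive waiter after a wakeup.
 * The lock bit is tried first; the fatal signal is looked at only
 * if the lock could not be taken.
 */
static enum wait_result after_wakeup(bool *page_locked, bool fatal_signal_pending)
{
    if (!*page_locked) {
        *page_locked = true;       /* we own the page lock now */
        return WAIT_ACQUIRED;
    }
    if (fatal_signal_pending)
        return WAIT_INTERRUPTED;   /* someone else holds the lock; their unlock repeats the wakeup */
    return WAIT_AGAIN;
}
```

If the two checks were swapped, a waiter with a pending fatal signal would return an error without taking a free lock, and the exclusive wakeup it consumed would be lost to the other waiters.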
     
  • Tim Chen and Kan Liang have been battling a customer load that shows
    extremely long page wakeup lists. The cause seems to be constant NUMA
    migration of a hot page that is shared across a lot of threads, but the
    actual root cause for the exact behavior has not been found.

    Tim has a patch that batches the wait list traversal at wakeup time, so
    that we at least don't get long uninterruptible cases where we traverse
    and wake up thousands of processes and get nasty latency spikes. That
    is likely 4.14 material, but we're still discussing the page waitqueue
    specific parts of it.

    In the meantime, I've tried to look at making the page wait queues less
    expensive, and failing miserably. If you have thousands of threads
    waiting for the same page, it will be painful. We'll need to try to
    figure out the NUMA balancing issue some day, in addition to avoiding
    the excessive spinlock hold times.

    That said, having tried to rewrite the page wait queues, I can at least
    fix up some of the braindamage in the current situation. In particular:

    (a) we don't want to continue walking the page wait list if the bit
    we're waiting for already got set again (which seems to be one of
    the patterns of the bad load). That makes no progress and just
    causes pointless cache pollution chasing the pointers.

    (b) we don't want to put the non-locking waiters always on the front of
    the queue, and the locking waiters always on the back. Not only is
    that unfair, it means that we wake up thousands of reading threads
    that will just end up being blocked by the writer later anyway.

    Also add a comment about the layout of 'struct wait_page_key' - there is
    an external user of it in the cachefiles code that means that it has to
    match the layout of 'struct wait_bit_key' in the two first members. It
    so happens to match, because 'struct page *' and 'unsigned long *' end
    up having the same values simply because the page flags are the first
    member in struct page.

    Cc: Tim Chen
    Cc: Kan Liang
    Cc: Mel Gorman
    Cc: Christopher Lameter
    Cc: Andi Kleen
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • We have a MAX_LFS_FILESIZE macro that is meant to be filled in by
    filesystems (and other IO targets) that know they are 64-bit clean and
    don't have any 32-bit limits in their IO path.

    It turns out that our 32-bit value for that limit was bogus. On 32-bit,
    the VM layer is limited by the page cache to only 32-bit index values,
    but our logic for that was confusing and actually wrong. We used to
    define that value to

    (((loff_t)PAGE_SIZE << (BITS_PER_LONG-1))-1)

    which is actually odd in several ways: it limits the index to 31 bits,
    and then it limits files so that they can't have data in that last byte
    of a page that has the highest 31-bit index (ie page index 0x7fffffff).

    Neither of those limitations makes sense. The index is actually the full
    32 bit unsigned value, and we can use that whole full page. So the
    maximum size of the file would logically be "PAGE_SIZE << BITS_PER_LONG".

    However, we do want to avoid the maximum index, because we have code
    that iterates over the page indexes, and we don't want that code to
    overflow. So the maximum size of a file on a 32-bit host should
    actually be one page less than the full 32-bit index.

    So the actual limit is ULONG_MAX << PAGE_SHIFT. That means that we will
    not actually be using the page of that last index (ULONG_MAX), but we
    can grow a file up to that limit.

    The wrong value of MAX_LFS_FILESIZE actually caused problems for Doug
    Nazar, who was still using a 32-bit host, but with a 9.7TB 2 x RAID5
    volume. It turns out that our old MAX_LFS_FILESIZE was 8TiB (well, one
    byte less), but the actual true VM limit is one page less than 16TiB.

    This was invisible until commit c2a9737f45e2 ("vfs,mm: fix a dead loop
    in truncate_inode_pages_range()"), which started applying that
    MAX_LFS_FILESIZE limit to block devices too.

    NOTE! On 64-bit, the page index isn't a limiter at all, and the limit is
    actually just the offset type itself (loff_t), which is signed. But for
    clarity, on 64-bit, just use the maximum signed value, and don't make
    people have to count the number of 'f' characters in the hex constant.

    So just use LLONG_MAX for the 64-bit case. That was what the value had
    been before too, just written out as a hex constant.

    Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
    Reported-and-tested-by: Doug Nazar
    Cc: Andreas Dilger
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Dave Kleikamp
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
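
The 8TiB and 16TiB figures can be checked with 64-bit arithmetic. The sketch below models a 32-bit host with 4 KiB pages using fixed-width types; the macro and function names carry a suffix to make clear they are stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed 32-bit host parameters: 4 KiB pages, 32-bit unsigned long. */
#define PAGE_SHIFT_32    12
#define PAGE_SIZE_32     (UINT64_C(1) << PAGE_SHIFT_32)
#define BITS_PER_LONG_32 32
#define ULONG_MAX_32     UINT64_C(0xffffffff)

/* Old definition: 31-bit index, minus the last byte. */
static uint64_t old_max_lfs_filesize(void)
{
    return (PAGE_SIZE_32 << (BITS_PER_LONG_32 - 1)) - 1;
}

/* New definition: every index except the last one (ULONG_MAX) is usable. */
static uint64_t new_max_lfs_filesize(void)
{
    return ULONG_MAX_32 << PAGE_SHIFT_32;
}
```

Under these assumptions the old limit is 2^43 - 1 bytes (one byte less than 8TiB) and the new limit is 2^44 - 4096 bytes (one page less than 16TiB), matching the numbers in the message.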
     

27 Aug, 2017

3 commits