04 Feb, 2016

5 commits

  • Merge fixes from Andrew Morton:
    "18 fixes"

    [ The 18 fixes turned into 17 commits, because one of the fixes was a
    fix for another patch in the series that I just folded in by editing
    the patch manually - hopefully correctly - Linus ]

    * emailed patches from Andrew Morton:
    mm: fix memory leak in copy_huge_pmd()
    drivers/hwspinlock: fix race between radix tree insertion and lookup
    radix-tree: fix race in gang lookup
    mm/vmpressure.c: fix subtree pressure detection
    mm: polish virtual memory accounting
    mm: warn about VmData over RLIMIT_DATA
    Documentation: cgroup-v2: add memory.stat::sock description
    mm: memcontrol: drop superfluous entry in the per-memcg stats array
    drivers/scsi/sg.c: mark VMA as VM_IO to prevent migration
    proc: revert /proc/<pid>/maps [stack:TID] annotation
    numa: fix /proc/<pid>/numa_maps for hugetlbfs on s390
    MAINTAINERS: update Seth email
    ocfs2/cluster: fix memory leak in o2hb_region_release
    lib/test-string_helpers.c: fix and improve string_get_size() tests
    thp: limit number of object to scan on deferred_split_scan()
    thp: change deferred_split_count() to return number of THP in queue
    thp: make split_queue per-node

    Linus Torvalds
     
  • Pull NFS client bugfix and cleanup from Trond Myklebust:
    "Bugfix:
    - pNFS: Fix for missing layoutreturn calls

    Cleanup:
    - pNFS: rename NFS_LAYOUT_RETURN_BEFORE_CLOSE for code clarity"

    * tag 'nfs-for-4.5-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFS: Cleanup - rename NFS_LAYOUT_RETURN_BEFORE_CLOSE
    pNFS: Fix missing layoutreturn calls

    Linus Torvalds
     
  • Commit b76437579d13 ("procfs: mark thread stack correctly in
    proc/<pid>/maps") added [stack:TID] annotation to /proc/<pid>/maps.

    Finding the task of a stack VMA requires walking the entire thread list,
    turning this into quadratic behavior: a thousand threads means a
    thousand stacks, so the rendering of /proc/<pid>/maps needs to look at a
    million combinations.

    The cost is not in proportion to the usefulness as described in the
    patch.

    Drop the [stack:TID] annotation to make /proc/<pid>/maps (and
    /proc/<pid>/numa_maps) usable again for higher thread counts.

    The [stack] annotation inside /proc/<pid>/task/<tid>/maps is retained, as
    identifying the stack VMA there is an O(1) operation.
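
    The cost argument above can be illustrated with a small userspace
    model. This is only a sketch: thread_model, owner_of_vma() and
    render_all_steps() are hypothetical names, not kernel code, and each
    thread is assumed to own exactly one stack VMA identified by an integer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: one stack VMA per thread, identified by an integer. */
struct thread_model {
    int tid;
    int stack_vma;
};

/* Process-wide view: find the owner of a VMA by walking the thread list.
 * Every comparison is counted in *steps. Returns the tid, or -1 if no
 * thread owns the VMA. */
int owner_of_vma(const struct thread_model *threads, size_t n, int vma,
                 unsigned long *steps)
{
    assert(threads != NULL && steps != NULL);
    for (size_t i = 0; i < n; i++) {
        (*steps)++;
        if (threads[i].stack_vma == vma)
            return threads[i].tid;
    }
    return -1;
}

/* Rendering the whole map needs one owner lookup per stack VMA, so the
 * number of comparisons grows quadratically with the thread count. */
unsigned long render_all_steps(const struct thread_model *threads, size_t n)
{
    unsigned long steps = 0;
    for (size_t v = 0; v < n; v++)
        (void)owner_of_vma(threads, n, threads[v].stack_vma, &steps);
    return steps;
}

/* Per-thread view: the thread is already known, so no search is needed. */
int own_stack_vma(const struct thread_model *t)
{
    assert(t != NULL);
    return t->stack_vma;
}
```

    With the early exit in this model, n threads cost n(n+1)/2 comparisons
    (about half a million for a thousand threads), which is still quadratic;
    the per-thread lookup stays constant.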

    Siddhesh said:
    "The end users needed a way to identify thread stacks programmatically and
    there wasn't a way to do that. I'm afraid I no longer remember (or have
    access to the resources that would aid my memory since I changed
    employers) the details of their requirement. However, I did do this on my
    own time because I thought it was an interesting project for me and nobody
    really gave any feedback then as to its utility, so as far as I am
    concerned you could roll back the main thread maps information since the
    information is available in the thread-specific files"

    Signed-off-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Siddhesh Poyarekar
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When working with hugetlbfs ptes (which are actually pmds), it is not
    valid to directly use pte functions like pte_present() because the hardware
    bit layout of pmds and ptes can be different. This is the case on s390.
    Therefore we have to convert the hugetlbfs ptes first into a valid pte
    encoding with huge_ptep_get().

    Currently the /proc/<pid>/numa_maps code uses hugetlbfs ptes without
    huge_ptep_get(). On s390 this leads to the following two problems:

    1) The pte_present() function returns false (instead of true) for
    PROT_NONE hugetlb ptes. Therefore PROT_NONE vmas are missing
    completely in the "numa_maps" output.

    2) The pte_dirty() function always returns false for all hugetlb ptes.
    Therefore these pages are reported as "mapped=xxx" instead of
    "dirty=xxx".

    Therefore use huge_ptep_get() to correctly convert the hugetlb ptes.

    Signed-off-by: Michael Holzheu
    Reviewed-by: Gerald Schaefer
    Cc: [4.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Holzheu
     
  • o2hb_region_release() currently doesn't free the o2hb_debug_buf buffers
    hr_db_elapsed_time and hr_db_pinned allocated in o2hb_debug_create(). Also,
    we should call debugfs_remove() before freeing its data, to avoid the risk
    of accessing debugfs right after its data has been freed.

    Signed-off-by: Joseph Qi
    Reviewed-by: Jiufei Xue
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
     

02 Feb, 2016

2 commits

  • Pull networking fixes from David Miller:
    "This looks like a lot but it's a mixture of regression fixes as well
    as fixes for longer standing issues.

    1) Fix on-channel cancellation in mac80211, from Johannes Berg.

    2) Handle CHECKSUM_COMPLETE properly in xt_TCPMSS netfilter xtables
    module, from Eric Dumazet.

    3) Avoid infinite loop in UDP SO_REUSEPORT logic, also from Eric
    Dumazet.

    4) Avoid a NULL deref if we try to set SO_REUSEPORT after a socket is
    bound, from Craig Gallek.

    5) GRO key comparisons don't take lightweight tunnels into account,
    from Jesse Gross.

    6) Fix struct pid leak via SCM credentials in AF_UNIX, from Eric
    Dumazet.

    7) We need to set the rtnl_link_ops of ipv6 SIT tunnels before we
    register them, otherwise the NEWLINK netlink message is missing
    the proper attributes. From Thadeu Lima de Souza Cascardo.

    8) Several Spectrum chip bug fixes for mlxsw switch driver, from Ido
    Schimmel

    9) Handle fragments properly in ipv4 early socket demux, from Eric
    Dumazet.

    10) Don't ignore the ifindex key specifier on ipv6 output route
    lookups, from Paolo Abeni"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (128 commits)
    tcp: avoid cwnd undo after receiving ECN
    irda: fix a potential use-after-free in ircomm_param_request
    net: tg3: avoid uninitialized variable warning
    net: nb8800: avoid uninitialized variable warning
    net: vxge: avoid unused function warnings
    net: bgmac: clarify CONFIG_BCMA dependency
    net: hp100: remove unnecessary #ifdefs
    net: davinci_cpdma: use dma_addr_t for DMA address
    ipv6/udp: use sticky pktinfo egress ifindex on connect()
    ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail()
    netlink: not trim skb for mmaped socket when dump
    vxlan: fix a out of bounds access in __vxlan_find_mac
    net: dsa: mv88e6xxx: fix port VLAN maps
    fib_trie: Fix shift by 32 in fib_table_lookup
    net: moxart: use correct accessors for DMA memory
    ipv4: ipconfig: avoid unused ic_proto_used symbol
    bnxt_en: Fix crash in bnxt_free_tx_skbs() during tx timeout.
    bnxt_en: Exclude rx_drop_pkts hw counter from the stack's rx_dropped counter.
    bnxt_en: Ring free response from close path should use completion ring
    net_sched: drr: check for NULL pointer in drr_dequeue
    ...

    Linus Torvalds
     
  • Pull libnvdimm fixes from Dan Williams:
    "1/ Fixes to the libnvdimm 'pfn' device that establishes a reserved
    area for storing a struct page array.

    2/ Fixes for dax operations on a raw block device to prevent pagecache
    collisions with dax mappings.

    3/ A fix for pfn_t usage in vm_insert_mixed that led to a null
    pointer de-reference.

    These have received build success notification from the kbuild robot
    across 153 configs and pass the latest ndctl tests"

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    phys_to_pfn_t: use phys_addr_t
    mm: fix pfn_t to page conversion in vm_insert_mixed
    block: use DAX for partition table reads
    block: revert runtime dax control of the raw block device
    fs, block: force direct-I/O for dax-enabled block devices
    devm_memremap_pages: fix vmem_altmap lifetime + alignment handling
    libnvdimm, pfn: fix restoring memmap location
    libnvdimm: fix mode determination for e820 devices

    Linus Torvalds
     

01 Feb, 2016

1 commit

  • Pull timer fixes from Thomas Gleixner:
    "The timer department delivers:

    - a regression fix for the NTP code along with a proper selftest
    - prevent a spurious timer interrupt in the NOHZ lowres code
    - a fix for user space interfaces returning the remaining time on
    architectures with CONFIG_TIME_LOW_RES=y
    - a few patches to fix COMPILE_TEST fallout"

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    tick/nohz: Set the correct expiry when switching to nohz/lowres mode
    clocksource: Fix dependencies for archs w/o HAS_IOMEM
    clocksource: Select CLKSRC_MMIO where needed
    tick/sched: Hide unused oneshot timer code
    kselftests: timers: Add adjtimex SETOFFSET validity tests
    ntp: Fix ADJ_SETOFFSET being used w/ ADJ_NANO
    itimers: Handle relative timers with CONFIG_TIME_LOW_RES proper
    posix-timers: Handle relative timers with CONFIG_TIME_LOW_RES proper
    timerfd: Handle relative timers with CONFIG_TIME_LOW_RES proper
    hrtimer: Handle remaining time proper for TIME_LOW_RES
    clockevents/tcb_clksrc: Prevent disabling an already disabled clock

    Linus Torvalds
     

31 Jan, 2016

3 commits

  • Johan Hedberg says:

    ====================
    pull request: bluetooth 2016-01-30

    Here's a set of important Bluetooth fixes for the 4.5 kernel:

    - Two fixes to 6LoWPAN code (one fixing a potential crash)
    - Fix LE pairing with devices using both public and random addresses
    - Fix allocation of dynamic LE PSM values
    - Fix missing COMPATIBLE_IOCTL for UART line discipline

    Please let me know if there are any issues pulling. Thanks.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Avoid populating pagecache when the block device is in DAX mode.
    Otherwise these page cache entries collide with the fsync/msync
    implementation and break data durability guarantees.

    Cc: Jan Kara
    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Andrew Morton
    Reported-by: Ross Zwisler
    Tested-by: Ross Zwisler
    Reviewed-by: Matthew Wilcox
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Dynamically enabling DAX requires that the page cache first be flushed
    and invalidated. This must occur atomically with the change of DAX mode
    otherwise we confuse the fsync/msync tracking and violate data
    durability guarantees. Eliminate the possibility of DAX-disabled to
    DAX-enabled transitions for now and revisit this for the next cycle.

    Cc: Jan Kara
    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Matthew Wilcox
    Cc: Andrew Morton
    Cc: Ross Zwisler
    Signed-off-by: Dan Williams

    Dan Williams
     

30 Jan, 2016

2 commits

  • Pull btrfs fixes from Chris Mason:
    "Dave had a small collection of fixes to the new free space tree code,
    one of which was keeping our sysfs files more up to date with feature
    bits as different things get enabled (lzo, raid5/6, etc).

    I should have kept the sysfs stuff for rc3, since we always manage to
    trip over something. This time it was GFP_KERNEL from somewhere that
    is NOFS only. Instead of rebasing it out I've put a revert in, and
    we'll fix it properly for rc3.

    Otherwise, Filipe fixed a btrfs DIO race and Qu Wenruo fixed up a
    use-after-free in our tracepoints that Dave Jones reported"

    * 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Revert "btrfs: synchronize incompat feature bits with sysfs files"
    btrfs: don't use GFP_HIGHMEM for free-space-tree bitmap kzalloc
    btrfs: sysfs: check initialization state before updating features
    Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()"
    btrfs: async-thread: Fix a use-after-free error for trace
    Btrfs: fix race between fsync and lockless direct IO writes
    btrfs: add free space tree to the cow-only list
    btrfs: add free space tree to lockdep classes
    btrfs: tweak free space tree bitmap allocation
    btrfs: tests: switch to GFP_KERNEL
    btrfs: synchronize incompat feature bits with sysfs files
    btrfs: sysfs: introduce helper for syncing bits with sysfs files
    btrfs: sysfs: add free-space-tree bit attribute
    btrfs: sysfs: fix typo in compat_ro attribute definition

    Linus Torvalds
     
  • This reverts commit 14e46e04958df740c6c6a94849f176159a333f13.

    This ends up doing sysfs operations from deep in balance (where we
    should be GFP_NOFS) and under heavy balance load, we're making races
    against sysfs internals.

    Revert it for now while we figure things out.

    Signed-off-by: Chris Mason

    Chris Mason
     

28 Jan, 2016

1 commit


27 Jan, 2016

5 commits


26 Jan, 2016

4 commits

  • This reverts commit 696249132158014d594896df3a81390616069c5c. The
    cleaner thread can block freezing when there's a snapshot cleaning in
    progress and the other threads get suspended first. From the logs
    provided by Martin we're waiting for reading extent pages:

    kernel: PM: Syncing filesystems ... done.
    kernel: Freezing user space processes ... (elapsed 0.015 seconds) done.
    kernel: Freezing remaining freezable tasks ...
    kernel: Freezing of tasks failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0):
    kernel: btrfs-cleaner D ffff88033dd13bc0 0 152 2 0x00000000
    kernel: ffff88032ebc2e00 ffff88032e750000 ffff88032e74fa50 7fffffffffffffff
    kernel: ffffffff814a58df 0000000000000002 ffffea000934d580 ffffffff814a5451
    kernel: 7fffffffffffffff ffffffff814a6e8f 0000000000000000 0000000000000020
    kernel: Call Trace:
    kernel: [] ? bit_wait+0x2c/0x2c
    kernel: [] ? schedule+0x6f/0x7c
    kernel: [] ? schedule_timeout+0x2f/0xd8
    kernel: [] ? timekeeping_get_ns+0xa/0x2e
    kernel: [] ? ktime_get+0x36/0x44
    kernel: [] ? io_schedule_timeout+0x94/0xf2
    kernel: [] ? io_schedule_timeout+0x94/0xf2
    kernel: [] ? bit_wait_io+0x2c/0x30
    kernel: [] ? __wait_on_bit+0x41/0x73
    kernel: [] ? wait_on_page_bit+0x6d/0x72
    kernel: [] ? autoremove_wake_function+0x2a/0x2a
    kernel: [] ? read_extent_buffer_pages+0x1bd/0x203
    kernel: [] ? free_root_pointers+0x4c/0x4c
    kernel: [] ? btree_read_extent_buffer_pages.constprop.57+0x5a/0xe9
    kernel: [] ? read_tree_block+0x2d/0x45
    kernel: [] ? read_block_for_search.isra.34+0x22a/0x26b
    kernel: [] ? btrfs_set_path_blocking+0x1e/0x4a
    kernel: [] ? btrfs_search_slot+0x648/0x736
    kernel: [] ? btrfs_lookup_extent_info+0xb7/0x2c7
    kernel: [] ? walk_down_proc+0x9c/0x1ae
    kernel: [] ? walk_down_tree+0x40/0xa4
    kernel: [] ? btrfs_drop_snapshot+0x2da/0x664
    kernel: [] ? finish_task_switch+0x126/0x167
    kernel: [] ? btrfs_clean_one_deleted_snapshot+0xa6/0xb0
    kernel: [] ? cleaner_kthread+0x13e/0x17b
    kernel: [] ? btrfs_item_end+0x33/0x33
    kernel: [] ? kthread+0x95/0x9d
    kernel: [] ? kthread_parkme+0x16/0x16
    kernel: [] ? ret_from_fork+0x3f/0x70
    kernel: [] ? kthread_parkme+0x16/0x16

    As this affects a released kernel (4.4) we need a minimal fix for
    stable kernels.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=108361
    Reported-by: Martin Ziegler
    CC: stable@vger.kernel.org # 4.4
    CC: Jiri Kosina
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     
  • The parameter of trace_btrfs_work_queued() can be freed in its workqueue,
    so no one can use that pointer after queue_work().

    Fix the use-after-free bug by moving the trace line before queue_work().
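
    The ordering rule can be sketched in plain userspace C. Everything
    here is a hypothetical stand-in (work_model, queue_work_model() and
    trace_work_queued_model() are not the btrfs or workqueue APIs), and the
    model assumes the worker runs and frees the item as soon as it is queued.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical work item; the worker is allowed to free it. */
struct work_model {
    int id;
    void (*fn)(struct work_model *w);
};

/* Last id seen by the stand-in tracepoint. */
int last_traced_id = -1;

void trace_work_queued_model(const struct work_model *w)
{
    last_traced_id = w->id;
}

/* A worker that releases its own work item when it is done. */
void free_self(struct work_model *w)
{
    free(w);
}

/* Stand-in for queue_work(): in this model the work runs immediately,
 * so the item may already be freed when this function returns. */
void queue_work_model(struct work_model *w)
{
    assert(w != NULL && w->fn != NULL);
    w->fn(w);
}

/* Safe ordering: trace while the pointer is known to be valid, then queue. */
int submit_work(int id)
{
    struct work_model *w = malloc(sizeof(*w));
    if (w == NULL)
        return -1;
    w->id = id;
    w->fn = free_self;
    trace_work_queued_model(w);
    queue_work_model(w);    /* w must not be dereferenced after this line */
    return 0;
}
```

    Swapping the last two calls in submit_work() would read w->id from
    freed memory in this model, which mirrors the reported bug.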

    Reported-by: Dave Jones
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: Chris Mason

    Qu Wenruo
     
  • An fsync, using the fast path, can race with a concurrent lockless direct
    IO write and end up logging a file extent item that points to an extent
    that wasn't written to yet. This is because the fast fsync path collects
    ordered extents into a local list and then collects all the new extent
    maps to log file extent items based on them, while the direct IO write
    path creates the new extent map before it creates the corresponding
    ordered extent (and submits the respective bio(s)).

    So fix this by making the direct IO write path create ordered extents
    before the extent maps and make the fast fsync path collect any new
    ordered extents after it collects the extent maps.
    Note that making the fsync handler call inode_dio_wait() (after acquiring
    the inode's i_mutex) would not work and lead to a deadlock when doing
    AIO, as through AIO we end up in a path where the fsync handler is called
    (through dio_aio_complete_work() -> dio_complete() -> vfs_fsync_range())
    before the inode's dio counter is decremented (inode_dio_wait() waits
    for this counter to have a value of zero).

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • …ave/linux into for-linus-4.5

    Signed-off-by: Chris Mason <clm@fb.com>

    Chris Mason
     

25 Jan, 2016

5 commits

  • Signed-off-by: David Sterba

    David Sterba
     
  • Signed-off-by: David Sterba

    David Sterba
     
  • Pull MIPS updates from Ralf Baechle:
    "This is the main pull request for MIPS for 4.5 plus some 4.4 fixes.

    The executive summary:

    - ATH79 platform improvements, use DT bindings for the ATH79 USB PHY.
    - Avoid useless rebuilds for zboot.
    - jz4780: Add NEMC, BCH and NAND device tree nodes
    - Initial support for the MicroChip's DT platform. As all the device
    drivers are missing this is still of limited use.
    - Some Loongson3 cleanups.
    - The unavoidable whitespace polishing.
    - Reduce clock skew when synchronizing the CPU cycle counters on CPU
    startup.
    - Add MIPS R6 fixes.
    - Lots of cleanups across arch/mips as fallout from KVM.
    - Lots of minor fixes and changes for IEEE 754-2008 support to the
    FPU emulator / fp-assist software.
    - Minor Ralink, BCM47xx and bcm963xx platform support improvements.
    - Support SMP on BCM63168"

    * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (84 commits)
    MIPS: zboot: Add support for serial debug using the PROM
    MIPS: zboot: Avoid useless rebuilds
    MIPS: BMIPS: Enable ARCH_WANT_OPTIONAL_GPIOLIB
    MIPS: bcm63xx: nvram: Remove unused bcm63xx_nvram_get_psi_size() function
    MIPS: bcm963xx: Update bcm_tag field image_sequence
    MIPS: bcm963xx: Move extended flash address to bcm_tag header file
    MIPS: bcm963xx: Move Broadcom BCM963xx image tag data structure
    MIPS: bcm63xx: nvram: Use nvram structure definition from header file
    MIPS: bcm963xx: Add Broadcom BCM963xx board nvram data structure
    MAINTAINERS: Add KVM for MIPS entry
    MIPS: KVM: Add missing newline to kvm_err()
    MIPS: Move KVM specific opcodes into asm/inst.h
    MIPS: KVM: Use cacheops.h definitions
    MIPS: Break down cacheops.h definitions
    MIPS: Use EXCCODE_ constants with set_except_vector()
    MIPS: Update trap codes
    MIPS: Move Cause.ExcCode trap codes to mipsregs.h
    MIPS: KVM: Make kvm_mips_{init,exit}() static
    MIPS: KVM: Refactor added offsetof()s
    MIPS: KVM: Convert EXPORT_SYMBOL to _GPL
    ...

    Linus Torvalds
     
  • Pull Ceph updates from Sage Weil:
    "The two main changes are aio support in CephFS, and a series that
    fixes several issues in the authentication key timeout/renewal code.

    On top of that are a variety of cleanups and minor bug fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    libceph: remove outdated comment
    libceph: kill off ceph_x_ticket_handler::validity
    libceph: invalidate AUTH in addition to a service ticket
    libceph: fix authorizer invalidation, take 2
    libceph: clear messenger auth_retry flag if we fault
    libceph: fix ceph_msg_revoke()
    libceph: use list_for_each_entry_safe
    ceph: use i_size_{read,write} to get/set i_size
    ceph: re-send AIO write request when getting -EOLDSNAP error
    ceph: Asynchronous IO support
    ceph: Avoid to propagate the invalid page point
    ceph: fix double page_unlock() in page_mkwrite()
    rbd: delete an unnecessary check before rbd_dev_destroy()
    libceph: use list_next_entry instead of list_entry_next
    ceph: ceph_frag_contains_value can be boolean
    ceph: remove unused functions in ceph_frag.h

    Linus Torvalds
     
  • Pull SMB3 fixes from Steve French:
    "A collection of CIFS/SMB3 fixes.

    It includes a couple bug fixes, a few for improved debugging of
    cifs.ko and some improvements to the way cifs does key generation.

    I do have some additional bug fixes I expect in the next week or two
    (to address a problem found by xfstest, and some fixes for SMB3.11
    dialect, and a couple patches that just came in yesterday that I am
    reviewing)"

    * 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
    cifs_dbg() outputs an uninitialized buffer in cifs_readdir()
    cifs: fix race between call_async() and reconnect()
    Prepare for encryption support (first part). Add decryption and encryption key generation. Thanks to Metze for helping with this.
    cifs: Allow using O_DIRECT with cache=loose
    cifs: Make echo interval tunable
    cifs: Check uniqueid for SMB2+ and return -ESTALE if necessary
    Print IP address of unresponsive server
    cifs: Ratelimit kernel log messages

    Linus Torvalds
     

24 Jan, 2016

2 commits

  • Pull final vfs updates from Al Viro:

    - The ->i_mutex wrappers (with small prereq in lustre)

    - a fix for too early freeing of symlink bodies on shmem (they need to
    be RCU-delayed) (-stable fodder)

    - followup to dedupe stuff merged this cycle

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: abort dedupe loop if fatal signals are pending
    make sure that freeing shmem fast symlinks is RCU-delayed
    wrappers for ->i_mutex access
    lustre: remove unused declaration

    Linus Torvalds
     
  • Pull NFS client bugfixes and cleanups from Trond Myklebust:
    "Bugfixes:
    - pNFS/flexfiles: Fix an XDR encoding bug in layoutreturn
    - pNFS/flexfiles: Improve merging of errors in LAYOUTRETURN

    Cleanups:
    - NFS: Simplify nfs_request_add_commit_list() arguments"

    * tag 'nfs-for-4.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    pNFS/flexfiles: Fix an XDR encoding bug in layoutreturn
    NFS: Simplify nfs_request_add_commit_list() arguments
    pNFS/flexfiles: Improve merging of errors in LAYOUTRETURN

    Linus Torvalds
     

23 Jan, 2016

10 commits

  • If the program running dedupe receives a fatal signal during the
    dedupe loop, we should bail out to avoid tying up the system.

    Signed-off-by: Darrick J. Wong
    Signed-off-by: Al Viro

    Darrick J. Wong
     
  • There are many locations that do

    if (memory_was_allocated_by_vmalloc)
            vfree(ptr);
    else
            kfree(ptr);

    but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
    using is_vmalloc_addr(). Unless callers have special reasons, we can
    replace this branch with kvfree(). Please check and reply if you find
    problems.
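
    The idea of choosing the release routine from the address alone can
    be modeled in userspace. This is a sketch under stated assumptions: the
    *_model functions are hypothetical stand-ins, a static arena plays the
    role of the vmalloc address range, and the arena is a simple bump
    allocator that never reuses memory.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these are the kernel functions. */
static _Alignas(16) unsigned char vmalloc_arena[4096]; /* models vmalloc range */
static size_t arena_used;
int arena_frees;    /* how often the arena release path ran */
int heap_frees;     /* how often the heap release path ran */

/* Address-range test, playing the role of is_vmalloc_addr(). */
int is_vmalloc_addr_model(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)vmalloc_arena;
    return a >= lo && a < lo + sizeof(vmalloc_arena);
}

/* Bump-allocate from the arena, rounding the size up to 16 bytes. */
void *vmalloc_model(size_t n)
{
    if (n > sizeof(vmalloc_arena))
        return NULL;
    size_t need = (n + 15) & ~(size_t)15;
    if (need > sizeof(vmalloc_arena) - arena_used)
        return NULL;
    void *p = vmalloc_arena + arena_used;
    arena_used += need;
    return p;
}

void vfree_model(void *p)
{
    assert(is_vmalloc_addr_model(p));
    arena_frees++;          /* the bump arena keeps the memory */
}

void *kmalloc_model(size_t n)
{
    return malloc(n);
}

void kfree_model(void *p)
{
    heap_frees++;
    free(p);
}

/* One release call for both kinds of memory: the address decides. */
void kvfree_model(void *p)
{
    if (is_vmalloc_addr_model(p))
        vfree_model(p);
    else
        kfree_model(p);
}
```

    A caller that used to carry a "was this vmalloc()ed?" flag only to pick
    the right free routine can drop the flag and call the combined helper.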

    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Acked-by: Jan Kara
    Acked-by: Russell King
    Reviewed-by: Andreas Dilger
    Acked-by: "Rafael J. Wysocki"
    Acked-by: David Rientjes
    Cc: "Luck, Tony"
    Cc: Oleg Drokin
    Cc: Boris Petkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Previously in DAX we assumed that calls to get_block() would set
    bh.b_bdev, and we would then use that value even in error cases for
    debugging. This caused a NULL pointer dereference in __dax_dbg() which
    was fixed by a previous commit, but that commit only changed the one
    place where we were hitting an error.

    Instead, update dax.c so that we always initialize bh.b_bdev as best we
    can based on the information that DAX has. get_block() may or may not
    update to a new value, but this at least lets us get something helpful
    from bh.b_bdev for error messages and not have to worry about whether it
    was set by get_block() or not.

    Signed-off-by: Ross Zwisler
    Reported-by: Jan Kara
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • To properly support the new DAX fsync/msync infrastructure filesystems
    need to call dax_pfn_mkwrite() so that DAX can track when user pages are
    dirtied.

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • To properly support the new DAX fsync/msync infrastructure filesystems
    need to call dax_pfn_mkwrite() so that DAX can track when user pages are
    dirtied.

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • To properly support the new DAX fsync/msync infrastructure filesystems
    need to call dax_pfn_mkwrite() so that DAX can track when user pages are
    dirtied.

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • To properly handle fsync/msync in an efficient way DAX needs to track
    dirty pages so it is able to flush them durably to media on demand.

    The tracking of dirty pages is done via the radix tree in struct
    address_space. This radix tree is already used by the page writeback
    infrastructure for tracking dirty pages associated with an open file,
    and it already has support for exceptional (non struct page*) entries.
    We build upon these features to add exceptional entries to the radix
    tree for DAX dirty PMD or PTE pages at fault time.

    [dan.j.williams@intel.com: fix dax_pmd_dbg build warning]
    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Add support for tracking dirty DAX entries in the struct address_space
    radix tree. This tree is already used for dirty page writeback, and it
    already supports the use of exceptional (non struct page*) entries.

    In order to properly track dirty DAX pages we will insert new
    exceptional entries into the radix tree that represent dirty DAX PTE or
    PMD pages. These exceptional entries will also contain the writeback
    addresses for the PTE or PMD faults that we can use at fsync/msync time.

    There are currently two types of exceptional entries (shmem and shadow)
    that can be placed into the radix tree, and this adds a third. We rely
    on the fact that only one type of exceptional entry can be found in a
    given radix tree based on its usage. This happens for free with DAX vs
    shmem but we explicitly prevent shadow entries from being added to radix
    trees for DAX mappings.

    The only shadow entries that would be generated for DAX radix trees
    would be to track zero page mappings that were created for holes. These
    pages would receive minimal benefit from having shadow entries, and the
    choice to have only one type of exceptional entry in a given radix tree
    makes the logic simpler both in clear_exceptional_entry() and in the
    rest of DAX.
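
    An exceptional entry is a tree slot that holds a tagged value instead
    of a pointer to a struct page. The sketch below shows the general
    tagged-pointer technique only; the tag bit, the shift and the function
    names are assumptions for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bit 1 marks a slot as exceptional, and the payload
 * lives in the bits above the two low bits. Ordinary pointers to
 * objects aligned to at least 4 bytes have both low bits clear. */
#define EXCEPTIONAL_BIT   ((uintptr_t)2)
#define EXCEPTIONAL_SHIFT 2

/* Pack a payload (for example a writeback address) into a slot value. */
void *make_exceptional(uintptr_t payload)
{
    /* The payload must survive the shift without losing high bits. */
    assert(((payload << EXCEPTIONAL_SHIFT) >> EXCEPTIONAL_SHIFT) == payload);
    return (void *)((payload << EXCEPTIONAL_SHIFT) | EXCEPTIONAL_BIT);
}

/* True if the slot holds a tagged value rather than a real pointer. */
int is_exceptional(const void *entry)
{
    return ((uintptr_t)entry & EXCEPTIONAL_BIT) != 0;
}

/* Recover the payload from an exceptional slot value. */
uintptr_t exceptional_payload(const void *entry)
{
    assert(is_exceptional(entry));
    return (uintptr_t)entry >> EXCEPTIONAL_SHIFT;
}
```

    Because a single tag bit cannot say which kind of exceptional entry a
    slot holds, a scheme like this only works if each tree contains one
    kind, which is the restriction described above.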

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • When we get a DAX PMD fault for a write it is possible that there could
    be some number of 4k zero pages already present for the same range that
    were inserted to service reads from a hole. These 4k zero pages need to
    be unmapped from the VMAs and removed from the struct address_space
    radix tree before the real DAX PMD entry can be inserted.

    For PTE faults this same use case also exists and is handled by a
    combination of unmap_mapping_range() to unmap the VMAs and
    delete_from_page_cache() to remove the page from the address_space radix
    tree.

    For PMD faults we do have a call to unmap_mapping_range() (protected by
    a buffer_new() check), but nothing clears out the radix tree entry. The
    buffer_new() check is also incorrect as the current ext4 and XFS
    filesystem code will never return a buffer_head with BH_New set, even
    when allocating new blocks over a hole. Instead the filesystem will
    zero the blocks manually and return a buffer_head with only BH_Mapped
    set.

    Fix this situation by removing the buffer_new() check and adding a call
    to truncate_inode_pages_range() to clear out the radix tree entries
    before we insert the DAX PMD.

    Signed-off-by: Ross Zwisler
    Reported-by: Dan Williams
    Tested-by: Dan Williams
    Reviewed-by: Jan Kara
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • In __dax_pmd_fault() we currently assume that get_block() will always
    set bh.b_bdev and we unconditionally dereference it in __dax_dbg().

    This assumption isn't always true - when called for reads of holes
    ext4_dax_mmap_get_block() returns a buffer head where bh->b_bdev is
    never set. I hit this BUG while testing the DAX PMD fault path.

    Instead, initialize bh.b_bdev before passing bh into get_block(). It is
    possible that the filesystem's get_block() will update bh.b_bdev, and
    this is fine - we just want to initialize bh.b_bdev to something
    reasonable so that the calls to __dax_dbg() work and print something
    useful.
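
    The pattern is to give the field a sensible default before handing
    the structure to a callback that may leave it untouched. A minimal
    userspace sketch, with hypothetical bdev_model, bh_model and
    get_block_model() standing in for the real types and callback:

```c
#include <assert.h>
#include <stddef.h>

struct bdev_model {
    const char *name;
};

struct bh_model {
    struct bdev_model *b_bdev;
    int mapped;
};

/* Device the stand-in filesystem reports for mapped blocks. */
static struct bdev_model fs_bdev = { "fs-device" };

/* Stand-in for get_block(): for a hole it returns without touching
 * b_bdev, otherwise it maps the block and sets b_bdev itself. */
int get_block_model(struct bh_model *bh, int is_hole)
{
    if (is_hole) {
        bh->mapped = 0;
        return 0;
    }
    bh->mapped = 1;
    bh->b_bdev = &fs_bdev;
    return 0;
}

/* Returns the device name a debug message would print. The default is
 * set before the callback runs, so b_bdev is never NULL afterwards. */
const char *fault_debug_name(struct bdev_model *default_bdev, int is_hole)
{
    assert(default_bdev != NULL);
    struct bh_model bh = { .b_bdev = default_bdev, .mapped = 0 };
    (void)get_block_model(&bh, is_hole);
    return bh.b_bdev->name;
}
```

    For a hole the caller's default survives; for a mapped block the
    callback's value wins. Either way the debug path has a valid pointer.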

    Signed-off-by: Ross Zwisler
    Reported-by: Dan Williams
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler