31 Jul, 2017

1 commit

  • Two variables in ext4_inode_info, i_reserved_meta_blocks and
    i_allocated_meta_blocks, are unused. Removing them saves a little
    memory per in-memory inode and cleans up clutter in several tracepoints.
    Adjust tracepoint output from ext4_alloc_da_blocks() for consistency
    and fix a typo and whitespace near these changes.

    Signed-off-by: Eric Whitney
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara

    Eric Whitney
     

13 Jul, 2017

2 commits

  • __GFP_REPEAT was designed to give the page allocator a
    retry-but-eventually-fail semantic. This has been true, but only for
    allocation requests larger than PAGE_ALLOC_COSTLY_ORDER. It has always
    been ignored for smaller sizes. This is a bit unfortunate because there
    is no way to express the same semantic for those requests, and they are
    considered too important to fail, so they might end up looping in the
    page allocator forever, similarly to GFP_NOFAIL requests.

    Now that the whole tree has been cleaned up and accidental or misled
    usage of __GFP_REPEAT flag has been removed for !costly requests we can
    give the original flag a better name and more importantly a more useful
    semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
    that the allocator would try really hard but there is no promise of a
    success. This will work independent of the order and overrides the
    default allocator behavior. Page allocator users have several levels of
    guarantee vs. cost options (take GFP_KERNEL as an example)

    - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
    attempt to free memory at all. The most lightweight mode, which doesn't
    even kick the background reclaim. Should be used carefully because
    it might deplete the memory and the next user might hit the more
    aggressive reclaim.

    - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
    allocation without any attempt to free memory from the current
    context but can wake kswapd to reclaim memory if the zone is below
    the low watermark. Can be used from either atomic contexts or when
    the request is a performance optimization and there is another
    fallback for a slow path.

    - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
    non sleeping allocation with an expensive fallback so it can access
    some portion of memory reserves. Usually used from interrupt/bh
    context with an expensive slow path fallback.

    - GFP_KERNEL - both background and direct reclaim are allowed and the
    _default_ page allocator behavior is used. That means that !costly
    allocation requests are basically nofail but there is no guarantee of
    that behavior so failures have to be checked properly by callers
    (e.g. OOM killer victim is allowed to fail currently).

    - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
    and all allocation requests fail early rather than cause disruptive
    reclaim (one round of reclaim in this implementation). The OOM killer
    is not invoked.

    - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
    behavior and all allocation requests try really hard. The request
    will fail if the reclaim cannot make any progress. The OOM killer
    won't be triggered.

    - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
    and all allocation requests will loop endlessly until they succeed.
    This might be really dangerous especially for larger orders.

    Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
    because they already had their semantic. No new users are added.
    __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
    there is no progress and we have already passed the OOM point.

    This means that all the reclaim opportunities have been exhausted except
    the most disruptive one (the OOM killer) and a user defined fallback
    behavior is more sensible than to keep retrying in the page allocator.
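
As a rough illustration of the retry levels listed above, the following toy Python model (not kernel code) encodes when a request would keep looping after a failed attempt. The names `Policy` and `keeps_retrying`, and the reduction of reclaim to a round counter plus a single `made_progress` flag, are assumptions made for this sketch. It only covers !costly requests; the real decision is made in __alloc_pages_slowpath and is considerably more involved.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical labels for four of the modes described in the commit message."""
    DEFAULT = "GFP_KERNEL"
    NORETRY = "GFP_KERNEL | __GFP_NORETRY"
    RETRY_MAYFAIL = "GFP_KERNEL | __GFP_RETRY_MAYFAIL"
    NOFAIL = "GFP_KERNEL | __GFP_NOFAIL"


def keeps_retrying(policy, reclaim_rounds, made_progress):
    """Return True if this toy allocator loops again after a failed attempt.

    reclaim_rounds: number of reclaim rounds already performed (assumed input).
    made_progress:  whether the last reclaim round freed anything (assumed input).
    """
    if policy is Policy.NOFAIL:
        return True                  # loops until the allocation succeeds
    if policy is Policy.NORETRY:
        return reclaim_rounds < 1    # one round of reclaim, then fail early
    if policy is Policy.RETRY_MAYFAIL:
        return made_progress         # tries hard, fails once reclaim stalls
    return True                      # default !costly requests: basically nofail


if __name__ == "__main__":
    for policy in Policy:
        verdict = keeps_retrying(policy, reclaim_rounds=3, made_progress=False)
        print(f"{policy.value}: keeps retrying = {verdict}")
```

The model leaves out the OOM killer entirely; per the description above, neither __GFP_NORETRY nor __GFP_RETRY_MAYFAIL triggers it.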

    [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
    [mhocko@suse.com: semantic fix]
    Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
    [mhocko@kernel.org: address other thing spotted by Vlastimil]
    Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Pull i2c updates from Wolfram Sang:
    "This pull request contains:

    - i2c core reorganization. One source file became too monolithic. It
    is now split up, yet we still have the same named object as the
    final output. This should ease maintenance.

    - new drivers: ZTE ZX2967 family, ASPEED 24XX/25XX

    - designware driver gained slave mode support

    - xgene-slimpro driver gained ACPI support

    - bigger overhaul for pca-platform driver

    - the algo-bit module now supports messages with enforced STOP

    - slightly bigger than usual set of driver updates and improvements

    and with much appreciated quality assurance from Andy Shevchenko"

    * 'i2c/for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (51 commits)
    i2c: Provide a stub for i2c_detect_slave_mode()
    i2c: designware: Let slave adapter support be optional
    i2c: designware: Make HW init functions static
    i2c: designware: fix spelling mistakes
    i2c: pca-platform: propagate error from i2c_pca_add_numbered_bus
    i2c: pca-platform: correctly set algo_data.reset_chip
    i2c: acpi: Do not create i2c-clients for LNXVIDEO ACPI devices
    i2c: designware: enable SLAVE in platform module
    i2c: designware: add SLAVE mode functions
    i2c: zx2967: drop COMPILE_TEST dependency
    i2c: zx2967: always use the same device when printing errors
    i2c: pca-platform: use dev_warn/dev_info instead of printk
    i2c: pca-platform: use device managed allocations
    i2c: pca-platform: add devicetree awareness
    i2c: pca-platform: switch to struct gpio_desc
    dt-bindings: add bindings for i2c-pca-platform
    i2c: cadance: fix ctrl/addr reg write order
    i2c: zx2967: add i2c controller driver for ZTE's zx2967 family
    dt: bindings: add documentation for zx2967 family i2c controller
    i2c: algo-bit: add support for I2C_M_STOP
    ...

    Linus Torvalds
     

11 Jul, 2017

4 commits

  • Merge more updates from Andrew Morton:

    - most of the rest of MM

    - KASAN updates

    - lib/ updates

    - checkpatch updates

    - some binfmt_elf changes

    - various misc bits

    * emailed patches from Andrew Morton: (115 commits)
    kernel/exit.c: avoid undefined behaviour when calling wait4()
    kernel/signal.c: avoid undefined behaviour in kill_something_info
    binfmt_elf: safely increment argv pointers
    s390: reduce ELF_ET_DYN_BASE
    powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB
    arm64: move ELF_ET_DYN_BASE to 4GB / 4MB
    arm: move ELF_ET_DYN_BASE to 4MB
    binfmt_elf: use ELF_ET_DYN_BASE only for PIE
    fs, epoll: short circuit fetching events if thread has been killed
    checkpatch: improve multi-line alignment test
    checkpatch: improve macro reuse test
    checkpatch: change format of --color argument to --color[=WHEN]
    checkpatch: silence perl 5.26.0 unescaped left brace warnings
    checkpatch: improve tests for multiple line function definitions
    checkpatch: remove false warning for commit reference
    checkpatch: fix stepping through statements with $stat and ctx_statement_block
    checkpatch: [HLP]LIST_HEAD is also declaration
    checkpatch: warn when a MAINTAINERS entry isn't [A-Z]:\t
    checkpatch: improve the unnecessary OOM message test
    lib/bsearch.c: micro-optimize pivot position calculation
    ...

    Linus Torvalds
     
  • During the debugging of the problem described in
    https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in
    https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug
    output is not really useful to understand issues related to the oom
    reaper.

    So I assume that adding some tracepoints might help with debugging of
    similar issues.

    Trace the following events:
    1) a process is marked as an oom victim,
    2) a process is added to the oom reaper list,
    3) the oom reaper starts reaping process's mm,
    4) the oom reaper finished reaping,
    5) the oom reaper skips reaping.

    How does it work in practice? Below is an example which shows how the
    problem mentioned above can be found: one process is added twice to the
    oom_reaper list:

    $ cd /sys/kernel/debug/tracing
    $ echo "oom:mark_victim" > set_event
    $ echo "oom:wake_reaper" >> set_event
    $ echo "oom:skip_task_reaping" >> set_event
    $ echo "oom:start_task_reaping" >> set_event
    $ echo "oom:finish_task_reaping" >> set_event
    $ cat trace_pipe
    allocate-502 [001] .... 91.836405: mark_victim: pid=502
    allocate-502 [001] .N.. 91.837356: wake_reaper: pid=502
    allocate-502 [000] .N.. 91.871149: wake_reaper: pid=502
    oom_reaper-23 [000] .... 91.871177: start_task_reaping: pid=502
    oom_reaper-23 [000] .N.. 91.879511: finish_task_reaping: pid=502
    oom_reaper-23 [000] .... 91.879580: skip_task_reaping: pid=502
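
The double wake-up in the trace above can also be spotted mechanically. The sketch below is a hypothetical helper, not part of the patch. It assumes every trace line has the same layout as the sample output (task-pid, CPU in brackets, flags, timestamp, event name, pid=N) and that the task name contains no spaces.

```python
import re
from collections import Counter

# Matches lines such as:
#   allocate-502 [001] .N.. 91.837356: wake_reaper: pid=502
EVENT_RE = re.compile(
    r"^\s*\S+\s+\[\d+\]\s+\S+\s+[\d.]+:\s+(\w+):\s+pid=(\d+)"
)


def duplicate_wakeups(lines):
    """Return {pid: count} for pids seen in more than one wake_reaper event."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.match(line)
        if match and match.group(1) == "wake_reaper":
            counts[int(match.group(2))] += 1
    return {pid: n for pid, n in counts.items() if n > 1}


# The sample output from the commit message above.
SAMPLE = """\
allocate-502 [001] .... 91.836405: mark_victim: pid=502
allocate-502 [001] .N.. 91.837356: wake_reaper: pid=502
allocate-502 [000] .N.. 91.871149: wake_reaper: pid=502
oom_reaper-23 [000] .... 91.871177: start_task_reaping: pid=502
oom_reaper-23 [000] .N.. 91.879511: finish_task_reaping: pid=502
oom_reaper-23 [000] .... 91.879580: skip_task_reaping: pid=502
""".splitlines()

if __name__ == "__main__":
    print(duplicate_wakeups(SAMPLE))  # {502: 2}
```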

    Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • After enabling CONFIG_TRACE_ENUM_MAP_FILE (which will soon be renamed to
    CONFIG_TRACE_EVAL_MAP_FILE), I am able to examine the enums that have
    been evaluated:

    # cat /sys/kernel/debug/tracing/enum_map

    (which will soon be renamed to eval_map)

    And it showed some interesting results:

    [..]
    ZONE_MOVABLE 3 (oom)
    ZONE_NORMAL 2 (oom)
    ZONE_DMA32 1 (oom)
    ZONE_DMA 0 (oom)
    3 3 (oom)
    2 2 (oom)
    1 1 (oom)
    COMPACT_PRIO_ASYNC 2 (oom)
    COMPACT_PRIO_SYNC_LIGHT 1 (oom)
    COMPACT_PRIO_SYNC_FULL 0 (oom)
    [..]
    ZONE_DMA 0 (vmscan)
    3 3 (vmscan)
    2 2 (vmscan)
    1 1 (vmscan)
    COMPACT_PRIO_ASYNC 2 (vmscan)
    [..]
    ZONE_DMA 0 (kmem)
    3 3 (kmem)
    2 2 (kmem)
    1 1 (kmem)
    COMPACT_PRIO_ASYNC 2 (kmem)
    [..]
    ZONE_DMA 0 (compaction)
    3 3 (compaction)
    2 2 (compaction)
    1 1 (compaction)
    COMPACT_PRIO_ASYNC 2 (compaction)
    [..]

    The names within the parentheses are the trace systems that the enum/eval
    maps are associated with. When there's a number evaluated to another
    number, that tells me that the TRACE_DEFINE_ENUM() was used on a #define
    and not an enum. As #defines get converted normally, they do not need
    to be evaluated.

    Each of the above trace systems with the number to number evaluation
    included the file include/trace/events/mmflags.h which has:

    /* High-level compaction status feedback */
    #define COMPACTION_FAILED 1
    #define COMPACTION_WITHDRAWN 2
    #define COMPACTION_PROGRESS 3

    [..]

    #define COMPACTION_FEEDBACK \
    EM(COMPACTION_FAILED, "failed") \
    EM(COMPACTION_WITHDRAWN, "withdrawn") \
    EMe(COMPACTION_PROGRESS, "progress")

    Which is still needed for the __print_symbolic() usage in the
    trace_event. But it does not need to be evaluated.

    Removing the evaluation part removes the unnecessary evaluations of
    numbers to numbers.
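
The number-to-number entries can be picked out of such a listing with a few lines of script. The following is a minimal sketch, assuming the file prints one "name value (system)" triple per line as in the excerpt above; `numeric_entries` is a hypothetical helper name.

```python
def numeric_entries(lines):
    """Yield (name, value, system) for entries whose name is itself a number."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skips "[..]" markers and blank lines
        name, value, system = parts
        if name.isdigit():
            yield name, int(value), system.strip("()")


# A few lines taken from the excerpt in the commit message above.
SAMPLE = """\
ZONE_DMA32 1 (oom)
ZONE_DMA 0 (oom)
3 3 (oom)
2 2 (oom)
1 1 (oom)
COMPACT_PRIO_ASYNC 2 (oom)
[..]
ZONE_DMA 0 (vmscan)
3 3 (vmscan)
""".splitlines()

if __name__ == "__main__":
    for name, value, system in numeric_entries(SAMPLE):
        print(f"{system}: {name} -> {value}")
```

Each reported line points at a trace system where TRACE_DEFINE_ENUM() was applied to a #define rather than to an enum.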

    Link: http://lkml.kernel.org/r/20170615074944.7be9a647@gandalf.local.home
    Signed-off-by: Steven Rostedt (VMware)
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt (VMware)
     
  • Pull f2fs updates from Jaegeuk Kim:
    "In this round, we've added new features such as disk quota and statx,
    and modified internal bio management flow to merge more IOs depending
    on block types. We've also made internal threads freezeable for
    Android battery life. In addition to them, there are some patches to
    avoid lock contention as well as a couple of deadlock conditions.

    Enhancements:
    - support usrquota, grpquota, and statx
    - manage DATA/NODE typed bios separately to serialize more IOs
    - modify f2fs_lock_op/wio_mutex to avoid lock contention
    - prevent lock contention in migratepage

    Bug fixes:
    - fix missing load of written inode flag
    - fix worst case victim selection in GC
    - freezeable GC and discard threads for Android battery life
    - sanitize f2fs metadata to deal with security hole
    - clean up sysfs-related code and docs"

    * tag 'for-f2fs-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (59 commits)
    f2fs: support plain user/group quota
    f2fs: avoid deadlock caused by lock order of page and lock_op
    f2fs: use spin_{,un}lock_irq{save,restore}
    f2fs: relax migratepage for atomic written page
    f2fs: don't count inode block in in-memory inode.i_blocks
    Revert "f2fs: fix to clean previous mount option when remount_fs"
    f2fs: do not set LOST_PINO for renamed dir
    f2fs: do not set LOST_PINO for newly created dir
    f2fs: skip ->writepages for {mete,node}_inode during recovery
    f2fs: introduce __check_sit_bitmap
    f2fs: stop gc/discard thread in prior during umount
    f2fs: introduce reserved_blocks in sysfs
    f2fs: avoid redundant f2fs_flush after remount
    f2fs: report # of free inodes more precisely
    f2fs: add ioctl to do gc with target block address
    f2fs: don't need to check encrypted inode for partial truncation
    f2fs: measure inode.i_blocks as generic filesystem
    f2fs: set CP_TRIMMED_FLAG correctly
    f2fs: require key for truncate(2) of encrypted file
    f2fs: move sysfs code from super.c to fs/f2fs/sysfs.c
    ...

    Linus Torvalds
     

08 Jul, 2017

1 commit

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may not even be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs that that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can go in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"
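
The flag-based behaviour criticised in the pull message above can be modelled in a few lines. This is a toy sketch, not kernel code: `FlagMapping` is a hypothetical stand-in for an address_space with a single AS_EIO-style flag, and `check_errors` mimics the test-and-clear side effect attributed to filemap_check_errors.

```python
import errno


class FlagMapping:
    """Toy stand-in for an address_space that tracks errors with one flag."""

    def __init__(self):
        self.eio = False  # models AS_EIO

    def set_error(self):
        """Record a writeback failure, as mapping_set_error would."""
        self.eio = True

    def check_errors(self):
        """Report the stored error and clear it in the same step."""
        had_error, self.eio = self.eio, False
        return -errno.EIO if had_error else 0


if __name__ == "__main__":
    mapping = FlagMapping()
    mapping.set_error()
    # Two tasks fsync the same file one after the other.
    print("first caller: ", mapping.check_errors())   # sees -EIO
    print("second caller:", mapping.check_errors())   # sees 0
```

Only the first caller learns about the failure. Any internal caller that checks the flag first has the same effect, which is how an error can be consumed before fsync ever reports it.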

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
     

07 Jul, 2017

1 commit

  • Pull tracing updates from Steven Rostedt:
    "The new features of this release:

    - Added TRACE_DEFINE_SIZEOF() which allows trace events that use
    sizeof() in the TP_printk() to be converted to the actual size such
    that trace-cmd and perf can parse them correctly.

    - Some rework of the TRACE_DEFINE_ENUM() such that the above
    TRACE_DEFINE_SIZEOF() could reuse the same code.

    - Recording of tgid (Thread Group ID). This is similar to how task
    COMMs are recorded (cached at sched_switch), where it is in a table
    and used on output of the trace and trace_pipe files.

    - Have ":mod:" be cached when written into set_ftrace_filter.
    Then the functions of the module will be traced at module load.

    - Some random clean ups and small fixes"

    * tag 'trace-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (26 commits)
    ftrace: Test for NULL iter->tr in regex for stack_trace_filter changes
    ftrace: Decrement count for dyn_ftrace_total_info for init functions
    ftrace: Unlock hash mutex on failed allocation in process_mod_list()
    tracing: Add support for display of tgid in trace output
    tracing: Add support for recording tgid of tasks
    ftrace: Decrement count for dyn_ftrace_total_info file
    ftrace: Remove unused function ftrace_arch_read_dyn_info()
    sh/ftrace: Remove only user of ftrace_arch_read_dyn_info()
    ftrace: Have cached module filters be an active filter
    ftrace: Implement cached modules tracing on module load
    ftrace: Have the cached module list show in set_ftrace_filter
    ftrace: Add :mod: caching infrastructure to trace_array
    tracing: Show address when function names are not found
    ftrace: Add missing comment for FTRACE_OPS_FL_RCU
    tracing: Rename update the enum_map file
    tracing: Add TRACE_DEFINE_SIZEOF() macros
    tracing: define TRACE_DEFINE_SIZEOF() macro to map sizeof's to their values
    tracing: Rename enum_replace to eval_replace
    trace: rename enum_map functions
    trace: rename trace.c enum functions
    ...

    Linus Torvalds
     

06 Jul, 2017

4 commits

  • Pull percpu updates from Tejun Heo:
    "These are the percpu changes for the v4.13-rc1 merge window. There are
    a couple visibility related changes - tracepoints and allocator stats
    through debugfs, along with __ro_after_init markings and a cosmetic
    rename in percpu_counter.

    Please note that the simple O(#elements_in_the_chunk) area allocator
    used by percpu allocator is again showing scalability issues,
    primarily with bpf allocating and freeing large number of counters.
    Dennis is working on the replacement allocator and the percpu
    allocator will be seeing increased churn in the coming cycles"

    * 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu: fix static checker warnings in pcpu_destroy_chunk
    percpu: fix early calls for spinlock in pcpu_stats
    percpu: resolve err may not be initialized in pcpu_alloc
    percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
    percpu: add tracepoint support for percpu memory
    percpu: expose statistics about percpu memory via debugfs
    percpu: migrate percpu data structures to internal header
    percpu: add missing lockdep_assert_held to func pcpu_free_area
    mark most percpu globals as __ro_after_init

    Linus Torvalds
     
  • Most filesystems currently use mapping_set_error and
    filemap_check_errors for setting and reporting/clearing writeback errors
    at the mapping level. filemap_check_errors is indirectly called from
    most of the filemap_fdatawait_* functions and from
    filemap_write_and_wait*. These functions are called from all sorts of
    contexts to wait on writeback to finish -- e.g. mostly in fsync, but
    also in truncate calls, getattr, etc.

    The non-fsync callers are problematic. We should be reporting writeback
    errors during fsync, but many places spread over the tree clear out
    errors before they can be properly reported, or report errors at
    nonsensical times.

    If I get -EIO on a stat() call, there is no reason for me to assume that
    it is because some previous writeback failed. The fact that it also
    clears out the error such that a subsequent fsync returns 0 is a bug,
    and a nasty one since that's potentially silent data corruption.

    This patch adds a small bit of new infrastructure for setting and
    reporting errors during address_space writeback. While the above was my
    original impetus for adding this, I think it's also the case that
    current fsync semantics are just problematic for userland. Most
    applications that call fsync do so to ensure that the data they wrote
    has hit the backing store.

    In the case where there are multiple writers to the file at the same
    time, this is really hard to determine. The first one to call fsync will
    see any stored error, and the rest get back 0. The processes with open
    fds may not be associated with one another in any way. They could even
    be in different containers, so ensuring coordination between all fsync
    callers is not really an option.

    One way to remedy this would be to track what file descriptor was used
    to dirty the file, but that's rather cumbersome and would likely be
    slow. However, there is a simpler way to improve the semantics here
    without incurring too much overhead.

    This set adds an errseq_t to struct address_space, and a corresponding
    one is added to struct file. Writeback errors are recorded in the
    mapping's errseq_t, and the one in struct file is used as the "since"
    value.

    This changes the semantics of the Linux fsync implementation such that
    applications can now use it to determine whether there were any
    writeback errors since fsync(fd) was last called (or since the file was
    opened in the case of fsync having never been called).

    Note that those writeback errors may have occurred when writing data
    that was dirtied via an entirely different fd, but that's the case now
    with the current mapping_set_error/filemap_check_errors infrastructure.
    This will at least prevent you from getting a false report of success.

    The new behavior is still consistent with the POSIX spec, and is more
    reliable for application developers. This patch just adds some basic
    infrastructure for doing this, and ensures that the f_wb_err "cursor"
    is properly set when a file is opened. Later patches will change the
    existing code to use this new infrastructure for reporting errors at
    fsync time.
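
The "since" cursor idea described above can be sketched with a toy model. This is not the kernel's errseq_t, which packs an error code, a flag and a counter into one value. Here a plain counter and a separate error field are used, and the names `Mapping` and `OpenFile` are hypothetical.

```python
import errno


class Mapping:
    """Toy address_space: records the latest error plus a bump counter."""

    def __init__(self):
        self.seq = 0   # bumped on every recorded error
        self.err = 0   # most recent error, as a negative errno

    def set_error(self, err):
        self.err = err
        self.seq += 1


class OpenFile:
    """Toy struct file: remembers the mapping's counter as its cursor."""

    def __init__(self, mapping):
        self.mapping = mapping
        self.since = mapping.seq  # sampled when the file is opened

    def fsync(self):
        """Report an error if one was recorded since the cursor, then advance it."""
        if self.since == self.mapping.seq:
            return 0
        self.since = self.mapping.seq
        return self.mapping.err


if __name__ == "__main__":
    mapping = Mapping()
    a, b = OpenFile(mapping), OpenFile(mapping)
    mapping.set_error(-errno.EIO)
    print(a.fsync(), b.fsync())  # both openers see the error
    print(a.fsync())             # 0: nothing new since a's last fsync
```

Because each open file carries its own cursor, one caller's fsync no longer hides the error from the others. In this model a file opened after the error starts with a current cursor and reports 0.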

    Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara

    Jeff Layton
     
  • Pull btrfs updates from David Sterba:
    "The core updates improve error handling (mostly related to bios), with
    the usual incremental work on the GFP_NOFS (mis)use removal,
    refactoring or cleanups. Except the two top patches, all have been in
    for-next for an extensive amount of time.

    User visible changes:

    - statx support

    - quota override tunable

    - improved compression thresholds

    - obsoleted mount option alloc_start

    Core updates:

    - bio-related updates:
    - faster bio cloning
    - no allocation failures
    - preallocated flush bios

    - more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates

    - prep work for btree_inode removal

    - dir-item validation

    - qgroup fixes and updates

    - cleanups:
    - removed unused struct members, unused code, refactoring
    - argument refactoring (fs_info/root, caller -> callee sink)
    - SEARCH_TREE ioctl docs"

    * 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits)
    btrfs: Remove false alert when fiemap range is smaller than on-disk extent
    btrfs: Don't clear SGID when inheriting ACLs
    btrfs: fix integer overflow in calc_reclaim_items_nr
    btrfs: scrub: fix target device intialization while setting up scrub context
    btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
    btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
    btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
    btrfs: qgroup: Return actually freed bytes for qgroup release or free data
    btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
    btrfs: qgroup: Add quick exit for non-fs extents
    Btrfs: rework delayed ref total_bytes_pinned accounting
    Btrfs: return old and new total ref mods when adding delayed refs
    Btrfs: always account pinned bytes when dropping a tree block ref
    Btrfs: update total_bytes_pinned when pinning down extents
    Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()
    Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
    btrfs: fix validation of XATTR_ITEM dir items
    btrfs: Verify dir_item in iterate_object_props
    btrfs: Check name_len before in btrfs_del_root_ref
    btrfs: Check name_len before reading btrfs_get_name
    ...

    Linus Torvalds
     
  • Pull networking updates from David Miller:
    "Reasonably busy this cycle, but perhaps not as busy as in the 4.12
    merge window:

    1) Several optimizations for UDP processing under high load from
    Paolo Abeni.

    2) Support pacing internally in TCP when using the sch_fq packet
    scheduler for this is not practical. From Eric Dumazet.

    3) Support multiple filter chains per qdisc, from Jiri Pirko.

    4) Move to 1ms TCP timestamp clock, from Eric Dumazet.

    5) Add batch dequeueing to vhost_net, from Jason Wang.

    6) Flesh out more completely SCTP checksum offload support, from
    Davide Caratti.

    7) More plumbing of extended netlink ACKs, from David Ahern, Pablo
    Neira Ayuso, and Matthias Schiffer.

    8) Add devlink support to nfp driver, from Simon Horman.

    9) Add RTM_F_FIB_MATCH flag to RTM_GETROUTE queries, from Roopa
    Prabhu.

    10) Add stack depth tracking to BPF verifier and use this information
    in the various eBPF JITs. From Alexei Starovoitov.

    11) Support XDP on qed device VFs, from Yuval Mintz.

    12) Introduce BPF PROG ID for better introspection of installed BPF
    programs. From Martin KaFai Lau.

    13) Add bpf_set_hash helper for TC bpf programs, from Daniel Borkmann.

    14) For loads, allow narrower accesses in bpf verifier checking, from
    Yonghong Song.

    15) Support MIPS in the BPF selftests and samples infrastructure, the
    MIPS eBPF JIT will be merged in via the MIPS GIT tree. From David
    Daney.

    16) Support kernel based TLS, from Dave Watson and others.

    17) Remove completely DST garbage collection, from Wei Wang.

    18) Allow installing TCP MD5 rules using prefixes, from Ivan
    Delalande.

    19) Add XDP support to Intel i40e driver, from Björn Töpel

    20) Add support for TC flower offload in nfp driver, from Simon
    Horman, Pieter Jansen van Vuuren, Benjamin LaHaise, Jakub
    Kicinski, and Bert van Leeuwen.

    21) IPSEC offloading support in mlx5, from Ilan Tayari.

    22) Add HW PTP support to macb driver, from Rafal Ozieblo.

    23) Networking refcount_t conversions, From Elena Reshetova.

    24) Add sock_ops support to BPF, from Lawrence Brakmo. This is useful
    for tuning the TCP sockopt settings of a group of applications,
    currently via CGROUPs"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1899 commits)
    net: phy: dp83867: add workaround for incorrect RX_CTRL pin strap
    dt-bindings: phy: dp83867: provide a workaround for incorrect RX_CTRL pin strap
    cxgb4: Support for get_ts_info ethtool method
    cxgb4: Add PTP Hardware Clock (PHC) support
    cxgb4: time stamping interface for PTP
    nfp: default to chained metadata prepend format
    nfp: remove legacy MAC address lookup
    nfp: improve order of interfaces in breakout mode
    net: macb: remove extraneous return when MACB_EXT_DESC is defined
    bpf: add missing break in for the TCP_BPF_SNDCWND_CLAMP case
    bpf: fix return in load_bpf_file
    mpls: fix rtm policy in mpls_getroute
    net, ax25: convert ax25_cb.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_route.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_uid_assoc.refcount from atomic_t to refcount_t
    net, sctp: convert sctp_ep_common.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_transport.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_chunk.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_datamsg.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_auth_bytes.refcnt from atomic_t to refcount_t
    ...

    Linus Torvalds
     

05 Jul, 2017

1 commit

  • Pull spi updates from Mark Brown:
    "There's only one big change in this release but it's a very big
    change: Geert Uytterhoeven has implemented support for SPI slave mode.

    This feature has been on the cards since the subsystem was originally
    merged back in the mists of time so it's great that Geert stepped up
    and finally implemented it.

    - SPI slave support, together with wholesale renaming of SPI
    controllers from master to controller which went surprisingly
    smoothly. This is already used with Renesas SoCs and support is in
    the works for i.MX too.

    - New drivers for Meson SPICC and ST STM32"

    * tag 'spi-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (57 commits)
    spi: loopback-test: Fix kfree() NULL pointer error.
    spi: loopback-test: fix spelling mistake: "reruning" -> "rerunning"
    spi: sirf: fix spelling mistake: "registerred" -> "registered"
    spi: stm32: fix potential dereference null return value
    spi: stm32: enhance DMA error management
    spi: stm32: add runtime PM support
    spi: stm32: use normal conditional statements instead of ternary operator
    spi: stm32: replace st,spi-midi with st,spi-midi-ns to fit bindings
    spi: stm32: fix example with st,spi-midi-ns property
    spi: stm32: fix compatible to fit with new bindings
    spi: stm32: use SoC specific compatible
    spi: rockchip: Disable Runtime PM when chip select is asserted
    spi: rockchip: Set GPIO_SS flag to enable Slave Select with GPIO CS
    spi: atmel: fix corrupted data issue on SAM9 family SoCs
    spi: stm32: fix error check on mbr being -ve
    spi: add driver for STM32 SPI controller
    spi: Document the STM32 SPI bindings
    spi/bcm63xx: Fix checkpatch warnings
    spi: imx: Check for allocation failure earlier
    spi: mediatek: add spi support for mt2712 IC
    ...

    Linus Torvalds
     

04 Jul, 2017

1 commit

  • Pull char/misc updates from Greg KH:
    "Here is the "big" char/misc driver patchset for 4.13-rc1.

    Lots of stuff in here, a large thunderbolt update, w1 driver header
    reorg, the new mux driver subsystem, google firmware driver updates,
    and a raft of other smaller things. Full details in the shortlog.

    All of these have been in linux-next for a while with the only
    reported issue being a merge problem with this tree and the jc-docs
    tree in the w1 documentation area"

    * tag 'char-misc-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (147 commits)
    misc: apds990x: Use sysfs_match_string() helper
    mei: drop unreachable code in mei_start
    mei: validate the message header only in first fragment.
    DocBook: w1: Update W1 file locations and names in DocBook
    mux: adg792a: always require I2C support
    nvmem: rockchip-efuse: add support for rk322x-efuse
    nvmem: core: add locking to nvmem_find_cell
    nvmem: core: Call put_device() in nvmem_unregister()
    nvmem: core: fix leaks on registration errors
    nvmem: correct Broadcom OTP controller driver writes
    w1: Add subsystem kernel public interface
    drivers/fsi: Add module license to core driver
    drivers/fsi: Use asynchronous slave mode
    drivers/fsi: Add hub master support
    drivers/fsi: Add SCOM FSI client device driver
    drivers/fsi/gpio: Add tracepoints for GPIO master
    drivers/fsi: Add GPIO based FSI master
    drivers/fsi: Document FSI master sysfs files in ABI
    drivers/fsi: Add error handling for slave
    drivers/fsi: Add tracepoints for low-level operations
    ...

    Linus Torvalds
     

21 Jun, 2017

1 commit

  • Add support for tracepoints to the following events: chunk allocation,
    chunk free, area allocation, area free, and area allocation failure.
    This should let us replay percpu memory requests and evaluate
    corresponding decisions.

    Signed-off-by: Dennis Zhou
    Signed-off-by: Tejun Heo

    Dennis Zhou
     

20 Jun, 2017

1 commit

  • Commit 81fb6f77a026 (btrfs: qgroup: Add new trace point for
    qgroup data reserve) added the following events which aren't used.
    btrfs__qgroup_data_map
    btrfs_qgroup_init_data_rsv_map
    btrfs_qgroup_free_data_rsv_map
    So remove them.

    CC: quwenruo@cn.fujitsu.com
    Signed-off-by: Anand Jain
    Reviewed-by: Qu Wenruo
    Signed-off-by: David Sterba

    Anand Jain
     

14 Jun, 2017

5 commits

  • There are a few places in the kernel where sizeof() is already
    being used. Update those locations with TRACE_DEFINE_SIZEOF.

    Link: http://lkml.kernel.org/r/20170531215653.3240-12-jeremy.linton@arm.com

    Signed-off-by: Jeremy Linton
    Signed-off-by: Steven Rostedt (VMware)

    Jeremy Linton
     
  • Perf has a problem: if sizeof() is used within TRACE_EVENT()
    macros, it ends up in userspace as "sizeof(kernel structure)", which
    cannot be parsed properly. Add a macro which can forward this data
    through the eval_map for use in userspace.

    Link: http://lkml.kernel.org/r/20170531215653.3240-10-jeremy.linton@arm.com

    Signed-off-by: Jeremy Linton
    Signed-off-by: Steven Rostedt (VMware)

    Jeremy Linton
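
As a rough illustration of the problem described above, the following Python sketch models the substitution a userspace parser needs: `sizeof(...)` text in a format string is replaced with a number taken from a name-to-value map. The function `resolve_sizeof`, the map contents, and the structure name `example_hdr` are hypothetical and chosen only for illustration; this is not the kernel's implementation.

```python
import re

# Hypothetical map from sizeof() expressions to their values, standing in
# for the entries that would be forwarded to userspace.
EVAL_MAP = {"sizeof(struct example_hdr)": 16}

SIZEOF_RE = re.compile(r"sizeof\([^()]*\)")


def resolve_sizeof(print_fmt, eval_map):
    """Replace every known sizeof(...) expression with its numeric value.

    Unknown expressions are left untouched so the caller can detect them.
    """
    def substitute(match):
        text = match.group(0)
        return str(eval_map.get(text, text))

    return SIZEOF_RE.sub(substitute, print_fmt)


fmt = '"len=%d", REC->len / sizeof(struct example_hdr)'
resolved = resolve_sizeof(fmt, EVAL_MAP)
```

After substitution the format string contains only plain arithmetic, which a generic parser can evaluate without knowing any kernel structure layout.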
     
  • Each enum is loaded into the trace_enum_map. As we are now using
    this for more than enums, rename it.

    Link: http://lkml.kernel.org/r/20170531215653.3240-3-jeremy.linton@arm.com

    Signed-off-by: Jeremy Linton
    Signed-off-by: Steven Rostedt (VMware)

    Jeremy Linton
     
  • The kernel and its modules have sections containing the enum
    string to value conversions. Rename this section because we
    intend to store more than enums in it.

    Link: http://lkml.kernel.org/r/20170531215653.3240-2-jeremy.linton@arm.com

    Signed-off-by: Jeremy Linton
    Signed-off-by: Steven Rostedt (VMware)

    Jeremy Linton
     
  • Now that struct spi_master is used for both SPI master and slave
    controllers, it makes sense to rename it to struct spi_controller, and
    replace "master" by "controller" where appropriate.

    For now this conversion is done for SPI core infrastructure only.
    Wrappers are provided for backwards compatibility, until all SPI drivers
    have been converted.

    Noteworthy details:
    - SPI_MASTER_GPIO_SS is retained, as it only makes sense for SPI
    master controllers,
    - spi_busnum_to_master() is retained, as it looks up masters only,
    - A new field spi_device.controller is added, but spi_device.master is
    retained for compatibility (both are always initialized by
    spi_alloc_device()),
    - spi_flash_read() is used by SPI masters only.

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Mark Brown

    Geert Uytterhoeven
     

09 Jun, 2017

2 commits


08 Jun, 2017

1 commit

  • Currently rcu_barrier() uses call_rcu() to enqueue new callbacks
    on each CPU with a non-empty callback list. This works, but means
    that rcu_barrier() forces grace periods that are not otherwise needed.
    The key point is that rcu_barrier() never needs to wait for a grace
    period, but instead only for all pre-existing callbacks to be invoked.
    This means that rcu_barrier()'s new callbacks should be placed in
    the callback-list segment containing the last pre-existing callback.

    This commit makes this change using the new rcu_segcblist_entrain()
    function.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
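
The placement rule described above can be sketched with a toy model in Python. The class `SegmentedCallbackList`, its segment names, and its methods are hypothetical stand-ins for the kernel's segmented callback list, assuming four ordered segments; the sketch only shows where a barrier callback would land, not how grace periods advance.

```python
# Segments in order: callbacks ready to invoke come first, callbacks not yet
# associated with a grace period come last. The names are illustrative.
SEGMENTS = ("done", "wait", "next_ready", "next")


class SegmentedCallbackList:
    def __init__(self):
        self.segments = {name: [] for name in SEGMENTS}

    def enqueue(self, callback):
        """Model of a normal enqueue: new callbacks join the final segment."""
        self.segments["next"].append(callback)

    def entrain(self, callback):
        """Queue callback in the segment holding the last existing callback.

        Returns False without queuing anything if the list is empty.
        """
        for name in reversed(SEGMENTS):
            if self.segments[name]:
                self.segments[name].append(callback)
                return True
        return False


cbs = SegmentedCallbackList()
cbs.segments["wait"].extend(["cb_a", "cb_b"])
entrained = cbs.entrain("barrier_cb")
```

In this model the barrier callback follows `cb_b` in the "wait" segment rather than starting a fresh entry in "next", so it does not by itself require another grace period.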
     

05 Jun, 2017

1 commit

  • Make it possible for a client to use AuriStor's service upgrade facility.

    The client does this by adding an RXRPC_UPGRADE_SERVICE control message to
    the first sendmsg() of a call. This takes no parameters.

    When recvmsg() starts returning data from the call, the service ID field in
    the returned msg_name will reflect the result of the upgrade attempt. If
    the upgrade was ignored, srx_service will match what was set in the
    sendmsg(); if the upgrade happened the srx_service will be altered to
    indicate the service the server upgraded to.

    Note that:

    (1) The choice of upgrade service is up to the server

    (2) Further client calls to the same server that would share a connection
    are blocked if an upgrade probe is in progress.

    (3) This should only be used to probe the service. Clients should then
    use the returned service ID in all subsequent communications with that
    server (and not set the upgrade). Note that the kernel will not
    retain this information should the connection expire from its cache.

    (4) If a server that supports upgrading is replaced by one that doesn't,
    whilst a connection is live, and if the replacement is running, say,
    OpenAFS 1.6.4 or older or an older IBM AFS, then the replacement
    server will not respond to packets sent to the upgraded connection.

    At this point, calls will time out and the server must be reprobed.

    Signed-off-by: David Howells

    David Howells
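
A minimal sketch of the client side of this probe, in Python. It only builds the ancillary-data list in the (level, type, data) form that `socket.sendmsg()` accepts and compares service IDs afterwards; no socket is opened. The numeric constants are assumptions believed to match `<linux/rxrpc.h>` and should be checked against the installed headers, the inclusion of a user call ID is likewise an assumption, and the helper names are hypothetical.

```python
import struct

# Assumed values, believed to match <linux/rxrpc.h>; verify before use.
SOL_RXRPC = 272
RXRPC_USER_CALL_ID = 1
RXRPC_UPGRADE_SERVICE = 11


def first_sendmsg_ancdata(call_id, probe_upgrade):
    """Build the ancillary-data list for the first sendmsg() of a call."""
    ancdata = [(SOL_RXRPC, RXRPC_USER_CALL_ID, struct.pack("@L", call_id))]
    if probe_upgrade:
        # The upgrade request takes no parameters, hence the empty payload.
        ancdata.append((SOL_RXRPC, RXRPC_UPGRADE_SERVICE, b""))
    return ancdata


def upgrade_happened(requested_service, returned_service):
    """Compare srx_service from recvmsg()'s msg_name with the requested one."""
    return returned_service != requested_service


probe = first_sendmsg_ancdata(call_id=1, probe_upgrade=True)
```

Following note (3), a client would cache the returned service ID and call `first_sendmsg_ancdata(..., probe_upgrade=False)` for later calls to the same server.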
     

01 Jun, 2017

1 commit


24 May, 2017

2 commits

  • Split the DATA/NODE type bio caches by temperature, so that write
    IOs with the same temperature can be merged in the corresponding bio
    cache as much as possible. Otherwise, write IOs of different
    temperatures submitted into one bio cache will always cause the bio
    to be split.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
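
The effect of the split can be illustrated with a toy model in Python. It assumes, purely for illustration, that a write merges into the open bio of its cache only when it continues at the next block address, and that hot and cold writes target different block ranges; `count_bios` and the sample addresses are hypothetical and are not f2fs code.

```python
def count_bios(writes, key):
    """Count the bios needed when writes are merged per cache.

    Each write is (page_type, temperature, block_address). A write merges
    into the open bio of its cache only if it continues at the next block
    address; otherwise a new bio is started for that cache.
    """
    last_block = {}  # cache key -> last block address in the open bio
    bios = 0
    for write in writes:
        cache = key(write)
        block = write[2]
        if last_block.get(cache) != block - 1:
            bios += 1  # cannot merge, so a new bio is started
        last_block[cache] = block
    return bios


# Interleaved hot and cold data writes landing in different block ranges.
writes = [
    ("DATA", "HOT", 100),
    ("DATA", "COLD", 500),
    ("DATA", "HOT", 101),
    ("DATA", "COLD", 501),
]

per_type = count_bios(writes, key=lambda w: w[0])
per_type_and_temp = count_bios(writes, key=lambda w: (w[0], w[1]))
```

With one cache per type every write starts a new bio, while one cache per type and temperature lets the two hot writes and the two cold writes each merge.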
     
  • Merged IO flow doesn't need to care about read IOs.

    f2fs_submit_merged_bio -> f2fs_submit_merged_write
    f2fs_submit_merged_bios -> f2fs_submit_merged_writes
    f2fs_submit_merged_bio_cond -> f2fs_submit_merged_write_cond

    Signed-off-by: Jaegeuk Kim

    Jaegeuk Kim
     

13 May, 2017

1 commit

  • Pull thermal management updates from Zhang Rui:

    - Fix a problem where orderly_poweroff() is called multiple times
    when the platform thermal driver raises several critical
    overheating events in a short period. (Keerthy)

    - Introduce a backup thermal shutdown mechanism, which invokes
    kernel_power_off()/emergency_restart() directly once a certain
    amount of time (specified via Kconfig) has passed since
    orderly_poweroff() was issued. This is useful when userspace is
    unable to power off the system cleanly and leaves it in a critical
    state, for example in the middle of the driver probing phase.
    (Keerthy)

    - Introduce a new interface in thermal devfreq_cooling code so that the
    driver can provide more precise data regarding actual power to the
    thermal governor every time the power budget is calculated. (Lukasz
    Luba)

    - Introduce BCM 2835 soc thermal driver and northstar thermal driver,
    within a new sub-folder. (Rafał Miłecki)

    - Introduce DA9062/61 thermal driver. (Steve Twiss)

    - Remove non-DT booting on TI-SoC driver. Also add support for fetching
    coefficients from DT. (Keerthy)

    - Refactor RCAR Gen3 thermal driver. (Niklas Söderlund)

    - Small fix on MTK and intel-soc-dts thermal driver. (Dawei Chien,
    Brian Bian)

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (25 commits)
    thermal: core: Add a back up thermal shutdown mechanism
    thermal: core: Allow orderly_poweroff to be called only once
    Thermal: Intel SoC DTS: Change interrupt request behavior
    trace: thermal: add another parameter 'power' to the tracing function
    thermal: devfreq_cooling: add new interface for direct power read
    thermal: devfreq_cooling: refactor code and add get_voltage function
    thermal: mt8173: minor mtk_thermal.c cleanups
    thermal: bcm2835: move to the broadcom subdirectory
    thermal: broadcom: ns: specify myself as MODULE_AUTHOR
    thermal: da9062/61: Thermal junction temperature monitoring driver
    Documentation: devicetree: thermal: da9062/61 TJUNC temperature binding
    thermal: broadcom: add Northstar thermal driver
    dt-bindings: thermal: add support for Broadcom's Northstar thermal
    thermal: bcm2835: add thermal driver for bcm2835 SoC
    dt-bindings: Add thermal zone to bcm2835-thermal example
    thermal: rcar_gen3_thermal: add suspend and resume support
    thermal: rcar_gen3_thermal: store device match data in private structure
    thermal: rcar_gen3_thermal: enable hardware interrupts for trip points
    thermal: rcar_gen3_thermal: record and check number of TSCs found
    thermal: rcar_gen3_thermal: check that TSC exists before memory allocation
    ...

    Linus Torvalds
     

10 May, 2017

2 commits

  • Pull btrfs updates from Chris Mason:
    "This has fixes and cleanups Dave Sterba collected for the merge
    window.

    The biggest functional fixes are between btrfs raid5/6 and scrub, and
    raid5/6 and device replacement. Some of our pending qgroup fixes are
    included as well while I bash on the rest in testing.

    We also have the usual set of cleanups, including one that makes
    __btrfs_map_block() much more maintainable, and conversions from
    atomic_t to refcount_t"

    * 'for-linus-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (71 commits)
    btrfs: fix the gfp_mask for the reada_zones radix tree
    Btrfs: fix reported number of inode blocks
    Btrfs: send, fix file hole not being preserved due to inline extent
    Btrfs: fix extent map leak during fallocate error path
    Btrfs: fix incorrect space accounting after failure to insert inline extent
    Btrfs: fix invalid attempt to free reserved space on failure to cow range
    btrfs: Handle delalloc error correctly to avoid ordered extent hang
    btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error
    btrfs: check if the device is flush capable
    btrfs: delete unused member nobarriers
    btrfs: scrub: Fix RAID56 recovery race condition
    btrfs: scrub: Introduce full stripe lock for RAID56
    btrfs: Use ktime_get_real_ts for root ctime
    Btrfs: handle only applicable errors returned by btrfs_get_extent
    btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option
    btrfs: use q which is already obtained from bdev_get_queue
    Btrfs: switch to div64_u64 if with a u64 divisor
    Btrfs: update scrub_parity to use u64 stripe_len
    Btrfs: enable repair during read for raid56 profile
    btrfs: use clear_page where appropriate
    ...

    Linus Torvalds
     
  • Pull IOMMU updates from Joerg Roedel:

    - code optimizations for the Intel VT-d driver

    - ability to switch off a previously enabled Intel IOMMU

    - support for 'struct iommu_device' for OMAP, Rockchip and Mediatek
    IOMMUs

    - header optimizations for IOMMU core code headers and a few fixes that
    became necessary in other parts of the kernel because of that

    - ACPI/IORT updates and fixes

    - Exynos IOMMU optimizations

    - updates for the IOMMU dma-api code to bring it closer to use per-cpu
    iova caches

    - new command-line option to set default domain type allocated by the
    iommu core code

    - another command-line option to allow the Intel IOMMU to be switched
    off in a tboot environment

    - ARM/SMMU: TLB sync optimisations for SMMUv2, Support for using an
    IDENTITY domain in conjunction with DMA ops, Support for SMR masking,
    Support for 16-bit ASIDs (was previously broken)

    - various other small fixes and improvements

    * tag 'iommu-updates-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (63 commits)
    soc/qbman: Move dma-mapping.h include to qman_priv.h
    soc/qbman: Fix implicit header dependency now causing build fails
    iommu: Remove trace-events include from iommu.h
    iommu: Remove pci.h include from trace/events/iommu.h
    arm: dma-mapping: Don't override dma_ops in arch_setup_dma_ops()
    ACPI/IORT: Fix CONFIG_IOMMU_API dependency
    iommu/vt-d: Don't print the failure message when booting non-kdump kernel
    iommu: Move report_iommu_fault() to iommu.c
    iommu: Include device.h in iommu.h
    x86, iommu/vt-d: Add an option to disable Intel IOMMU force on
    iommu/arm-smmu: Return IOVA in iova_to_phys when SMMU is bypassed
    iommu/arm-smmu: Correct sid to mask
    iommu/amd: Fix incorrect error handling in amd_iommu_bind_pasid()
    iommu: Make iommu_bus_notifier return NOTIFY_DONE rather than error code
    omap3isp: Remove iommu_group related code
    iommu/omap: Add iommu-group support
    iommu/omap: Make use of 'struct iommu_device'
    iommu/omap: Store iommu_dev pointer in arch_data
    iommu/omap: Move data structures to omap-iommu.h
    iommu/omap: Drop legacy-style device support
    ...

    Linus Torvalds
     

09 May, 2017

8 commits

  • Merge more updates from Andrew Morton:

    - the rest of MM

    - various misc things

    - procfs updates

    - lib/ updates

    - checkpatch updates

    - kdump/kexec updates

    - add kvmalloc helpers, use them

    - time helper updates for Y2038 issues. We're almost ready to remove
    current_fs_time() but that awaits a btrfs merge.

    - add tracepoints to DAX

    * emailed patches from Andrew Morton: (114 commits)
    drivers/staging/ccree/ssi_hash.c: fix build with gcc-4.4.4
    selftests/vm: add a test for virtual address range mapping
    dax: add tracepoint to dax_insert_mapping()
    dax: add tracepoint to dax_writeback_one()
    dax: add tracepoints to dax_writeback_mapping_range()
    dax: add tracepoints to dax_load_hole()
    dax: add tracepoints to dax_pfn_mkwrite()
    dax: add tracepoints to dax_iomap_pte_fault()
    mtd: nand: nandsim: convert to memalloc_noreclaim_*()
    treewide: convert PF_MEMALLOC manipulations to new helpers
    mm: introduce memalloc_noreclaim_{save,restore}
    mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
    mm/huge_memory.c: deposit a pgtable for DAX PMD faults when required
    mm/huge_memory.c: use zap_deposited_table() more
    time: delete CURRENT_TIME_SEC and CURRENT_TIME
    gfs2: replace CURRENT_TIME with current_time
    apparmorfs: replace CURRENT_TIME with current_time()
    lustre: replace CURRENT_TIME macro
    fs: ubifs: replace CURRENT_TIME_SEC with current_time
    fs: ufs: use ktime_get_real_ts64() for birthtime
    ...

    Linus Torvalds
     
  • Add a tracepoint to dax_insert_mapping(), following the same logging
    conventions as the rest of DAX. This tracepoint, along with the one in
    dax_load_hole(), lets us know how a DAX PTE fault was serviced.

    Here is an example DAX fault that inserts a PTE mapping:

    small-1126 [007] .... 145.451604: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220
    small-1126 [007] .... 145.452317: dax_insert_mapping: dev 259:0 ino 0x1003 shared write address 0x10420000 radix_entry 0x100006
    small-1126 [007] .... 145.452399: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220 MAJOR|NOPAGE

    Link: http://lkml.kernel.org/r/20170221195116.13278-7-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Dan Williams
    Cc: Ingo Molnar
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
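
To show how these events can be combined to tell how a PTE fault was serviced, here is a small Python sketch that scans trace lines and labels each completed fault as backed by storage (a dax_insert_mapping event was seen) or by a hole (a dax_load_hole event was seen). The function names are hypothetical, and the sketch assumes the events of one fault are not interleaved with those of another task or CPU.

```python
import re

# Matches "<timestamp>: <event>: <rest>" anywhere in a trace line.
EVENT_RE = re.compile(r"(?P<ts>\d+\.\d+):\s+(?P<event>\w+):\s+(?P<body>.*)")


def parse_event(line):
    """Return (event, body) for a trace line, or None if it does not match."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    return match.group("event"), match.group("body")


def classify_pte_faults(lines):
    """Label each completed PTE fault as 'storage', 'hole' or 'unknown'."""
    results = []
    serviced_by = None  # None means no PTE fault is currently open
    for line in lines:
        parsed = parse_event(line)
        if parsed is None:
            continue
        event, _body = parsed
        if event == "dax_pte_fault":
            serviced_by = "unknown"
        elif serviced_by is None:
            continue
        elif event == "dax_insert_mapping":
            serviced_by = "storage"
        elif event == "dax_load_hole":
            serviced_by = "hole"
        elif event == "dax_pte_fault_done":
            results.append(serviced_by)
            serviced_by = None
    return results


SAMPLE = [
    "small-1126 [007] .... 145.451604: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220",
    "small-1126 [007] .... 145.452317: dax_insert_mapping: dev 259:0 ino 0x1003 shared write address 0x10420000 radix_entry 0x100006",
    "small-1126 [007] .... 145.452399: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220 MAJOR|NOPAGE",
    "read-1075 [002] .... 62.362108: dax_pte_fault: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280",
    "read-1075 [002] .... 62.362140: dax_load_hole: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE",
    "read-1075 [002] .... 62.362141: dax_pte_fault_done: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE",
]

kinds = classify_pte_faults(SAMPLE)
```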
     
  • Add a tracepoint to dax_writeback_one(), following the same logging
    conventions as the rest of DAX.

    Here is an example range writeback which ends up flushing one PMD and
    one PTE:

    test-1265 [003] .... 496.615250: dax_writeback_range: dev 259:0 ino 0x1003 pgoff 0x0-0x7ffffffffffff
    test-1265 [003] .... 496.616263: dax_writeback_one: dev 259:0 ino 0x1003 pgoff 0x0 pglen 0x200
    test-1265 [003] .... 496.616270: dax_writeback_one: dev 259:0 ino 0x1003 pgoff 0x305 pglen 0x1
    test-1265 [003] .... 496.616272: dax_writeback_range_done: dev 259:0 ino 0x1003 pgoff 0x0-0x7ffffffffffff

    [akpm@linux-foundation.org: struct blk_dax_ctl has disappeared]
    Link: http://lkml.kernel.org/r/20170221195116.13278-6-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Dan Williams
    Cc: Ingo Molnar
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
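
The PMD and the PTE in the example above can be told apart by the pglen field. The Python sketch below assumes 4 KiB pages and 2 MiB PMDs, so that a PMD entry covers 0x200 pages and a PTE entry covers one; the helper name `entry_kind` is hypothetical.

```python
import re

# Assumption: 4 KiB pages and 2 MiB PMDs, i.e. 512 (0x200) pages per PMD.
PAGES_PER_PMD = 0x200

PGLEN_RE = re.compile(r"dax_writeback_one: .*pglen (0x[0-9a-fA-F]+)")


def entry_kind(line):
    """Classify a dax_writeback_one trace line by the size of the flush."""
    match = PGLEN_RE.search(line)
    if match is None:
        return None  # not a dax_writeback_one event
    pglen = int(match.group(1), 16)
    if pglen == PAGES_PER_PMD:
        return "PMD"
    if pglen == 1:
        return "PTE"
    return "other"


pmd_line = "test-1265 [003] .... 496.616263: dax_writeback_one: dev 259:0 ino 0x1003 pgoff 0x0 pglen 0x200"
pte_line = "test-1265 [003] .... 496.616270: dax_writeback_one: dev 259:0 ino 0x1003 pgoff 0x305 pglen 0x1"
kinds = [entry_kind(pmd_line), entry_kind(pte_line)]
```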
     
  • Add tracepoints to dax_writeback_mapping_range(), following the same
    logging conventions as the rest of DAX.

    Here is an example writeback call:

    msync-1085 [006] .... 200.902565: dax_writeback_range: dev 259:0 ino 0x1003 pgoff 0x200-0x2ff
    msync-1085 [006] .... 200.902579: dax_writeback_range_done: dev 259:0 ino 0x1003 pgoff 0x200-0x2ff

    [ross.zwisler@linux.intel.com: fix regression in dax_writeback_mapping_range()]
    Link: http://lkml.kernel.org/r/20170314215358.31451-1-ross.zwisler@linux.intel.com
    Link: http://lkml.kernel.org/r/20170221195116.13278-5-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Dan Williams
    Cc: Ingo Molnar
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Add tracepoints to dax_load_hole(), following the same logging conventions
    as the rest of DAX.

    Here is the logging generated by a PTE read from a hole:

    read-1075 [002] .... 62.362108: dax_pte_fault: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280
    read-1075 [002] .... 62.362140: dax_load_hole: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE
    read-1075 [002] .... 62.362141: dax_pte_fault_done: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE

    Link: http://lkml.kernel.org/r/20170221195116.13278-4-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Dan Williams
    Cc: Ingo Molnar
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Add tracepoints to dax_pfn_mkwrite(), following the same logging
    conventions as the rest of DAX.

    Here is an example PTE fault followed by a pfn_mkwrite:

    small_aligned-1094 [002] .... 374.084998: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200
    small_aligned-1094 [002] .... 374.085145: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200 MAJOR|NOPAGE
    small_aligned-1094 [002] .... 374.085165: dax_pfn_mkwrite: dev 259:0 ino 0x1003 shared WRITE|MKWRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200 NOPAGE

    Link: http://lkml.kernel.org/r/20170221195116.13278-3-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Dan Williams
    Cc: Ingo Molnar
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Patch series "second round of tracepoints for DAX".

    This second round of DAX tracepoint patches adds tracing to the PTE
    fault path (dax_iomap_pte_fault(), dax_pfn_mkwrite(), dax_load_hole(),
    dax_insert_mapping()) and to the writeback path
    (dax_writeback_mapping_range(), dax_writeback_one()).

    The purpose of this tracing is to give us a high level view of what DAX
    is doing, whether faults are being serviced by PMDs or PTEs, and by real
    storage or by zero pages covering holes.

    I do have some patches nearly ready which also add tracing to
    grab_mapping_entry() and dax_insert_mapping_entry(). These are more
    targeted at logging how we are interacting with the radix tree, how we
    use empty entries for locking, whether we "downgrade" huge zero pages to
    4k PTE sized allocations, etc. In the end it seemed to me that this
    might be too detailed to have as constantly present tracepoints, but if
    anyone sees value in having tracepoints like this in the DAX code
    permanently (Jan?), please let me know and I'll add those last two
    patches.

    All these tracepoints were done to be consistent with the style of the
    XFS tracepoints and with the existing DAX PMD tracepoints.

    This patch (of 6):

    Add tracepoints to dax_iomap_pte_fault(), following the same logging
    conventions as the rest of DAX.

    Here is an example fault that initially tries to be serviced by the PMD
    fault handler but which falls back to PTEs because the VMA isn't large
    enough to hold a PMD:

    small-1086 [005] .... 71.140014: xfs_filemap_huge_fault: dev 259:0 ino 0x1003
    small-1086 [005] .... 71.140027: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x1400
    small-1086 [005] .... 71.140028: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x1400 FALLBACK
    small-1086 [005] .... 71.140035: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220
    small-1086 [005] .... 71.140396: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220 MAJOR|NOPAGE

    Link: http://lkml.kernel.org/r/20170221195116.13278-2-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Cc: Dan Williams
    Cc: Ingo Molnar
    Cc: Matthew Wilcox
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
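
The fallback in the example trace can be checked by hand from the logged values. The Python sketch below assumes a 2 MiB PMD size and treats the condition as "the PMD-aligned range around the faulting address must lie entirely inside the VMA"; the helper `pmd_fits` is hypothetical and models only this one reason for falling back.

```python
PMD_SIZE = 2 << 20  # assumption: 2 MiB PMDs


def pmd_fits(address, vm_start, vm_end):
    """Return True if the PMD-aligned range around address lies in the VMA."""
    pmd_start = address & ~(PMD_SIZE - 1)  # round down to a PMD boundary
    return pmd_start >= vm_start and pmd_start + PMD_SIZE <= vm_end


# Values taken from the dax_pmd_fault line in the example trace.
fits = pmd_fits(address=0x10420000, vm_start=0x10200000, vm_end=0x10500000)
```

Here the faulting address rounds down to 0x10400000, so the PMD would span 0x10400000-0x10600000 and run past vm_end at 0x10500000, which is consistent with the FALLBACK result in the trace.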
     
  • Pull f2fs updates from Jaegeuk Kim:
    "In this round, we've focused on enhancing performance with regards to
    block allocation, GC, and discard/in-place-update IO controls. There
    are a bunch of clean-ups as well as minor bug fixes.

    Enhancements:
    - disable heap-based allocation by default
    - issue small-sized discard commands by default
    - change the policy of data hotness for logging
    - distinguish IOs in terms of size and wbc type
    - start SSR earlier to avoid foreground GC
    - enhance data structures managing discard commands
    - enhance in-place update flow
    - add some more fault injection routines
    - secure one more xattr entry

    Bug fixes:
    - calculate victim cost for GC correctly
    - keep the correct victim segment number for GC
    - race condition in nid allocator and initializer
    - stale pointer produced by atomic_writes
    - fix missing REQ_SYNC for flush commands
    - handle missing errors in more corner cases"

    * tag 'for-f2fs-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (111 commits)
    f2fs: fix a mount fail for wrong next_scan_nid
    f2fs: enhance scalability of trace macro
    f2fs: relocate inode_{,un}lock in F2FS_IOC_SETFLAGS
    f2fs: Make flush bios explicitely sync
    f2fs: show available_nids in f2fs/status
    f2fs: flush dirty nats periodically
    f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard
    f2fs: allow cpc->reason to indicate more than one reason
    f2fs: release cp and dnode lock before IPU
    f2fs: shrink size of struct discard_cmd
    f2fs: don't hold cmd_lock during waiting discard command
    f2fs: nullify fio->encrypted_page for each writes
    f2fs: sanity check segment count
    f2fs: introduce valid_ipu_blkaddr to clean up
    f2fs: lookup extent cache first under IPU scenario
    f2fs: reconstruct code to write a data page
    f2fs: introduce __wait_discard_cmd
    f2fs: introduce __issue_discard_cmd
    f2fs: enable small discard by default
    f2fs: delay awaking discard thread
    ...

    Linus Torvalds