17 Sep, 2013

1 commit

  • Pull timer code updates from Thomas Gleixner:
    - armada SoC clocksource overhaul with a trivial merge conflict
    - Minor improvements to various SoC clocksource drivers

    * 'timers/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    clocksource: armada-370-xp: Add detailed clock requirements in devicetree binding
    clocksource: armada-370-xp: Get reference fixed-clock by name
    clocksource: armada-370-xp: Replace WARN_ON with BUG_ON
    clocksource: armada-370-xp: Fix device-tree binding
    clocksource: armada-370-xp: Introduce new compatibles
    clocksource: armada-370-xp: Use CLOCKSOURCE_OF_DECLARE
    clocksource: armada-370-xp: Simplify TIMER_CTRL register access
    clocksource: armada-370-xp: Use BIT()
    ARM: timer-sp: Set dynamic irq affinity
    ARM: nomadik: add dynamic irq flag to the timer
    clocksource: sh_cmt: 32-bit control register support
    clocksource: em_sti: Convert to devm_* managed helpers

    Linus Torvalds
     

16 Sep, 2013

1 commit

  • Pull misc SCSI driver updates from James Bottomley:
    "This patch set is a set of driver updates (megaraid_sas, fnic, lpfc,
    ufs, hpsa) we also have a couple of bug fixes (sd out of bounds and
    ibmvfc error handling) and the first round of esas2r checker fixes and
    finally the much anticipated big endian additions for megaraid_sas"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (47 commits)
    [SCSI] fnic: fnic Driver Tuneables Exposed through CLI
    [SCSI] fnic: Kernel panic while running sh/nosh with max lun cfg
    [SCSI] fnic: Hitting BUG_ON(io_req->abts_done) in fnic_rport_exch_reset
    [SCSI] fnic: Remove QUEUE_FULL handling code
    [SCSI] fnic: On system with >1.1TB RAM, VIC fails multipath after boot up
    [SCSI] fnic: FC stat param seconds_since_last_reset not getting updated
    [SCSI] sd: Fix potential out-of-bounds access
    [SCSI] lpfc 8.3.42: Update lpfc version to driver version 8.3.42
    [SCSI] lpfc 8.3.42: Fixed issue of task management commands having a fixed timeout
    [SCSI] lpfc 8.3.42: Fixed inconsistent spin lock usage.
    [SCSI] lpfc 8.3.42: Fix driver's abort loop functionality to skip IOs already getting aborted
    [SCSI] lpfc 8.3.42: Fixed failure to allocate SCSI buffer on PPC64 platform for SLI4 devices
    [SCSI] lpfc 8.3.42: Fix WARN_ON when driver unloads
    [SCSI] lpfc 8.3.42: Avoided making pci bar ioremap call during dual-chute WQ/RQ pci bar selection
    [SCSI] lpfc 8.3.42: Fixed driver iocbq structure's iocb_flag field running out of space
    [SCSI] lpfc 8.3.42: Fix crash on driver load due to cpu affinity logic
    [SCSI] lpfc 8.3.42: Fixed logging format of setting driver sysfs attributes hard to interpret
    [SCSI] lpfc 8.3.42: Fixed back to back RSCNs discovery failure.
    [SCSI] lpfc 8.3.42: Fixed race condition between BSG I/O dispatch and timeout handling
    [SCSI] lpfc 8.3.42: Fixed function mode field defined too small for not recognizing dual-chute mode
    ...

    Linus Torvalds
     

15 Sep, 2013

2 commits

  • Pull SLAB update from Pekka Enberg:
    "Nothing terribly exciting here apart from Christoph's kmalloc
    unification patches that brings sl[aou]b implementations closer to
    each other"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    slab: Use correct GFP_DMA constant
    slub: remove verify_mem_not_deleted()
    mm/sl[aou]b: Move kmallocXXX functions to common code
    mm, slab_common: add 'unlikely' to size check of kmalloc_slab()
    mm/slub.c: beautify code for removing redundancy 'break' statement.
    slub: Remove unnecessary page NULL check
    slub: don't use cpu partial pages on UP
    mm/slub: beautify code for 80 column limitation and tab alignment
    mm/slub: remove 'per_cpu' which is useless variable

    Linus Torvalds
     
  • Pull input update from Dmitry Torokhov:
    "The only change is David Hermann's new EVIOCREVOKE evdev ioctl that
    allows safely passing file descriptors to input devices to session
    processes and later being able to stop delivery of events through
    these fds so that inactive sessions will no longer receive user input
    that does not belong to them"
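
    As an illustration, a session manager holding the master fd could
    revoke a copy it handed to a session roughly like this (a hedged
    user-space sketch; error handling trimmed):

    #include <sys/ioctl.h>
    #include <linux/input.h>

    /* Sketch: cut off event delivery on an evdev fd that was passed to
     * a now-inactive session, without having to close the fd. */
    static int revoke_evdev(int fd)
    {
            /* EVIOCREVOKE takes a NULL argument; afterwards the fd
             * stays open but delivers no further events. */
            return ioctl(fd, EVIOCREVOKE, NULL);
    }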

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
    Input: evdev - add EVIOCREVOKE ioctl

    Linus Torvalds
     

14 Sep, 2013

2 commits

  • Pull writeback fix from Wu Fengguang:
    "A trivial writeback fix"

    * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Do not sort b_io list only because of block device inode

    Linus Torvalds
     
  • Pull aio changes from Ben LaHaise:
    "First off, sorry for this pull request being late in the merge window.
    Al had raised a couple of concerns about 2 items in the series below.
    I addressed the first issue (the race introduced by Gu's use of
    mm_populate()), but he has not provided any further details on how he
    wants to rework the anon_inode.c changes (which were sent out months
    ago but have yet to be commented on).

    The bulk of the changes have been sitting in the -next tree for a few
    months, with all the issues raised being addressed"

    * git://git.kvack.org/~bcrl/aio-next: (22 commits)
    aio: rcu_read_lock protection for new rcu_dereference calls
    aio: fix race in ring buffer page lookup introduced by page migration support
    aio: fix rcu sparse warnings introduced by ioctx table lookup patch
    aio: remove unnecessary debugging from aio_free_ring()
    aio: table lookup: verify ctx pointer
    staging/lustre: kiocb->ki_left is removed
    aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
    aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
    aio: convert the ioctx list to table lookup v3
    aio: double aio_max_nr in calculations
    aio: Kill ki_dtor
    aio: Kill ki_users
    aio: Kill unneeded kiocb members
    aio: Kill aio_rw_vect_retry()
    aio: Don't use ctx->tail unnecessarily
    aio: io_cancel() no longer returns the io_event
    aio: percpu ioctx refcount
    aio: percpu reqs_available
    aio: reqs_active -> reqs_available
    aio: fix build when migration is disabled
    ...

    Linus Torvalds
     

13 Sep, 2013

23 commits

  • After the last architecture switched to generic hard irqs, the config
    options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code
    for !CONFIG_GENERIC_HARDIRQS can be removed.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • Pull SCSI target updates from Nicholas Bellinger:
    "Lots of activity again this round for I/O performance optimizations
    (per-cpu IDA pre-allocation for vhost + iscsi/target), and the
    addition of new fabric independent features to target-core
    (COMPARE_AND_WRITE + EXTENDED_COPY).

    The main highlights include:

    - Support for iscsi-target login multiplexing across individual
    network portals
    - Generic Per-cpu IDA logic (kent + akpm + clameter)
    - Conversion of vhost to use per-cpu IDA pre-allocation for
    descriptors, SGLs and userspace page pointer list
    - Conversion of iscsi-target + iser-target to use per-cpu IDA
    pre-allocation for descriptors
    - Add support for generic COMPARE_AND_WRITE (AtomicTestandSet)
    emulation for virtual backend drivers
    - Add support for generic EXTENDED_COPY (CopyOffload) emulation for
    virtual backend drivers.
    - Add support for fast memory registration mode to iser-target (Vu)

    The patches to add COMPARE_AND_WRITE and EXTENDED_COPY support are of
    particular significance, which make us the first and only open source
    target to support the full set of VAAI primitives.

    Currently Linux clients lack the upstream support to actually
    utilize these primitives. However, with server-side support now in
    place and folks like MKP + ZAB working on the client, this logic,
    once reserved for the highest end of storage arrays, can now be run
    in VMs on their laptops"

    * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: (50 commits)
    target/iscsi: Bump versions to v4.1.0
    target: Update copyright ownership/year information to 2013
    iscsi-target: Bump default TCP listen backlog to 256
    target: Fix >= v3.9+ regression in PR APTPL + ALUA metadata write-out
    iscsi-target: Bump default CmdSN Depth to 64
    iscsi-target: Remove unnecessary wait_for_completion in iscsi_get_thread_set
    iscsi-target: Add thread_set->ts_activate_sem + use common deallocate
    iscsi-target: Fix race with thread_pre_handler flush_signals + ISCSI_THREAD_SET_DIE
    target: remove unused including <linux/version.h>
    iser-target: introduce fast memory registration mode (FRWR)
    iser-target: generalize rdma memory registration and cleanup
    iser-target: move rdma wr processing to a shared function
    target: Enable global EXTENDED_COPY setup/release
    target: Add Third Party Copy (3PC) bit in INQUIRY response
    target: Enable EXTENDED_COPY setup in spc_parse_cdb
    target: Add support for EXTENDED_COPY copy offload emulation
    target: Avoid non-existent tg_pt_gp_mem in target_alua_state_check
    target: Add global device list for EXTENDED_COPY
    target: Make helpers non static for EXTENDED_COPY command setup
    target: Make spc_parse_naa_6h_vendor_specific non static
    ...

    Linus Torvalds
     
  • Merge more patches from Andrew Morton:
    "The rest of MM. Plus one misc cleanup"

    * emailed patches from Andrew Morton: (35 commits)
    mm/Kconfig: add MMU dependency for MIGRATION.
    kernel: replace strict_strto*() with kstrto*()
    mm, thp: count thp_fault_fallback anytime thp fault fails
    thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
    thp: do_huge_pmd_anonymous_page() cleanup
    thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
    mm: cleanup add_to_page_cache_locked()
    thp: account anon transparent huge pages into NR_ANON_PAGES
    truncate: drop 'oldsize' truncate_pagecache() parameter
    mm: make lru_add_drain_all() selective
    memcg: document cgroup dirty/writeback memory statistics
    memcg: add per cgroup writeback pages accounting
    memcg: check for proper lock held in mem_cgroup_update_page_stat
    memcg: remove MEMCG_NR_FILE_MAPPED
    memcg: reduce function dereference
    memcg: avoid overflow caused by PAGE_ALIGN
    memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
    memcg: correct RESOURCE_MAX to ULLONG_MAX
    mm: memcg: do not trap chargers with full callstack on OOM
    mm: memcg: rework and document OOM waiting and wakeup
    ...

    Linus Torvalds
     
  • do_huge_pmd_anonymous_page() has a copy-pasted piece of
    handle_mm_fault() to handle its fallback path.

    Let's consolidate the code back by introducing a VM_FAULT_FALLBACK
    return code.
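
    A rough sketch of the resulting control flow in the fault path (a
    simplified rendering, not the literal upstream code):

    /* Sketch: the THP path signals "fall back" instead of open-coding
     * the small-page fault handling itself. */
    if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
            ret = do_huge_pmd_anonymous_page(mm, vma, address, pmd, flags);
            if (!(ret & VM_FAULT_FALLBACK))
                    return ret;     /* handled (or failed) as huge page */
            /* otherwise fall through to the regular PTE path */
    }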

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: Al Viro
    Cc: Hugh Dickins
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • truncate_pagecache() doesn't care about the old size since commit
    cedabed49b39 ("vfs: Fix vmtruncate() regression"). Let's drop it.

    Signed-off-by: Kirill A. Shutemov
    Cc: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Make lru_add_drain_all() only selectively interrupt the cpus that have
    per-cpu free pages that can be drained.

    This is important in nohz mode where calling mlockall(), for example,
    would otherwise interrupt every core unnecessarily.

    This is important on workloads where nohz cores are handling 10 Gb traffic
    in userspace. Those CPUs do not enter the kernel and place pages into LRU
    pagevecs and they really, really don't want to be interrupted, or they
    drop packets on the floor.
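
    The core idea, as a hedged sketch (simplified from the patch; the
    per-cpu variable names follow mm/swap.c of that era):

    static struct cpumask has_work;
    int cpu;

    /* Only queue drain work on CPUs that actually have pending
     * pagevec pages, then wait for just those CPUs. */
    cpumask_clear(&has_work);
    for_each_online_cpu(cpu) {
            struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

            if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
                pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
                need_activate_page_drain(cpu)) {
                    INIT_WORK(work, lru_add_drain_per_cpu);
                    schedule_work_on(cpu, work);
                    cpumask_set_cpu(cpu, &has_work);
            }
    }
    for_each_cpu(cpu, &has_work)
            flush_work(&per_cpu(lru_add_drain_work, cpu));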

    Signed-off-by: Chris Metcalf
    Reviewed-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     
  • Add memcg routines to count writeback pages; later, dirty pages will
    also be accounted.

    After Kame's commit 89c06bd52fb9 ("memcg: use new logic for page stat
    accounting"), we can use 'struct page' flags to test page state instead
    of a per-page_cgroup flag. But memcg has a feature to move a page from
    one cgroup to another and may have a race between "move" and "page stat
    accounting". So in order to avoid the race we have designed a new lock:

      mem_cgroup_begin_update_page_stat()
      modify page information        -->(a)
      mem_cgroup_update_page_stat()  -->(b)
      mem_cgroup_end_update_page_stat()

    It requires both (a) and (b) (the writeback pages accounting) to be
    protected by mem_cgroup_{begin/end}_update_page_stat(). It's a full
    no-op for !CONFIG_MEMCG, almost a no-op if memcg is disabled (but
    compiled in), an rcu read lock in most cases (no task is moving), and
    spin_lock_irqsave on top in the slow path.

    There are two writeback interfaces to modify:
    test_{clear/set}_page_writeback(). The lock order is:

      memcg->move_lock
        --> mapping->tree_lock
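
    As a hedged sketch of the writeback side (modeled on
    test_set_page_writeback(); the begin/end and stat-update helper
    signatures follow the 3.12-era API and are assumptions here):

    bool locked;
    unsigned long flags;

    /* Hold the memcg page-stat lock around both the flag change and
     * the counter update so a page move cannot race with us. */
    mem_cgroup_begin_update_page_stat(page, &locked, &flags);
    ret = TestSetPageWriteback(page);                            /* (a) */
    if (!ret)
            mem_cgroup_inc_page_stat(page,
                                     MEM_CGROUP_STAT_WRITEBACK); /* (b) */
    mem_cgroup_end_update_page_stat(page, &locked, &flags);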

    Signed-off-by: Sha Zhengju
    Acked-by: Michal Hocko
    Reviewed-by: Greg Thelen
    Cc: Fengguang Wu
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • While accounting memcg page stats, it's not worth using
    MEMCG_NR_FILE_MAPPED as an extra layer of indirection because of the
    complexity and presumed performance overhead. We can use
    MEM_CGROUP_STAT_FILE_MAPPED directly.

    Signed-off-by: Sha Zhengju
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Fengguang Wu
    Reviewed-by: Greg Thelen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • RESOURCE_MAX is far too general a name; change it to RES_COUNTER_MAX.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Jeff Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • The current RESOURCE_MAX is ULONG_MAX, but the value we use to set a
    resource limit is unsigned long long, so we can set a bigger value than
    that, which is strange. An XXX_MAX should be a reasonable maximum
    value; anything bigger should be an overflow.

    Notice that this change will affect user output of default *.limit_in_bytes:
    before change:

    $ cat /cgroup/memory/memory.limit_in_bytes
    9223372036854775807

    after change:

    $ cat /cgroup/memory/memory.limit_in_bytes
    18446744073709551615

    But it doesn't alter the API in terms of input - we can still use
    "echo -1 > *.limit_in_bytes" to reset the numbers to "unlimited".

    Signed-off-by: Sha Zhengju
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Jeff Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • The memcg OOM handling is incredibly fragile and can deadlock. When a
    task fails to charge memory, it invokes the OOM killer and loops right
    there in the charge code until it succeeds. Comparably, any other task
    that enters the charge path at this point will go to a waitqueue right
    then and there and sleep until the OOM situation is resolved. The problem
    is that these tasks may hold filesystem locks and the mmap_sem; locks that
    the selected OOM victim may need to exit.

    For example, in one reported case, the task invoking the OOM killer was
    about to charge a page cache page during a write(), which holds the
    i_mutex. The OOM killer selected a task that was just entering truncate()
    and trying to acquire the i_mutex:

    OOM invoking task:
    mem_cgroup_handle_oom+0x241/0x3b0
    mem_cgroup_cache_charge+0xbe/0xe0
    add_to_page_cache_locked+0x4c/0x140
    add_to_page_cache_lru+0x22/0x50
    grab_cache_page_write_begin+0x8b/0xe0
    ext3_write_begin+0x88/0x270
    generic_file_buffered_write+0x116/0x290
    __generic_file_aio_write+0x27c/0x480
    generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
    do_sync_write+0xea/0x130
    vfs_write+0xf3/0x1f0
    sys_write+0x51/0x90
    system_call_fastpath+0x18/0x1d

    OOM kill victim:
    do_truncate+0x58/0xa0 # takes i_mutex
    do_last+0x250/0xa30
    path_openat+0xd7/0x440
    do_filp_open+0x49/0xa0
    do_sys_open+0x106/0x240
    sys_open+0x20/0x30
    system_call_fastpath+0x18/0x1d

    The OOM handling task will retry the charge indefinitely while the OOM
    killed task is not releasing any resources.

    A similar scenario can happen when the kernel OOM killer for a memcg is
    disabled and a userspace task is in charge of resolving OOM situations.
    In this case, ALL tasks that enter the OOM path will be made to sleep on
    the OOM waitqueue and wait for userspace to free resources or increase
    the group's limit. But a userspace OOM handler is prone to deadlock
    itself on the locks held by the waiting tasks. For example, one of the
    sleeping tasks may be stuck in a brk() call with the mmap_sem held for
    writing, but the userspace handler, in order to pick an optimal victim,
    may need to read files from /proc/, which tries to acquire the same
    mmap_sem for reading and deadlocks.

    This patch changes the way tasks behave after detecting a memcg OOM and
    makes sure nobody loops or sleeps with locks held:

    1. When OOMing in a user fault, invoke the OOM killer and restart the
    fault instead of looping on the charge attempt. This way, the OOM
    victim can not get stuck on locks the looping task may hold.

    2. When OOMing in a user fault but somebody else is handling it
    (either the kernel OOM killer or a userspace handler), don't go to
    sleep in the charge context. Instead, remember the OOMing memcg in
    the task struct and then fully unwind the page fault stack with
    -ENOMEM. pagefault_out_of_memory() will then call back into the
    memcg code to check if the -ENOMEM came from the memcg, and then
    either put the task to sleep on the memcg's OOM waitqueue or just
    restart the fault. The OOM victim can no longer get stuck on any
    lock a sleeping task may hold.
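
    Put together, the two rules above look roughly like this in code (a
    simplified sketch; the helper names follow the 3.12-era code but
    details are trimmed):

    /* Arch fault handler, on VM_FAULT_OOM: fully unwind first. */
    ret = handle_mm_fault(mm, vma, address, flags);
    if (ret & VM_FAULT_OOM)
            pagefault_out_of_memory();      /* no locks held here */

    /* mm/oom_kill.c */
    void pagefault_out_of_memory(void)
    {
            /* If the -ENOMEM came from a memcg, sleep on its OOM
             * waitqueue (or OOM-kill) now; the fault then restarts. */
            if (mem_cgroup_oom_synchronize())
                    return;
            /* ... otherwise fall back to the global OOM killer ... */
    }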

    Debugged by Michal Hocko.

    Signed-off-by: Johannes Weiner
    Reported-by: azurIt
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • System calls and kernel faults (uaccess, gup) can handle an out of memory
    situation gracefully and just return -ENOMEM.

    Enable the memcg OOM killer only for user faults, where it's really the
    only option available.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: azurIt
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Unlike global OOM handling, memory cgroup code will invoke the OOM killer
    in any OOM situation because it has no way of telling faults occurring in
    kernel context - which could be handled more gracefully - from
    user-triggered faults.

    Pass a flag that identifies faults originating in user space from the
    architecture-specific fault handlers to generic code so that memcg OOM
    handling can be improved.
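
    Concretely, each arch fault handler now tags user-mode faults along
    these lines (an arch-neutral sketch; the user-mode test is whatever
    check the architecture already has, e.g. user_mode(regs) on many):

    unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

    if (user_mode(regs))            /* fault originated in user space */
            flags |= FAULT_FLAG_USER;

    fault = handle_mm_fault(mm, vma, address, flags);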

    Signed-off-by: Johannes Weiner
    Reviewed-by: Michal Hocko
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: azurIt
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The caller of the iterator might know that some nodes or even subtrees
    should be skipped, but there is no way to tell iterators about that, so
    the only choice left is to let iterators visit each node and do the
    selection outside of the iterating code. This, however, doesn't scale
    well for hierarchies with many groups where only a few groups are
    interesting.

    This patch adds a mem_cgroup_iter_cond variant of the iterator with a
    callback which gets called for every visited node. There are three
    possible ways the callback can influence the walk: the node is visited;
    it is skipped but the tree walk continues down the tree; or the whole
    subtree of the current group is skipped.
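
    A hedged sketch of the shape of the new interface (the enum and
    typedef spellings below are inferred from the patch description):

    /* What the predicate may tell the iterator about the current node. */
    enum mem_cgroup_filter_t {
            VISIT,          /* visit this node */
            SKIP,           /* skip it, but keep walking its children */
            SKIP_TREE,      /* skip the node and its whole subtree */
    };

    typedef enum mem_cgroup_filter_t
    (*mem_cgroup_iter_filter)(struct mem_cgroup *memcg,
                              struct mem_cgroup *root);

    /* Like mem_cgroup_iter(), but consults cond() at every node. */
    struct mem_cgroup *
    mem_cgroup_iter_cond(struct mem_cgroup *root, struct mem_cgroup *prev,
                         struct mem_cgroup_reclaim_cookie *reclaim,
                         mem_cgroup_iter_filter cond);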

    [hughd@google.com: fix memcg-less page reclaim]
    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Glauber Costa
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Tejun Heo
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Soft reclaim has so far been done only for global reclaim (both
    background and direct). Since "memcg: integrate soft reclaim tighter
    with zone shrinking code" there is no reason for this limitation
    anymore, as soft limit reclaim doesn't use any special code paths and
    is part of the zone shrinking code, which is used by both global and
    targeted reclaims.

    From a semantic point of view it is natural to consider the soft limit
    before touching all groups in a hierarchy tree which is hitting the
    hard limit, because the soft limit tells us where to push back when
    there is memory pressure. It is not important whether the pressure
    comes from the limit or from imbalanced zones.

    This patch simply enables soft reclaim unconditionally in
    mem_cgroup_should_soft_reclaim so it is enabled for both global and
    targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn
    about the root of the reclaim to know where to stop checking soft limit
    state of parents up the hierarchy. Say we have

      A (over soft limit)
       \
        B (below s.l., hit the hard limit)
       / \
      C   D (below s.l.)

    B is now the source of outside memory pressure for D, but we shouldn't
    soft reclaim D because it is behaving well under the B subtree and we
    can still reclaim from C (which is presumably over its limit).
    mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the
    hierarchy at B (the root of the memory pressure).
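
    In rough pseudo-C, the eligibility walk might look like this (a
    hedged, simplified sketch; the real function returns an iterator
    filter value rather than a bool):

    /* Climb from the group towards the reclaim root and report whether
     * the group or any ancestor on the way exceeds its soft limit. */
    static bool soft_reclaim_eligible(struct mem_cgroup *memcg,
                                      struct mem_cgroup *root)
    {
            for (; memcg; memcg = parent_mem_cgroup(memcg)) {
                    if (res_counter_soft_limit_excess(&memcg->res))
                            return true;
                    if (memcg == root)      /* stop at the pressure root */
                            break;
            }
            return false;
    }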

    Signed-off-by: Michal Hocko
    Reviewed-by: Glauber Costa
    Reviewed-by: Tejun Heo
    Cc: Balbir Singh
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This patchset has been sitting out of tree for quite some time without
    any objections. I would be really happy if it made it into 3.12. I do
    not want to push it too hard, but I think this work is basically ready
    and waiting longer doesn't help.

    The basic idea is quite simple. Pull soft reclaim into shrink_zone in
    the first step and get rid of the previous soft reclaim infrastructure.
    shrink_zone is now done in two passes. First it tries to do the soft
    limit reclaim and falls back to reclaim-all mode if no group is over
    the limit or no pages have been scanned. The second pass happens at the
    same priority, so the only time we waste is the memcg tree walk, which
    has been updated in the third step to have only negligible overhead.

    As a bonus we get rid of a _lot_ of code by this, and soft reclaim no
    longer stands out like before, when it wasn't integrated into the zone
    shrinking code and reclaimed at priority 0 (the testing results show
    that some workloads suffer from such an aggressive reclaim). The
    cleanup is in a separate patch because I felt it would be easier to
    review that way.

    The second step is soft limit reclaim integration into targeted
    reclaim. It should be rather straightforward. The soft limit has been
    used only for global reclaim so far, but it makes sense for any kind of
    pressure coming from up the hierarchy, including targeted reclaim.

    The third step (patches 4-8) addresses the tree walk overhead by
    enhancing memcg iterators to enable skipping whole subtrees and by
    tracking the number of over-soft-limit children at each level of the
    hierarchy. This information is updated the same way the old soft limit
    tree was updated (from memcg_check_events), so we shouldn't see any
    additional overhead. In fact mem_cgroup_update_soft_limit is much
    simpler than the tree manipulation done previously.

    __shrink_zone uses mem_cgroup_soft_reclaim_eligible as a predicate for
    mem_cgroup_iter, so the decision whether a particular group should be
    visited is done at the iterator level, which allows us to skip the
    whole subtree as well (if there is no child in excess). This reduces
    the tree walk overhead considerably.

    * TEST 1
    ========

    My primary test case was a parallel kernel build with 2 groups (make
    running with -j8 and a distribution .config in a separate cgroup
    without any hard limit) on a 32 CPU machine booted with 1GB of memory,
    with both builds bound via taskset to the Node 0 cpus.

    I was mostly interested in 2 setups: default - no soft limit set - and
    0 soft limit set for both groups. The first one should tell us whether
    the rework regresses the default behavior, while the second one should
    show us improvements in an extreme case where both workloads are
    always over the soft limit.

    /usr/bin/time -v has been used to collect the statistics and each
    configuration had 3 runs after fresh boot without any other load on the
    system.

    base is mmotm-2013-07-18-16-40
    rework all 8 patches applied on top of base

    * No-limit
    User
    no-limit/base: min: 651.92 max: 672.65 avg: 664.33 std: 8.01 runs: 6
    no-limit/rework: min: 657.34 [100.8%] max: 668.39 [99.4%] avg: 663.13 [99.8%] std: 3.61 runs: 6
    System
    no-limit/base: min: 69.33 max: 71.39 avg: 70.32 std: 0.79 runs: 6
    no-limit/rework: min: 69.12 [99.7%] max: 71.05 [99.5%] avg: 70.04 [99.6%] std: 0.59 runs: 6
    Elapsed
    no-limit/base: min: 398.27 max: 422.36 avg: 408.85 std: 7.74 runs: 6
    no-limit/rework: min: 386.36 [97.0%] max: 438.40 [103.8%] avg: 416.34 [101.8%] std: 18.85 runs: 6

    The results are within noise. Elapsed time has a bigger variance but the
    average looks good.

    * 0-limit
    User
    0-limit/base: min: 573.76 max: 605.63 avg: 585.73 std: 12.21 runs: 6
    0-limit/rework: min: 645.77 [112.6%] max: 666.25 [110.0%] avg: 656.97 [112.2%] std: 7.77 runs: 6
    System
    0-limit/base: min: 69.57 max: 71.13 avg: 70.29 std: 0.54 runs: 6
    0-limit/rework: min: 68.68 [98.7%] max: 71.40 [100.4%] avg: 69.91 [99.5%] std: 0.87 runs: 6
    Elapsed
    0-limit/base: min: 1306.14 max: 1550.17 avg: 1430.35 std: 90.86 runs: 6
    0-limit/rework: min: 404.06 [30.9%] max: 465.94 [30.1%] avg: 434.81 [30.4%] std: 22.68 runs: 6

    The improvement is really huge here (even bigger than with my previous
    testing, and I suspect this highly depends on the storage). Page
    fault statistics tell us at least part of the story:

    Minor
    0-limit/base: min: 37180461.00 max: 37319986.00 avg: 37247470.00 std: 54772.71 runs: 6
    0-limit/rework: min: 36751685.00 [98.8%] max: 36805379.00 [98.6%] avg: 36774506.33 [98.7%] std: 17109.03 runs: 6
    Major
    0-limit/base: min: 170604.00 max: 221141.00 avg: 196081.83 std: 18217.01 runs: 6
    0-limit/rework: min: 2864.00 [1.7%] max: 10029.00 [4.5%] avg: 5627.33 [2.9%] std: 2252.71 runs: 6

    As with my previous testing, Minor faults are more or less within
    noise, but the Major fault count is way below the base kernel.

    While this looks like a nice win, it is fair to say that the 0-limit
    configuration is quite artificial. So I was playing with 0-no-limit
    loads as well.

    * TEST 2
    ========

    The following results are from 2 groups configuration on a 16GB machine
    (single NUMA node).

    - A running stream IO (dd if=/dev/zero of=local.file bs=1024) with
    2*TotalMem and a 0 soft limit.
    - B running a mem_eater which consumes TotalMem-1G without any limit.
    The mem_eater consumes the memory in 100 chunks with a 1s nap after
    each mmap+populate so that both loads have a chance to fight for the
    memory.

    The expected result is that B shouldn't be reclaimed and A shouldn't
    see a big drop in elapsed time.

    User
    base: min: 2.68 max: 2.89 avg: 2.76 std: 0.09 runs: 3
    rework: min: 3.27 [122.0%] max: 3.74 [129.4%] avg: 3.44 [124.6%] std: 0.21 runs: 3
    System
    base: min: 86.26 max: 88.29 avg: 87.28 std: 0.83 runs: 3
    rework: min: 81.05 [94.0%] max: 84.96 [96.2%] avg: 83.14 [95.3%] std: 1.61 runs: 3
    Elapsed
    base: min: 317.28 max: 332.39 avg: 325.84 std: 6.33 runs: 3
    rework: min: 281.53 [88.7%] max: 298.16 [89.7%] avg: 290.99 [89.3%] std: 6.98 runs: 3

    System time improved slightly, as did Elapsed. My previous testing
    showed worse numbers, but this again seems to depend on the storage
    speed.

    My theory is that the writeback doesn't catch up and prio-0 soft
    reclaim falls into waiting on writeback pages too often in the base
    kernel. The patched kernel doesn't do that because the soft reclaim is
    done from the kswapd/direct reclaim context. This can be seen nicely
    on the following graphs: group A's usage_in_bytes regularly drops
    really low.

    All 3 runs
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream.png
    resp. a detail of the single run
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream-one-run.png

    mem_eater seems to be doing better as well. It gets to the full
    allocation size faster as can be seen on the following graph:
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/mem_eater-one-run.png

    /proc/meminfo collected during the test also shows that the rework
    kernel hasn't swapped that much (well, almost not at all):
    base: max: 123900 K avg: 56388.29 K
    rework: max: 300 K avg: 128.68 K

    kswapd and direct reclaim statistics are unfortunately of no use
    because soft reclaim is not accounted properly, as the counters are
    hidden by global_reclaim() checks in the base kernel.

    * TEST 3
    ========

    Another test was the same configuration as TEST2 except the stream IO was
    replaced by a single kbuild (16 parallel jobs bound to Node0 cpus same as
    in TEST1) and mem_eater allocated TotalMem-200M so kbuild had only 200MB
    left.

    Kbuild did better with the rework kernel here as well:
    User
    base: min: 860.28 max: 872.86 avg: 868.03 std: 5.54 runs: 3
    rework: min: 880.81 [102.4%] max: 887.45 [101.7%] avg: 883.56 [101.8%] std: 2.83 runs: 3
    System
    base: min: 84.35 max: 85.06 avg: 84.79 std: 0.31 runs: 3
    rework: min: 85.62 [101.5%] max: 86.09 [101.2%] avg: 85.79 [101.2%] std: 0.21 runs: 3
    Elapsed
    base: min: 135.36 max: 243.30 avg: 182.47 std: 45.12 runs: 3
    rework: min: 110.46 [81.6%] max: 116.20 [47.8%] avg: 114.15 [62.6%] std: 2.61 runs: 3
    Minor
    base: min: 36635476.00 max: 36673365.00 avg: 36654812.00 std: 15478.03 runs: 3
    rework: min: 36639301.00 [100.0%] max: 36695541.00 [100.1%] avg: 36665511.00 [100.0%] std: 23118.23 runs: 3
    Major
    base: min: 14708.00 max: 53328.00 avg: 31379.00 std: 16202.24 runs: 3
    rework: min: 302.00 [2.1%] max: 414.00 [0.8%] avg: 366.33 [1.2%] std: 47.22 runs: 3

    Again we can see a significant improvement in Elapsed (it also seems
    to be more stable); there is a huge drop in Major page faults and much
    less swapping:
    base: max: 583736 K avg: 112547.43 K
    rework: max: 4012 K avg: 124.36 K

    Graphs from all three runs show the variability of the kbuild quite
    nicely. It even seems that it took longer after every run with the
    base kernel, which would be quite surprising as the source tree for
    the build is removed and caches are dropped after each run, so the
    build operates on freshly extracted sources every time.
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater.png

    My other testing shows that this is just a matter of timing and other
    runs behave differently; the std for Elapsed time is similar (~50).
    An example of three other runs:
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater2.png

    So to wrap this up: the series is still doing well and improves soft
    limit reclaim.

    The testing results for a bunch of cgroups with both stream IO and
    kbuild loads can be found in "memcg: track children in soft limit
    excess to improve soft limit".

    This patch:

    Memcg soft reclaim has traditionally been triggered from the global
    reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim
    then picked up the group which exceeded the soft limit the most and
    reclaimed it with priority 0 to reclaim at least SWAP_CLUSTER_MAX
    pages.

    The infrastructure requires per-node-zone trees which hold over-limit
    groups and keep them up to date (via memcg_check_events), which is not
    cost free. Although this overhead hasn't turned out to be a bottleneck,
    the implementation is suboptimal because mem_cgroup_update_tree has no
    idea which zones consumed memory over the limit, so we could easily end
    up having a group on a node-zone tree with only a few pages from that
    node-zone.

    This patch doesn't try to fix the node-zone trees management because it
    seems that integrating soft reclaim into zone shrinking is much easier
    and more appropriate for several reasons. First of all, priority-0
    reclaim was a crude hack which might lead to big stalls if the group's
    LRUs are big and hard to reclaim (e.g. a lot of dirty/writeback pages).
    Second, soft reclaim should be applicable also to targeted reclaim,
    which is awkward right now without additional hacks. Last but not
    least, the whole infrastructure eats quite some code.

    After this patch shrink_zone is done in 2 passes. First it tries to do
    the soft reclaim if appropriate (only for global reclaim for now, to
    stay compatible with the original state) and falls back to ignoring the
    soft limit if no group is eligible for soft reclaim or nothing has been
    scanned during the first pass. Only groups which are over their soft
    limit, or which have a parent up the hierarchy over the limit, are
    considered eligible during the first pass.

    The soft limit tree, which is no longer necessary, will be removed in a
    follow-up patch to make this patch smaller and easier to review.
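
    In rough pseudo-C the two-pass structure looks like this (a
    simplified sketch; the real helpers take more arguments):

    static void shrink_zone(struct zone *zone, struct scan_control *sc)
    {
            bool soft = mem_cgroup_should_soft_reclaim(sc);
            unsigned long before = sc->nr_scanned;

            /* Pass 1: walk only groups eligible for soft reclaim. */
            __shrink_zone(zone, sc, soft);

            /* Pass 2: if nobody was eligible (or nothing got scanned),
             * reclaim from everybody at the same priority. */
            if (soft && sc->nr_scanned == before)
                    __shrink_zone(zone, sc, false);
    }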

    Signed-off-by: Michal Hocko
    Reviewed-by: Glauber Costa
    Reviewed-by: Tejun Heo
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Ying Han
    Cc: Hugh Dickins
    Cc: Michel Lespinasse
    Cc: Greg Thelen
    Cc: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Pull vfs pile 4 from Al Viro:
    "list_lru pile, mostly"

    This came out of Andrew's pile; Al ended up doing the merge work so
    that Andrew didn't have to.

    Additionally, a few fixes.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
    super: fix for destroy lrus
    list_lru: dynamically adjust node arrays
    shrinker: Kill old ->shrink API.
    shrinker: convert remaining shrinkers to count/scan API
    staging/lustre/libcfs: cleanup linux-mem.h
    staging/lustre/ptlrpc: convert to new shrinker API
    staging/lustre/obdclass: convert lu_object shrinker to count/scan API
    staging/lustre/ldlm: convert to shrinkers to count/scan API
    hugepage: convert huge zero page shrinker to new shrinker API
    i915: bail out earlier when shrinker cannot acquire mutex
    drivers: convert shrinkers to new count/scan API
    fs: convert fs shrinkers to new scan/count API
    xfs: fix dquot isolation hang
    xfs-convert-dquot-cache-lru-to-list_lru-fix
    xfs: convert dquot cache lru to list_lru
    xfs: rework buffer dispose list tracking
    xfs-convert-buftarg-lru-to-generic-code-fix
    xfs: convert buftarg LRU to generic code
    fs: convert inode and dentry shrinking to be node aware
    vmscan: per-node deferred work
    ...

    Linus Torvalds
     
  • Pull LED updates from Bryan Wu:
    "Sorry for the late pull request, since I'm just back from vacation.

    LED subsystem updates for 3.12:
    - pca9633 driver DT support and pca9634 chip support
    - restore legacy device attributes for lp5521
    - other fixes and updates"

    * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/cooloney/linux-leds: (28 commits)
    leds: wm831x-status: Request a REG resource
    leds: trigger: ledtrig-backlight: Fix invalid memory access in fb_event notification callback
    leds-pca963x: Fix device tree parsing
    leds-pca9633: Rename to leds-pca963x
    leds-pca9633: Add mutex to the ledout register
    leds-pca9633: Unique naming of the LEDs
    leds-pca9633: Add support for PCA9634
    leds: lp5562: use LP55xx common macros for device attributes
    Documentation: leds-lp5521,lp5523: update device attribute information
    leds: lp5523: remove unnecessary writing commands
    leds: lp5523: restore legacy device attributes
    leds: lp5523: LED MUX configuration on initializing
    leds: lp5523: make separate API for loading engine
    leds: lp5521: remove unnecessary writing commands
    leds: lp5521: restore legacy device attributes
    leds: lp55xx: add common macros for device attributes
    leds: lp55xx: add common data structure for program
    Documentation: leds: Fix a typo
    leds: ss4200: Fix incorrect placement of __initdata
    leds: clevo-mail: Fix incorrect placement of __initdata
    ...

    Linus Torvalds
     
  • Pull IOMMU Updates from Joerg Roedel:
    "This round the updates contain:

    - A new driver for the Freescale PAMU IOMMU from Varun Sethi.

    This driver has cooked for a while and required changes to the
    IOMMU-API and infrastructure that were already merged before.

    - Updates for the ARM-SMMU driver from Will Deacon

    - Various fixes, the most important one is probably a fix from Alex
    Williamson for a memory leak in the VT-d page-table freeing code

    In summary not all that much. The biggest part in the diffstat is the
    new PAMU driver"

    * tag 'iommu-updates-v3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
    intel-iommu: Fix leaks in pagetable freeing
    iommu/amd: Fix resource leak in iommu_init_device()
    iommu/amd: Clean up unnecessary MSI/MSI-X capability find
    iommu/arm-smmu: Simplify VMID and ASID allocation
    iommu/arm-smmu: Don't use VMIDs for stage-1 translations
    iommu/arm-smmu: Tighten up global fault reporting
    iommu/arm-smmu: Remove broken big-endian check
    iommu/fsl: Remove unnecessary 'fsl-pamu' prefixes
    iommu/fsl: Fix whitespace problems noticed by git-am
    iommu/fsl: Freescale PAMU driver and iommu implementation.
    iommu/fsl: Add additional iommu attributes required by the PAMU driver.
    powerpc: Add iommu domain pointer to device archdata
    iommu/exynos: Remove dead code (set_prefbuf)

    Linus Torvalds
     
  • Pull ACPI and power management fixes from Rafael Wysocki:
    "All of these commits are fixes that have emerged recently and some of
    them fix bugs introduced during this merge window.

    Specifics:

    1) ACPI-based PCI hotplug (ACPIPHP) fixes related to spurious events

    After the recent ACPIPHP changes we've seen some interesting
    breakage on a system that triggers device check notifications
    during boot for non-existent devices. Although those
    notifications are really spurious, we should be able to deal with
    them nevertheless, and that shouldn't introduce too much overhead.
    Four commits to make that work properly.

    2) Memory hotplug and hibernation mutual exclusion rework

    This was meant to be a cleanup, but it happens to fix a classical
    ABBA deadlock between system suspend/hibernation and ACPI memory
    hotplug which is possible if they are started roughly at the same
    time. Three commits rework memory hotplug so that it doesn't
    acquire pm_mutex and make hibernation use device_hotplug_lock,
    which prevents it from racing with memory hotplug.

    3) ACPI Intel LPSS (Low-Power Subsystem) driver crash fix

    The ACPI LPSS driver crashes during boot on the Apple MacBook Air
    with Haswell, which has a slightly unusual BIOS configuration in
    which one of the LPSS device's _CRS methods doesn't return all of
    the information expected by the driver. Fix from Mika Westerberg,
    for stable.

    4) ACPICA fix related to Store->ArgX operation

    AML interpreter fix for obscure breakage that causes AML to be
    executed incorrectly on some machines (observed in practice).
    From Bob Moore.

    5) ACPI core fix for PCI ACPI device objects lookup

    There still are cases in which there is more than one ACPI device
    object matching a given PCI device and we don't choose the one
    that the BIOS expects us to choose, so this makes the lookup take
    more criteria into account in those cases.

    6) Fix to prevent cpuidle from crashing in some rare cases

    If the result of cpuidle_get_driver() is NULL, which can happen on
    some systems, cpuidle_driver_ref() will crash trying to use that
    pointer, and Daniel Fu's fix prevents that from happening.

    7) cpufreq fixes related to CPU hotplug

    Stephen Boyd reported a number of concurrency problems with
    cpufreq related to CPU hotplug which are addressed by a series of
    fixes from Srivatsa S Bhat and Viresh Kumar.

    8) cpufreq fix for time conversion in time_in_state attribute

    Time conversion carried out by cpufreq when user space attempts to
    read /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state
    won't work correctly if cputime_t doesn't map directly to jiffies.
    Fix from Andreas Schwab.

    9) Revert of a troublesome cpufreq commit

    Commit 7c30ed5 (cpufreq: make sure frequency transitions are
    serialized) was intended to address some known concurrency
    problems in cpufreq related to the ordering of transitions, but
    unfortunately it introduced several problems of its own, so I
    decided to revert it now and address the original problems later
    in a more robust way.

    10) Intel Haswell CPU models for intel_pstate from Nell Hardcastle.

    11) cpufreq fixes related to system suspend/resume

    The recent cpufreq changes that made it preserve CPU sysfs
    attributes over suspend/resume cycles introduced a possible NULL
    pointer dereference that caused it to crash during the second
    attempt to suspend. Three commits from Srivatsa S Bhat fix that
    problem and a couple of related issues.

    12) cpufreq locking fix

    cpufreq_policy_restore() should acquire the lock for reading, but
    it acquires it for writing. Fix from Lan Tianyu"

    * tag 'pm+acpi-fixes-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (25 commits)
    cpufreq: Acquire the lock in cpufreq_policy_restore() for reading
    cpufreq: Prevent problems in update_policy_cpu() if last_cpu == new_cpu
    cpufreq: Restructure if/else block to avoid unintended behavior
    cpufreq: Fix crash in cpufreq-stats during suspend/resume
    intel_pstate: Add Haswell CPU models
    Revert "cpufreq: make sure frequency transitions are serialized"
    cpufreq: Use signed type for 'ret' variable, to store negative error values
    cpufreq: Remove temporary fix for race between CPU hotplug and sysfs-writes
    cpufreq: Synchronize the cpufreq store_*() routines with CPU hotplug
    cpufreq: Invoke __cpufreq_remove_dev_finish() after releasing cpu_hotplug.lock
    cpufreq: Split __cpufreq_remove_dev() into two parts
    cpufreq: Fix wrong time unit conversion
    cpufreq: serialize calls to __cpufreq_governor()
    cpufreq: don't allow governor limits to be changed when it is disabled
    ACPI / bind: Prefer device objects with _STA to those without it
    ACPI / hotplug / PCI: Avoid parent bus rescans on spurious device checks
    ACPI / hotplug / PCI: Use _OST to notify firmware about notify status
    ACPI / hotplug / PCI: Avoid doing too much for spurious notifies
    ACPICA: Fix for a Store->ArgX when ArgX contains a reference to a field.
    ACPI / hotplug / PCI: Don't trim devices before scanning the namespace
    ...

    Linus Torvalds
     
  • Let's not pollute the include files with inline functions that are only
    used in a single place. Especially not if we decide we might want to
    change the semantics of said function to make it more efficient.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull btrfs updates from Chris Mason:
    "This is against 3.11-rc7, but was pulled and tested against your tree
    as of yesterday. We do have two small incrementals queued up, but I
    wanted to get this bunch out the door before I hop on an airplane.

    This is a fairly large batch of fixes, performance improvements, and
    cleanups from the usual Btrfs suspects.

    We've included Stefan Behrens' work to index subvolume UUIDs, which is
    targeted at speeding up send/receive with many subvolumes or snapshots
    in place. It closes a long-standing performance issue that was built
    into the disk format.

    Mark Fasheh's offline dedup work is also here. In this case offline
    means the FS is mounted and active, but the dedup work is not done
    inline during file IO. This is a building block where utilities are
    able to ask the FS to dedup a series of extents. The kernel takes
    care of verifying the data involved really is the same. Today this
    involves reading both extents, but we'll continue to evolve the
    patches"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits)
    Btrfs: optimize key searches in btrfs_search_slot
    Btrfs: don't use an async starter for most of our workers
    Btrfs: only update disk_i_size as we remove extents
    Btrfs: fix deadlock in uuid scan kthread
    Btrfs: stop refusing the relocation of chunk 0
    Btrfs: fix memory leak of uuid_root in free_fs_info
    btrfs: reuse kbasename helper
    btrfs: return btrfs error code for dev excl ops err
    Btrfs: allow partial ordered extent completion
    Btrfs: convert all bug_ons in free-space-cache.c
    Btrfs: add support for asserts
    Btrfs: adjust the fs_devices->missing count on unmount
    Btrfs: cleanup: don't check for root_refs == 0 twice
    Btrfs: fix for patch "cleanup: don't check the same thing twice"
    Btrfs: get rid of one BUG() in write_all_supers()
    Btrfs: allocate prelim_ref with a slab allocater
    Btrfs: pass gfp_t to __add_prelim_ref() to avoid always using GFP_ATOMIC
    Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl
    Btrfs: fix race between removing a dev and writing sbs
    Btrfs: remove ourselves from the cluster list under lock
    ...

    Linus Torvalds
     
  • The sequence lock (seqlock) was originally designed for cases where
    the readers do not need to block the writers: the readers retry the
    read operation when the data changes.

    Since then, the use cases have expanded to include situations where a
    thread does not need to change the data (effectively a reader) at all
    but has to take the writer lock because it can't tolerate changes to
    the protected structure. Some examples are the d_path() function and
    the getcwd() syscall in fs/dcache.c, where the functions take the
    writer lock on rename_lock even though they don't need to change
    anything in the protected data structure at all. This is inefficient,
    as such a reader blocks other sequence-number-reading readers from
    moving forward by pretending to be a writer.

    This patch tries to eliminate this inefficiency by introducing a new
    type of locking reader to the seqlock locking mechanism. This new
    locking reader takes an exclusive lock, preventing other writers and
    locking readers from going forward. However, it won't affect the
    progress of the other sequence-number-reading readers, as the
    sequence number won't be changed.
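
    A sketch of the difference, using the seqlock API
    (read_seqlock_excl()/read_sequnlock_excl() are the locking-reader
    entry points this patch introduces):

    /* Classic lockless reader: retry if a writer changed the data. */
    unsigned seq;
    do {
            seq = read_seqbegin(&rename_lock);
            /* ... read the protected structure ... */
    } while (read_seqretry(&rename_lock, seq));

    /* New locking reader: excludes writers (and other locking readers)
     * without bumping the sequence number, so the lockless readers
     * above are not forced to retry. */
    read_seqlock_excl(&rename_lock);
    /* ... read the protected structure, safe from changes ... */
    read_sequnlock_excl(&rename_lock);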

    Signed-off-by: Waiman Long
    Cc: Alexander Viro
    Signed-off-by: Linus Torvalds

    Waiman Long
     

12 Sep, 2013

11 commits

  • Pull sound fixes from Takashi Iwai:
    "A few last-minute fixes for 3.12-rc1. All patches are driver
    specific.

    - HD-audio fixes: MacBook 6,1/6,2 speaker fix, ASUS TX300 dock
    speaker fix, Toshiba Satellite irq fix, Haswell HDMI audio
    cleanups

    - ASoC fixes: atmel irq fix, fsl DT fix, mc13783 spi fix, kirkwood
    compatible string change, etc"

    * tag 'sound-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
    ASoC: mc13783: add spi errata fix
    ASoC: rsnd: fixup flag name of rsnd_scu_platform_info
    ALSA: hda - Add CS4208 codec support for MacBook 6,1 and 6,2
    ALSA: hda - Add Toshiba Satellite C870 to MSI blacklist
    ASoC: fsl_spdif: Select regmap-mmio
    ALSA: hda - unmute pin amplifier in infoframe setup for Haswell
    ALSA: hda - define is_haswell() to check if a display audio codec is Haswell
    ALSA: hda - Add dock speaker support for ASUS TX300
    ASoC: kirkwood: change the compatible string of the kirkwood-i2s driver
    ASoC: atmel: disable error interrupt
    ASoC: fsl: imx-audmux: Do not call imx_audmux_parse_dt_defaults() on non-dt kernel

    Linus Torvalds
     
  • Joerg Roedel
     
  • Pull thermal management updates from Zhang Rui:
    "We have a lot of SOC changes and a few thermal core fixes this time.

    The biggest change is about exynos thermal driver restructure. The
    patch set adds TMU (Thermal management Unit) driver support for
    exynos5440 platform. There are 3 instances of the TMU controllers so
    necessary cleanup/re-structure is done to handle multiple thermal
    zone.

    The next biggest change is the introduction of the imx thermal driver.
    It adds the imx thermal support using Temperature Monitor (TEMPMON)
    block found on some Freescale i.MX SoCs. The driver uses syscon
    regmap interface to access TEMPMON control registers and calibration
    data, and supports cpufreq as the cooling device.

    Highlights:

    - restructure exynos thermal driver.

    - introduce new imx thermal driver.

    - fix a bug in thermal core, which powers on the fans unexpectedly
    after resume from suspend"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (46 commits)
    drivers: thermal: add check when unregistering cpu cooling
    thermal: thermal_core: allow binding with limits on bind_params
    drivers: thermal: make usage of CONFIG_THERMAL_HWMON optional
    drivers: thermal: parent virtual hwmon with thermal zone
    thermal: hwmon: move hwmon support to single file
    thermal: exynos: Clean up non-DT remnants
    thermal: exynos: Fix potential NULL pointer dereference
    thermal: exynos: Fix typos in Kconfig
    thermal: ti-soc-thermal: Ensure to compute thermal trend
    thermal: ti-soc-thermal: Set the bandgap mask counter delay value
    thermal: ti-soc-thermal: Initialize counter_delay field for TI DRA752 sensors
    thermal: step_wise: return instance->target by default
    thermal: step_wise: cdev only needs update on a new target state
    Thermal/cpu_cooling: Return directly for the cpu out of allowed_cpus in the cpufreq_thermal_notifier()
    thermal: exynos_tmu: fix wrong error check for mapped memory
    thermal: imx: implement thermal alarm interrupt handling
    thermal: imx: dynamic passive and SoC specific critical trip points
    Documentation: thermal: Explain the exynos thermal driver model
    ARM: dts: thermal: exynos: Add documentation for Exynos SoC thermal bindings
    thermal: exynos: Support for TMU regulator defined at device tree
    ...

    Linus Torvalds
     
  • Pull CIFS fixes from Steve French:
    "CIFS update including case insensitive file name matching improvements
    for UTF-8 to Unicode, various small cifs fixes, SMB2/SMB3 leasing
    improvements, support for following SMB2 symlinks, SMB3 packet signing
    improvements"

    * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: (25 commits)
    CIFS: Respect epoch value from create lease context v2
    CIFS: Add create lease v2 context for SMB3
    CIFS: Move parsing lease buffer to ops struct
    CIFS: Move creating lease buffer to ops struct
    CIFS: Store lease state itself rather than a mapped oplock value
    CIFS: Replace clientCanCache* bools with an integer
    [CIFS] quiet sparse compile warning
    cifs: Start using per session key for smb2/3 for signature generation
    cifs: Add a variable specific to NTLMSSP for key exchange.
    cifs: Process post session setup code in respective dialect functions.
    CIFS: convert to use le32_add_cpu()
    CIFS: Fix missing lease break
    CIFS: Fix a memory leak when a lease break comes
    cifs: add winucase_convert.pl to Documentation/ directory
    cifs: convert case-insensitive dentry ops to use new case conversion routines
    cifs: add new case-insensitive conversion routines that are based on wchar_t's
    [CIFS] Add Scott to list of cifs contributors
    cifs: Move and expand MAX_SERVER_SIZE definition
    cifs: Expand max share name length to 256
    cifs: Move string length definitions to uapi
    ...

    Linus Torvalds
     
  • Merge first patch-bomb from Andrew Morton:
    - Some pidns/fork/exec tweaks
    - OCFS2 updates
    - Most of MM - there remain quite a few memcg parts which depend on
    pending core cgroups changes. Which might have been already merged -
    I'll check tomorrow...
    - Various misc stuff all over the place
    - A few block bits which I never got around to sending to Jens -
    relatively minor things.
    - MAINTAINERS maintenance
    - A small number of lib/ updates
    - checkpatch updates
    - epoll
    - firmware/dmi-scan
    - Some kprobes work for S390
    - drivers/rtc updates
    - hfsplus feature work
    - vmcore feature work
    - rbtree upgrades
    - AOE updates
    - pktcdvd cleanups
    - PPS
    - memstick
    - w1
    - New "inittmpfs" feature, which does the obvious
    - More IPC work from Davidlohr.

    * emailed patches from Andrew Morton: (303 commits)
    lz4: fix compression/decompression signedness mismatch
    ipc: drop ipc_lock_check
    ipc, shm: drop shm_lock_check
    ipc: drop ipc_lock_by_ptr
    ipc, shm: guard against non-existant vma in shmdt(2)
    ipc: document general ipc locking scheme
    ipc,msg: drop msg_unlock
    ipc: rename ids->rw_mutex
    ipc,shm: shorten critical region for shmat
    ipc,shm: cleanup do_shmat pasta
    ipc,shm: shorten critical region for shmctl
    ipc,shm: make shmctl_nolock lockless
    ipc,shm: introduce shmctl_nolock
    ipc: drop ipcctl_pre_down
    ipc,shm: shorten critical region in shmctl_down
    ipc,shm: introduce lockless functions to obtain the ipc object
    initmpfs: use initramfs if rootfstype= or root= specified
    initmpfs: make rootfs use tmpfs when CONFIG_TMPFS enabled
    initmpfs: move rootfs code from fs/ramfs/ to init/
    initmpfs: move bdi setup from init_rootfs to init_ramfs
    ...

    Linus Torvalds
     
  • The LZ4 compression and decompression functions require input/output
    parameters that differ in signedness: unsigned char for compression
    and signed char for decompression.

    Change the decompression API to require "(const) unsigned char *".
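
    Roughly, assuming the 3.12-era prototypes in include/linux/lz4.h
    (a hedged sketch of the signature change):

    /* before */
    int lz4_decompress(const char *src, size_t *src_len,
                       char *dest, size_t actual_dest_len);
    /* after: matches the unsigned char used on the compression side */
    int lz4_decompress(const unsigned char *src, size_t *src_len,
                       unsigned char *dest, size_t actual_dest_len);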

    Signed-off-by: Sergey Senozhatsky
    Cc: Kyungsik Lee
    Cc: Geert Uytterhoeven
    Cc: Yann Collet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Since in some situations the lock can be shared by readers, we
    shouldn't be calling it a mutex; rename it to rwsem.

    Signed-off-by: Davidlohr Bueso
    Tested-by: Sedat Dilek
    Cc: Rik van Riel
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • When the rootfs code was a wrapper around ramfs, having them in the same
    file made sense. Now that it can wrap another filesystem type, move it in
    with the init code instead.

    This also allows a subsequent patch to access rootfstype= command line
    arg.

    Signed-off-by: Rob Landley
    Cc: Jeff Layton
    Cc: Jens Axboe
    Cc: Stephen Warren
    Cc: Rusty Russell
    Cc: Jim Cromie
    Cc: Sam Ravnborg
    Cc: Greg Kroah-Hartman
    Cc: "Eric W. Biederman"
    Cc: Alexander Viro
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rob Landley
     
  • With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
    one such possible user), the following race can happen:

    radix_tree_preload()
    ...
    radix_tree_insert()
      radix_tree_node_alloc()
        if (rtp->nr) {
          ret = rtp->nodes[rtp->nr - 1];
    <interrupt>
    ...
    radix_tree_preload()
    ...
    radix_tree_insert()
      radix_tree_node_alloc()
        if (rtp->nr) {
          ret = rtp->nodes[rtp->nr - 1];

    And we give out one radix tree node twice. That clearly results in
    radix tree corruption with different outcomes (usually an OOPS)
    depending on which two users of the radix tree race.

    We fix the problem by making radix_tree_node_alloc() always allocate fresh
    radix tree nodes when in interrupt. Using preloading when in interrupt
    doesn't make sense since all the allocations have to be atomic anyway and
    we cannot steal nodes from process-context users because some users rely
    on radix_tree_insert() succeeding after radix_tree_preload().
    The in_interrupt() check is somewhat ugly but we cannot simply key off
    the passed gfp_mask as that is acquired from root_gfp_mask() and thus
    the same for all preload users.

    Another part of the fix is to avoid node preallocation in
    radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again,
    preallocation in such case doesn't make sense and when preallocation would
    happen in interrupt we could possibly leak some allocated nodes. However,
    some users of radix_tree_preload() require following radix_tree_insert()
    to succeed. To avoid unexpected effects for these users,
    radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
    and we provide a new function radix_tree_maybe_preload() for those users
    which get different gfp mask from different call sites and which are
    prepared to handle radix_tree_insert() failure.
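
    For reference, the usual process-context preload pattern this
    protects (a standard-usage sketch; the tree and lock names are
    placeholders):

    /* Preload ensures the following insert cannot fail on allocation. */
    if (radix_tree_preload(GFP_KERNEL))
            return -ENOMEM;

    spin_lock(&my_tree_lock);
    err = radix_tree_insert(&my_tree, index, item);
    spin_unlock(&my_tree_lock);

    radix_tree_preload_end();       /* re-enables preemption */

    /* Call sites that can tolerate radix_tree_insert() failure and get
     * varying gfp masks can use radix_tree_maybe_preload() instead. */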

    Signed-off-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Because deletion (of the entire tree) is a relatively common use of
    rbtree postorder iteration, and because doing it safely means fiddling
    with temporary storage, provide a helper to simplify safe postorder
    rbtree iteration.
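
    Assuming the helper is the rbtree_postorder_for_each_entry_safe()
    macro this series adds, usage looks roughly like:

    struct mynode {                 /* placeholder node type */
            struct rb_node rb;
            int data;
    };

    struct mynode *pos, *n;

    /* Free every node; 'n' caches the next node before 'pos' is freed,
     * so the links of freed nodes are never examined again. */
    rbtree_postorder_for_each_entry_safe(pos, n, &root, rb)
            kfree(pos);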

    Signed-off-by: Cody P Schafer
    Reviewed-by: Seth Jennings
    Cc: David Woodhouse
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Postorder iteration yields all of a node's children prior to yielding the
    node itself, and this particular implementation also avoids examining the
    leaf links in a node after that node has been yielded.

    In what I expect will be its most common usage, postorder iteration allows
    the deletion of every node in an rbtree without modifying the rbtree nodes
    (no _requirement_ that they be nulled) while avoiding referencing child
    nodes after they have been "deleted" (most commonly, freed).

    I have only updated zswap to use this functionality at this point, but
    numerous bits of code (most notably in the filesystem drivers) use a
    hand-rolled postorder iteration that NULLs child links as it traverses
    the tree. Each of those instances could be replaced with this common
    implementation.

    1 & 2 add rbtree postorder iteration functions.
    3 adds testing of the iteration to the rbtree runtime tests
    4 allows building the rbtree runtime tests as builtins
    5 updates zswap.

    This patch:

    Add postorder iteration functions for rbtree. These are useful for safely
    freeing an entire rbtree without modifying the tree at all.
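
    At the lowest level the iteration presumably looks like this (a
    hedged sketch using the rb_first_postorder()/rb_next_postorder()
    primitives the patch adds; 'struct mynode' is a placeholder):

    struct rb_node *node = rb_first_postorder(&root);

    while (node) {
            /* Fetch the successor first: postorder guarantees it never
             * passes back through 'node', so 'node' can be freed. */
            struct rb_node *next = rb_next_postorder(node);

            kfree(rb_entry(node, struct mynode, rb));
            node = next;
    }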

    Signed-off-by: Cody P Schafer
    Reviewed-by: Seth Jennings
    Cc: David Woodhouse
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer