01 Aug, 2012

3 commits

  • Sanity:

    CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
    CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

    [mhocko@suse.cz: fix missed bits]
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Since per-BDI flusher threads were introduced in 2.6, the pdflush
    mechanism is not used any more. But the old interface exported through
    /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.

    For backward compatibility, print a warning via printk and return 2 to
    notify users that the interface has been removed.

    Signed-off-by: Wanpeng Li
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    vm_stat_account() currently accounts shared_vm, stack_vm and
    reserved_vm. But it can also account total_vm, which makes the code
    tidier.

    Even for mprotect_fixup(), we still get the right result in the end.

    Signed-off-by: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     

31 Jul, 2012

17 commits

    When the requested range is outside of the root range, the logic in
    __reserve_region_with_split() will cause an infinite recursion which
    overflows the stack, as seen in the warning below.

    This particular stack overflow was caused by requesting the
    (100000000-107ffffff) range while the root range was (0-ffffffff). In
    this case __request_resource would return the whole root range as
    conflict range (i.e. 0-ffffffff). Then, the logic in
    __reserve_region_with_split would continue the recursion requesting the
    new range as (conflict->end+1, end) which incidentally in this case
    equals the originally requested range.

    This patch aborts looking for a usable range when the request does not
    intersect with the root range. When the request partially overlaps with
    the root range, it adjusts the request to fall within the root range and
    then continues with the new request.

    When the request is modified or aborted, an error and a stack trace are
    logged to allow catching the problem in the upper layers.

    [ 5.968374] WARNING: at kernel/sched.c:4129 sub_preempt_count+0x63/0x89()
    [ 5.975150] Modules linked in:
    [ 5.978184] Pid: 1, comm: swapper Not tainted 3.0.22-mid27-00004-gb72c817 #46
    [ 5.985324] Call Trace:
    [ 5.987759] [] ? console_unlock+0x17b/0x18d
    [ 5.992891] [] warn_slowpath_common+0x48/0x5d
    [ 5.998194] [] ? sub_preempt_count+0x63/0x89
    [ 6.003412] [] warn_slowpath_null+0xf/0x13
    [ 6.008453] [] sub_preempt_count+0x63/0x89
    [ 6.013499] [] _raw_spin_unlock+0x27/0x3f
    [ 6.018453] [] add_partial+0x36/0x3b
    [ 6.022973] [] deactivate_slab+0x96/0xb4
    [ 6.027842] [] __slab_alloc.isra.54.constprop.63+0x204/0x241
    [ 6.034456] [] ? kzalloc.constprop.5+0x29/0x38
    [ 6.039842] [] ? kzalloc.constprop.5+0x29/0x38
    [ 6.045232] [] kmem_cache_alloc_trace+0x51/0xb0
    [ 6.050710] [] ? kzalloc.constprop.5+0x29/0x38
    [ 6.056100] [] kzalloc.constprop.5+0x29/0x38
    [ 6.061320] [] __reserve_region_with_split+0x1c/0xd1
    [ 6.067230] [] __reserve_region_with_split+0xc6/0xd1
    ...
    [ 7.179057] [] __reserve_region_with_split+0xc6/0xd1
    [ 7.184970] [] reserve_region_with_split+0x30/0x42
    [ 7.190709] [] e820_reserve_resources_late+0xd1/0xe9
    [ 7.196623] [] pcibios_resource_survey+0x23/0x2a
    [ 7.202184] [] pcibios_init+0x23/0x35
    [ 7.206789] [] pci_subsys_init+0x3f/0x44
    [ 7.211659] [] do_one_initcall+0x72/0x122
    [ 7.216615] [] ? pci_legacy_init+0x3d/0x3d
    [ 7.221659] [] kernel_init+0xa6/0x118
    [ 7.226265] [] ? start_kernel+0x334/0x334
    [ 7.231223] [] kernel_thread_helper+0x6/0x10

    Signed-off-by: Octavian Purdila
    Signed-off-by: Ram Pai
    Cc: Jesse Barnes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Octavian Purdila
     
  • Addresses https://bugzilla.kernel.org/show_bug.cgi?id=44621

    Reported-by:
    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • register_sysctl_table() is a strange function, as it makes internal
    allocations (a header) to register a sysctl_table. This header is a
    handle to the table that is created, and can be used to unregister the
    table. But if the table is permanent and never unregistered, the header
    acts the same as a static variable.

    Unfortunately, this allocation of memory that is never expected to be
    freed fools kmemleak in thinking that we have leaked memory. For those
    sysctl tables that are never unregistered, and have no pointer referencing
    them, kmemleak will think that these are memory leaks:

    unreferenced object 0xffff880079fb9d40 (size 192):
    comm "swapper/0", pid 0, jiffies 4294667316 (age 12614.152s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x73/0x98
    [] kmemleak_alloc_recursive.constprop.42+0x16/0x18
    [] __kmalloc+0x107/0x153
    [] kzalloc.constprop.8+0xe/0x10
    [] __register_sysctl_paths+0xe1/0x160
    [] register_sysctl_paths+0x1b/0x1d
    [] register_sysctl_table+0x18/0x1a
    [] sysctl_init+0x10/0x14
    [] proc_sys_init+0x2f/0x31
    [] proc_root_init+0xa5/0xa7
    [] start_kernel+0x3d0/0x40a
    [] x86_64_start_reservations+0xae/0xb2
    [] x86_64_start_kernel+0x102/0x111
    [] 0xffffffffffffffff

    The sysctl_base_table used by sysctl itself is one such instance that
    registers the table to never be unregistered.

    Use kmemleak_not_leak() to suppress the kmemleak false positive.

    Signed-off-by: Steven Rostedt
    Acked-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
    The last line of the vmcoreinfo note does not end with \n. Parsing all
    the lines in the note becomes easier if every line ends with \n instead
    of special-casing the last one.

    I know of at least one tool, vmcore-dmesg in the kexec-tools tree, which
    assumes that all lines end with \n. I think it is a good idea to fix
    this.

    Signed-off-by: Vivek Goyal
    Cc: "Eric W. Biederman"
    Cc: Atsushi Kumagai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vivek Goyal
     
  • The function dup_task() may fail at the following function calls in the
    following order.

    0) alloc_task_struct_node()
    1) alloc_thread_info_node()
    2) arch_dup_task_struct()

    A failure at 0) is not a problem; dup_task() can simply return. But a
    failure at 1) requires releasing the task_struct allocated by 0) before
    returning. Likewise, a failure at 2) requires releasing both the
    task_struct and the thread_info allocated by 0) and 1).

    The existing error handling calls free_task_struct() and
    free_thread_info(), which not only release task_struct and thread_info
    but also call the architecture-specific arch_release_task_struct() and
    arch_release_thread_info().

    The problem is that task_struct and thread_info are not fully initialized
    yet at this point, but arch_release_task_struct() and
    arch_release_thread_info() are called with them.

    For example, x86 defines its own arch_release_task_struct() that releases
    a task_xstate. If alloc_thread_info_node() fails in dup_task(),
    arch_release_task_struct() is called with task_struct which is just
    allocated and filled with garbage in this error handling.

    This actually happened with tools/testing/fault-injection/failcmd.sh

    # env FAILCMD_TYPE=fail_page_alloc \
    ./tools/testing/fault-injection/failcmd.sh --times=100 \
    --min-order=0 --ignore-gfp-wait=0 \
    -- make -C tools/testing/selftests/ run_tests

    In order to fix this issue, make free_{task_struct,thread_info}() not
    call arch_release_{task_struct,thread_info}(), and instead call
    arch_release_{task_struct,thread_info}() explicitly where needed.

    The default arch_release_task_struct() and arch_release_thread_info()
    are empty, so this change only affects the architectures that implement
    their own arch_release_task_struct() or arch_release_thread_info(), as
    listed below.

    arch_release_task_struct(): x86, sh
    arch_release_thread_info(): mn10300, tile

    Signed-off-by: Akinobu Mita
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: David Howells
    Cc: Koichi Yasutake
    Cc: Paul Mundt
    Cc: Chris Metcalf
    Cc: Salman Qazi
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • To make way for "fork: fix error handling in dup_task()", which fixes the
    errors more completely.

    Cc: Salman Qazi
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
    The open-coded page count calculation can be replaced by vma_pages();
    use it to simplify the code.

    [akpm@linux-foundation.org: initialise `len' at its definition site]
    Signed-off-by: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • The system deadlocks (at least since 2.6.10) when
    call_usermodehelper(UMH_WAIT_EXEC) request triggers
    call_usermodehelper(UMH_WAIT_PROC) request.

    This is because the khelper thread is waiting for the worker thread at
    wait_for_completion() in do_fork() (the worker thread was created with
    the CLONE_VFORK flag), the worker thread cannot call complete() because
    its do_execve() is blocked on the UMH_WAIT_PROC request, and the khelper
    thread cannot start processing the UMH_WAIT_PROC request because it is
    still waiting for the worker thread at wait_for_completion() in
    do_fork().

    The easiest way to observe this deadlock is to use a corrupted
    /sbin/hotplug binary, as shown below.

    # : > /tmp/dummy
    # chmod 755 /tmp/dummy
    # echo /tmp/dummy > /proc/sys/kernel/hotplug
    # modprobe whatever

    call_usermodehelper("/tmp/dummy", UMH_WAIT_EXEC) is called from
    kobject_uevent_env() in lib/kobject_uevent.c upon loading/unloading a
    module. do_execve("/tmp/dummy") triggers a call to
    request_module("binfmt-0000") from search_binary_handler() which in turn
    calls call_usermodehelper(UMH_WAIT_PROC).

    In order to avoid the deadlock, as a for-now and easy-to-backport
    solution, do not call wait_for_completion() in
    call_usermodehelper_exec() if the worker thread was created by the
    khelper thread with the CLONE_VFORK flag. A future, more fundamental
    solution might be replacing the singleton khelper thread with a
    workqueue so that recursive calls up to the max_active dependency depth
    can be handled without deadlock.

    [akpm@linux-foundation.org: add comment to kmod_thread_locker]
    Signed-off-by: Tetsuo Handa
    Cc: Arjan van de Ven
    Acked-by: Rusty Russell
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • This function's interface is, uh, subtle. Attempt to apologise for it.

    Cc: WANG Cong
    Cc: Cyrill Gorcunov
    Cc: Kees Cook
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Alan Cox
    Cc: Oleg Nesterov
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • vprintk_emit() prefix parsing should only be done for internal kernel
    messages. This allows existing behavior to be kept in all cases.

    Signed-off-by: Joe Perches
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
    The current form of a KERN_<LEVEL> header is "<L>", where L is the
    level character.

    Add printk_get_level and printk_skip_level functions to handle these
    formats.

    These functions centralize tests of the KERN_<LEVEL> prefix so a future
    modification can change the KERN_<LEVEL> style and shorten the number
    of bytes consumed by these headers.

    [akpm@linux-foundation.org: fix build error and warning]
    Signed-off-by: Joe Perches
    Cc: Kay Sievers
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Reported-by: Andrew Morton
    Signed-off-by: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kay Sievers
     
  • If argv_split() failed, the code will end up calling argv_free(NULL). Fix
    it up and clean things up a bit.

    Addresses Coverity report 703573.

    Cc: Cyrill Gorcunov
    Cc: Kees Cook
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: WANG Cong
    Cc: Alan Cox
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • On the suspend/resume path the boot CPU does not go though an
    offline->online transition. This breaks the NMI detector post-resume
    since it depends on PMU state that is lost when the system gets
    suspended.

    Fix this by forcing a CPU offline->online transition for the lockup
    detector on the boot CPU during resume.

    To provide more context, we enable NMI watchdog on Chrome OS. We have
    seen several reports of systems freezing up completely which indicated
    that the NMI watchdog was not firing for some reason.

    Debugging further, we found a simple way of reproducing the system
    freezes: issue the command 'taskset 1 sh -c "echo nmilockup >
    /proc/breakme"' after the system has been suspended/resumed one or more
    times.

    With this patch in place, the system freeze results in a panic, as
    expected.

    These panics provide a nice stack trace for us to debug the actual issue
    causing the freeze.

    [akpm@linux-foundation.org: fiddle with code comment]
    [akpm@linux-foundation.org: make lockup_detector_bootcpu_resume() conditional on CONFIG_SUSPEND]
    [akpm@linux-foundation.org: fix section errors]
    Signed-off-by: Sameer Nanda
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Don Zickus
    Cc: Mandeep Singh Baines
    Cc: Srivatsa S. Bhat
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sameer Nanda
     
  • panic_lock is meant to ensure that panic processing takes place only on
    one cpu; if any of the other cpus encounter a panic, they will spin
    waiting to be shut down.

    However, this causes a regression in this scenario:

    1. Cpu 0 encounters a panic and acquires the panic_lock
    and proceeds with the panic processing.
    2. There is an interrupt on cpu 0 that also encounters
    an error condition and invokes panic.
    3. This second invocation fails to acquire the panic_lock
    and enters the infinite while loop in panic_smp_self_stop.

    Thus all panic processing is stopped, and the cpu is stuck for eternity
    in the while(1) inside panic_smp_self_stop.

    To address this, disable local interrupts with local_irq_disable before
    acquiring the panic_lock. This will prevent interrupt handlers from
    executing during the panic processing, thus avoiding this particular
    problem.

    Signed-off-by: Vikram Mulukutla
    Reviewed-by: Stephen Boyd
    Cc: Michael Holzheu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vikram Mulukutla
     
  • When suid_dumpable=2, detect unsafe core_pattern settings and warn when
    they are seen.

    Signed-off-by: Kees Cook
    Suggested-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Alan Cox
    Cc: "Eric W. Biederman"
    Cc: Doug Ledford
    Cc: Serge Hallyn
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
    On failure, setting "error" to the error number is enough; there is no
    need to set "error" to zero in each switch case, since it is already
    initialized to zero. Also remove the "return 0" statements in the
    switch cases in favor of break statements.

    Signed-off-by: Sasikantha babu
    Acked-by: Kees Cook
    Acked-by: Serge E. Hallyn
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasikantha babu
     

27 Jul, 2012

4 commits

    Recently, glibc made a change to suppress sign-conversion warnings in
    FD_SET (glibc commit ceb9e56b3d1). This uncovered an issue with the
    kernel's definition of __NFDBITS if applications #include
    <linux/types.h> after including <sys/select.h>. A build failure would
    be seen when passing the -Werror=sign-compare and -D_FORTIFY_SOURCE=2
    flags to gcc.

    It was suggested that the kernel should either match the glibc
    definition of __NFDBITS or remove that entirely. The current in-kernel
    uses of __NFDBITS can be replaced with BITS_PER_LONG, and there are no
    uses of the related __FDELT and __FDMASK defines. Given that, we'll
    continue the cleanup that was started with commit 8b3d1cda4f5f
    ("posix_types: Remove fd_set macros") and drop the remaining unused
    macros.

    Additionally, linux/time.h has similar macros defined that expand to
    nothing so we'll remove those at the same time.

    Reported-by: Jeff Law
    Suggested-by: Linus Torvalds
    CC:
    Signed-off-by: Josh Boyer
    [ .. and fix up whitespace as per akpm ]
    Signed-off-by: Linus Torvalds

    Josh Boyer
     
  • Pull scheduler changes from Ingo Molnar:
    "The biggest change is a performance improvement on SMP systems:

    | 4 socket 40 core + SMT Westmere box, single 30 sec tbench
    | runs, higher is better:
    |
    | clients 1 2 4 8 16 32 64 128
    |..........................................................................
    | pre 30 41 118 645 3769 6214 12233 14312
    | post 299 603 1211 2418 4697 6847 11606 14557
    |
    | A nice increase in performance.

    The speedup is particularly noticeable on heavily interacting
    few-task workloads, so the changes should help desktop-style Xorg
    workloads and interactivity as well, on multi-core CPUs.

    There are also cpuset suspend behavior fixes/restructuring and various
    smaller tweaks."

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Fix race in task_group()
    sched: Improve balance_cpu() to consider other cpus in its group as target of (pinned) task
    sched: Reset loop counters if all tasks are pinned and we need to redo load balance
    sched: Reorder 'struct lb_env' members to reduce its size
    sched: Improve scalability via 'CPU buddies', which withstand random perturbations
    cpusets: Remove/update outdated comments
    cpusets, hotplug: Restructure functions that are invoked during hotplug
    cpusets, hotplug: Implement cpuset tree traversal in a helper function
    CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume
    sched/x86: Remove broken power estimation

    Linus Torvalds
     
  • Pull driver core changes from Greg Kroah-Hartman:
    "Here's the big driver core pull request for 3.6-rc1.

    Unlike 3.5, this kernel should be a lot tamer, with the printk changes
    now settled down. All we have here is some extcon driver updates, w1
    driver updates, a few printk cleanups that weren't needed for 3.5, but
    are good to have now, and some other minor fixes/changes in the driver
    core.

    All of these have been in the linux-next releases for a while now.

    Signed-off-by: Greg Kroah-Hartman "

    * tag 'driver-core-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (38 commits)
    printk: Export struct log size and member offsets through vmcoreinfo
    Drivers: hv: Change the hex constant to a decimal constant
    driver core: don't trigger uevent after failure
    extcon: MAX77693: Add extcon-max77693 driver to support Maxim MAX77693 MUIC device
    sysfs: fail dentry revalidation after namespace change fix
    sysfs: fail dentry revalidation after namespace change
    extcon: spelling of detach in function doc
    extcon: arizona: Stop microphone detection if we give up on it
    extcon: arizona: Update cable reporting calls and split headset
    PM / Runtime: Do not increment device usage counts before probing
    kmsg - do not flush partial lines when the console is busy
    kmsg - export "continuation record" flag to /dev/kmsg
    kmsg - avoid warning for CONFIG_PRINTK=n compilations
    kmsg - properly print over-long continuation lines
    driver-core: Use kobj_to_dev instead of re-implementing it
    driver-core: Move kobj_to_dev from genhd.h to device.h
    driver core: Move deferred devices to the end of dpm_list before probing
    driver core: move uevent call to driver_register
    driver core: fix shutdown races with probe/remove(v3)
    Extcon: Arizona: Add driver for Wolfson Arizona class devices
    ...

    Linus Torvalds
     
  • Pull staging tree patches from Greg Kroah-Hartman:
    "Here's the big staging tree merge for the 3.6-rc1 merge window.

    There are some patches in here outside of drivers/staging/, notably
    the iio code (which is still straddling the staging / not staging
    boundary), the pstore code, and the tracing code. All of these have
    gotten acks from the various subsystem maintainers to be included in
    this tree. The pstore and tracing patches are related, and are coming
    here as they replace one of the android staging drivers.

    Otherwise, the normal staging mess. Lots of cleanups and a few new
    drivers (some iio drivers, and the large csr wireless driver
    abomination.)

    Signed-off-by: Greg Kroah-Hartman "

    Fixed up trivial conflicts in drivers/staging/comedi/drivers/s626.h and
    drivers/staging/gdm72xx/netlink_k.c

    * tag 'staging-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (1108 commits)
    staging: csr: delete a bunch of unused library functions
    staging: csr: remove csr_utf16.c
    staging: csr: remove csr_pmem.h
    staging: csr: remove CsrPmemAlloc
    staging: csr: remove CsrPmemFree()
    staging: csr: remove CsrMemAllocDma()
    staging: csr: remove CsrMemCalloc()
    staging: csr: remove CsrMemAlloc()
    staging: csr: remove CsrMemFree() and CsrMemFreeDma()
    staging: csr: remove csr_util.h
    staging: csr: remove CsrOffSetOf()
    stating: csr: remove unneeded #includes in csr_util.c
    staging: csr: make CsrUInt16ToHex static
    staging: csr: remove CsrMemCpy()
    staging: csr: remove CsrStrLen()
    staging: csr: remove CsrVsnprintf()
    staging: csr: remove CsrStrDup
    staging: csr: remove CsrStrChr()
    staging: csr: remove CsrStrNCmp
    staging: csr: remove CsrStrCmp
    ...

    Linus Torvalds
     

25 Jul, 2012

6 commits

  • Pull first round of SCSI updates from James Bottomley:
    "The most important feature of this patch set is the new async
    infrastructure that makes sure async_synchronize_full() synchronizes
    all domains and allows us to remove all the hacks (like having
    scsi_complete_async_scans() in the device base code) and means that
    the async infrastructure will "just work" in future.

    The rest is assorted driver updates (aacraid, bnx2fc, virtio-scsi,
    megaraid, bfa, lpfc, qla2xxx, qla4xxx) plus a lot of infrastructure
    work in SAS and FC.

    Signed-off-by: James Bottomley "

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (97 commits)
    [SCSI] Revert "[SCSI] fix async probe regression"
    [SCSI] cleanup usages of scsi_complete_async_scans
    [SCSI] queue async scan work to an async_schedule domain
    [SCSI] async: make async_synchronize_full() flush all work regardless of domain
    [SCSI] async: introduce 'async_domain' type
    [SCSI] bfa: Fix to set correct return error codes and misc cleanup.
    [SCSI] aacraid: Series 7 Async. (performance) mode support
    [SCSI] aha152x: Allow use on 64bit systems
    [SCSI] virtio-scsi: Add vdrv->scan for post VIRTIO_CONFIG_S_DRIVER_OK LUN scanning
    [SCSI] bfa: squelch lockdep complaint with a spin_lock_init
    [SCSI] qla2xxx: remove unnecessary reads of PCI_CAP_ID_EXP
    [SCSI] qla4xxx: remove unnecessary read of PCI_CAP_ID_EXP
    [SCSI] ufs: fix incorrect return value about SUCCESS and FAILED
    [SCSI] ufs: reverse the ufshcd_is_device_present logic
    [SCSI] ufs: use module_pci_driver
    [SCSI] usb-storage: update usb devices for write cache quirk in quirk list.
    [SCSI] usb-storage: add support for write cache quirk
    [SCSI] set to WCE if usb cache quirk is present.
    [SCSI] virtio-scsi: hotplug support for virtio-scsi
    [SCSI] virtio-scsi: split scatterlist per target
    ...

    Linus Torvalds
     
  • Pull cgroup changes from Tejun Heo:
    "Nothing too interesting. A minor bug fix and some cleanups."

    * 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: Update remount documentation
    cgroup: cgroup_rm_files() was calling simple_unlink() with the wrong inode
    cgroup: Remove populate() documentation
    cgroup: remove hierarchy_mutex

    Linus Torvalds
     
  • Pull workqueue changes from Tejun Heo:
    "There are three major changes.

    - WQ_HIGHPRI has been reimplemented so that high priority work items
    are served by worker threads with -20 nice value from dedicated
    highpri worker pools.

    - CPU hotplug support has been reimplemented such that idle workers
    are kept across CPU hotplug events. This makes CPU hotplug cheaper
    (for PM) and makes the code simpler.

    - flush_kthread_work() has been reimplemented so that a work item can
    be freed while executing. This removes an annoying behavior
    difference between kthread_worker and workqueue."

    * 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: fix spurious CPU locality WARN from process_one_work()
    kthread_worker: reimplement flush_kthread_work() to allow freeing the work item being executed
    kthread_worker: reorganize to prepare for flush_kthread_work() reimplementation
    workqueue: simplify CPU hotplug code
    workqueue: remove CPU offline trustee
    workqueue: don't butcher idle workers on an offline CPU
    workqueue: reimplement CPU online rebinding to handle idle workers
    workqueue: drop @bind from create_worker()
    workqueue: use mutex for global_cwq manager exclusion
    workqueue: ROGUE workers are UNBOUND workers
    workqueue: drop CPU_DYING notifier operation
    workqueue: perform cpu down operations from low priority cpu_notifier()
    workqueue: reimplement WQ_HIGHPRI using a separate worker_pool
    workqueue: introduce NR_WORKER_POOLS and for_each_worker_pool()
    workqueue: separate out worker_pool flags
    workqueue: use @pool instead of @gcwq or @cpu where applicable
    workqueue: factor out worker_pool from global_cwq
    workqueue: don't use WQ_HIGHPRI for unbound workqueues

    Linus Torvalds
     
  • Pull PCI changes from Bjorn Helgaas:
    "Host bridge hotplug:
    - Add MMCONFIG support for hot-added host bridges (Jiang Liu)
    Device hotplug:
    - Move fixups from __init to __devinit (Sebastian Andrzej Siewior)
    - Call FINAL fixups for hot-added devices, too (Myron Stowe)
    - Factor out generic code for P2P bridge hot-add (Yinghai Lu)
    - Remove all functions in a slot, not just those with _EJx (Amos
    Kong)
    Dynamic resource management:
    - Track bus number allocation (struct resource tree per domain)
    (Yinghai Lu)
    - Make P2P bridge 1K I/O windows work with resource reassignment
    (Bjorn Helgaas, Yinghai Lu)
    - Disable decoding while updating 64-bit BARs (Bjorn Helgaas)
    Power management:
    - Add PCIe runtime D3cold support (Huang Ying)
    Virtualization:
    - Add VFIO infrastructure (ACS, DMA source ID quirks) (Alex
    Williamson)
    - Add quirks for devices with broken INTx masking (Jan Kiszka)
    Miscellaneous:
    - Fix some PCI Express capability version issues (Myron Stowe)
    - Factor out some arch code with a weak, generic, pcibios_setup()
    (Myron Stowe)"

    * tag 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (122 commits)
    PCI: hotplug: ensure a consistent return value in error case
    PCI: fix undefined reference to 'pci_fixup_final_inited'
    PCI: build resource code for M68K architecture
    PCI: pciehp: remove unused pciehp_get_max_lnk_width(), pciehp_get_cur_lnk_width()
    PCI: reorder __pci_assign_resource() (no change)
    PCI: fix truncation of resource size to 32 bits
    PCI: acpiphp: merge acpiphp_debug and debug
    PCI: acpiphp: remove unused res_lock
    sparc/PCI: replace pci_cfg_fake_ranges() with pci_read_bridge_bases()
    PCI: call final fixups hot-added devices
    PCI: move final fixups from __init to __devinit
    x86/PCI: move final fixups from __init to __devinit
    MIPS/PCI: move final fixups from __init to __devinit
    PCI: support sizing P2P bridge I/O windows with 1K granularity
    PCI: reimplement P2P bridge 1K I/O windows (Intel P64H2)
    PCI: disable MEM decoding while updating 64-bit MEM BARs
    PCI: leave MEM and IO decoding disabled during 64-bit BAR sizing, too
    PCI: never discard enable/suspend/resume_early/resume fixups
    PCI: release temporary reference in __nv_msi_ht_cap_quirk()
    PCI: restructure 'pci_do_fixups()'
    ...

    Linus Torvalds
     
  • Pull devicetree updates from Rob Herring:
    "A small set of changes for devicetree:
    - Couple of Documentation fixes
    - Addition of new helper function of_node_full_name
    - Improve of_parse_phandle_with_args return values
    - Some NULL related sparse fixes"

    Grant's busy packing.

    * tag 'dt-for-3.6' of git://sources.calxeda.com/kernel/linux:
    of: mtd: nuke useless const qualifier
    devicetree: add helper inline for retrieving a node's full name
    of: return -ENOENT when no property
    usage-model.txt: fix typo machine_init->init_machine
    of: Fix null pointer related warnings in base.c file
    LED: Fix missing semicolon in OF documentation
    of: fix a few typos in the binding documentation

    Linus Torvalds
     
  • Pull networking changes from David S. Miller:

    1) Remove the ipv4 routing cache. Now lookups go directly into the FIB
    trie and use prebuilt routes cached there.

    No more garbage collection, no more rDOS attacks on the routing
    cache. Instead we now get predictable and consistent performance,
    no matter what the pattern of traffic we service.

    This has been almost 2 years in the making. Special thanks to
    Julian Anastasov, Eric Dumazet, Steffen Klassert, and others who
    have helped along the way.

    I'm sure that with a change of this magnitude there will be some
    kind of fallout, but such things ought to be simple to fix at this
    point. Luckily I'm not European so I'll be around all of August to
    fix things :-)

    The major stages of this work here are each fronted by a forced
    merge commit whose commit message contains a top-level description
    of the motivations and implementation issues.

    2) Pre-demux of established ipv4 TCP sockets, saves a route demux on
    input.

    3) TCP SYN/ACK performance tweaks from Eric Dumazet.

    4) Add namespace support for netfilter L4 conntrack helpers, from Gao
    Feng.

    5) Add config mechanism for Energy Efficient Ethernet to ethtool, from
    Yuval Mintz.

    6) Remove quadratic behavior from /proc/net/unix, from Eric Dumazet.

    7) Support for connection tracker helpers in userspace, from Pablo
    Neira Ayuso.

    8) Allow userspace driven TX load balancing functions in TEAM driver,
    from Jiri Pirko.

    9) Kill off NLMSG_PUT and RTA_PUT macros, more gross stuff with
    embedded gotos.

    10) TCP Small Queues, essentially minimize the amount of TCP data queued
    up in the packet scheduler layer. Whereas the existing BQL (Byte
    Queue Limits) limits the pkt_sched --> netdevice queuing levels,
    this controls the TCP --> pkt_sched queueing levels.

    From Eric Dumazet.
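    The per-socket TSQ limit described above is runtime-tunable through a
    sysctl; a hypothetical tuning fragment (131072 bytes was the default
    when TSQ was merged):

```
# /etc/sysctl.conf fragment (illustrative)
net.ipv4.tcp_limit_output_bytes = 131072
```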

    11) Reduce the number of get_page/put_page ops done on SKB fragments,
    from Alexander Duyck.

    12) Implement protection against blind resets in TCP (RFC 5961), from
    Eric Dumazet.

    13) Support the client side of TCP Fast Open, basically the ability to
    send data in the SYN exchange, from Yuchung Cheng.

    The sender queues up data with a sendmsg() call using MSG_FASTOPEN,
    then does the connect(), which emits the queued-up fastopen data.

    14) Avoid all the problems we get into in TCP when timers or PMTU events
    hit a locked socket. The TCP Small Queues changes added a
    tcp_release_cb() that allows us to queue work up to the
    release_sock() caller, and that's what we use here too. From Eric
    Dumazet.

    15) Zero copy on TX support for TUN driver, from Michael S. Tsirkin.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1870 commits)
    genetlink: define lockdep_genl_is_held() when CONFIG_LOCKDEP
    r8169: revert "add byte queue limit support".
    ipv4: Change rt->rt_iif encoding.
    net: Make skb->skb_iif always track skb->dev
    ipv4: Prepare for change of rt->rt_iif encoding.
    ipv4: Remove all RTCF_DIRECTSRC handliing.
    ipv4: Really ignore ICMP address requests/replies.
    decnet: Don't set RTCF_DIRECTSRC.
    net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse.
    ipv4: Remove redundant assignment
    rds: set correct msg_namelen
    openvswitch: potential NULL deref in sample()
    tcp: dont drop MTU reduction indications
    bnx2x: Add new 57840 device IDs
    tcp: avoid oops in tcp_metrics and reset tcpm_stamp
    niu: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value
    niu: Fix to check for dma mapping errors.
    net: Fix references to out-of-scope variables in put_cmsg_compat()
    net: ethernet: davinci_emac: add pm_runtime support
    net: ethernet: davinci_emac: Remove unnecessary #include
    ...

    Linus Torvalds
     

24 Jul, 2012

10 commits

    Stefan reported a crash on a kernel before a3e5d1091c1 ("sched:
    Don't call task_group() too many times in set_task_rq()"); he
    found the reason to be that the multiple task_group()
    invocations in set_task_rq() returned different values.

    Looking at all that I found a lack of serialization and plain
    wrong comments.

    The below tries to fix it using an extra pointer which is
    updated under the appropriate scheduler locks. It's not pretty,
    but I can't really see another way given how all the cgroup
    stuff works.

    Reported-and-tested-by: Stefan Bader
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Current load balance scheme requires only one cpu in a
    sched_group (balance_cpu) to look at other peer sched_groups for
    imbalance and pull tasks towards itself from a busy cpu. Tasks
    thus pulled by balance_cpu could later get picked up by cpus
    that are in the same sched_group as that of balance_cpu.

    This scheme however fails to pull tasks that are not allowed to
    run on balance_cpu (but are allowed to run on other cpus in its
    sched_group). That can affect fairness and in some worst case
    scenarios cause starvation.

    Consider a two core (2 threads/core) system running tasks as
    below:

           Core0            Core1
          /     \          /     \
        C0       C1      C2       C3
        |        |       |        |
        v        v       v        v
        F0       T1      F1     [idle]
                 T2

    F0 = SCHED_FIFO task (pinned to C0)
    F1 = SCHED_FIFO task (pinned to C2)
    T1 = SCHED_OTHER task (pinned to C1)
    T2 = SCHED_OTHER task (pinned to C1 and C2)

    F1 could become a cpu hog, which will starve T2 unless C1 pulls
    it. Between C0 and C1 however, C0 is required to look for
    imbalance between cores, which will fail to pull T2 towards
    Core0. T2 will starve eternally in this case. The same scenario
    can arise in presence of non-rt tasks as well (say we replace F1
    with high irq load).

    We tackle this problem by having balance_cpu move pinned tasks
    to one of its sibling cpus (where they can run). We first check
    if load balance goal can be met by ignoring pinned tasks,
    failing which we retry move_tasks() with a new env->dst_cpu.

    This patch modifies load balance semantics on who can move load
    towards a given cpu in a given sched_domain.

    Before this patch, a given_cpu or an ilb_cpu acting on behalf of
    an idle given_cpu is responsible for moving load to given_cpu.

    With this patch applied, balance_cpu can in addition decide on
    moving some load to a given_cpu.

    There is a remote possibility that excess load could get moved
    as a result of this (balance_cpu and given_cpu/ilb_cpu deciding
    *independently* and at the *same* time to move some load to a
    given_cpu). However, we should see fewer such conflicting
    decisions in practice, and moreover subsequent load balance
    cycles should correct the excess load moved to given_cpu.

    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Prashanth Nageshappa
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4FE06CDB.2060605@linux.vnet.ibm.com
    [ minor edits ]
    Signed-off-by: Ingo Molnar

    Srivatsa Vaddagiri
     
  • While load balancing, if all tasks on the source runqueue are pinned,
    we retry after excluding the corresponding source cpu. However, loop counters
    env.loop and env.loop_break are not reset before retrying, which can lead
    to failure in moving the tasks. In this patch we reset env.loop and
    env.loop_break to their initial values before we retry.

    Signed-off-by: Prashanth Nageshappa
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4FE06EEF.2090709@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Prashanth Nageshappa
     
  • Members of 'struct lb_env' are not in appropriate order to reuse compiler
    added padding on 64bit architectures. In this patch we reorder those struct
    members and help reduce the size of the structure from 96 bytes to 80
    bytes on 64 bit architectures.

    Suggested-by: Srivatsa Vaddagiri
    Signed-off-by: Prashanth Nageshappa
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4FE06DDE.7000403@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Prashanth Nageshappa
     
    Traversing an entire package is not only expensive, it also leads to tasks
    bouncing all over a partially idle and possibly quite large package. Fix
    that up by assigning a 'buddy' CPU to try to motivate. Each buddy may try
    to motivate that one other CPU; if it's busy, tough, it may then try its
    SMT sibling, but that's all this optimization is allowed to cost.

    Sibling cache buddies are cross-wired to prevent bouncing.

    4 socket 40 core + SMT Westmere box, single 30 sec tbench runs, higher is better:

    clients     1     2     4     8     16     32     64    128
    ..............................................................
    pre        30    41   118   645   3769   6214  12233  14312
    post      299   603  1211  2418   4697   6847  11606  14557

    A nice increase in performance.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1339471112.7352.32.camel@marge.simpson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
    cpuset_track_online_cpus() is no longer present. So remove the
    outdated comment and replace it with a reference to
    cpuset_update_active_cpus(), which is its equivalent.

    Also, we don't lack memory hot-unplug anymore. And David Rientjes pointed
    out how it is dealt with. So update that comment as well.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141700.3692.98192.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     
  • Separate out the cpuset related handling for CPU/Memory online/offline.
    This also helps us exploit the most obvious and basic level of optimization
    that any notification mechanism (CPU/Mem online/offline) has to offer us:
    "We *know* why we have been invoked. So stop pretending that we are lost,
    and do only the necessary amount of processing!".

    And while at it, rename scan_for_empty_cpusets() to
    scan_cpusets_upon_hotplug(), which is more appropriate considering how
    it is restructured.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141650.3692.48637.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     
  • At present, the functions that deal with cpusets during CPU/Mem hotplug
    are quite messy, since a lot of the functionality is mixed up without clear
    separation. And this takes a toll on optimization as well. For example,
    the function cpuset_update_active_cpus() is called on both CPU offline and CPU
    online events; and it invokes scan_for_empty_cpusets(), which makes sense
    only for CPU offline events. And hence, the current code ends up unnecessarily
    traversing the cpuset tree during CPU online also.

    As a first step towards cleaning up those functions, encapsulate the cpuset
    tree traversal in a helper function, so as to facilitate upcoming changes.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141635.3692.893.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     
  • In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
    masks as and when necessary to ensure that the tasks belonging to the cpusets
    have some place (online CPUs) to run on. And regular CPU hotplug is
    destructive in the sense that the kernel doesn't remember the original cpuset
    configurations set by the user, across hotplug operations.

    However, suspend/resume (which uses CPU hotplug) is a special case in which
    the kernel has the responsibility to restore the system (during resume), to
    exactly the same state it was in before suspend.

    In order to achieve that, do the following:

    1. Don't modify cpusets during suspend/resume. At all.
    In particular, don't move the tasks from one cpuset to another, and
    don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
    during the CPU hotplug operations that are carried out in the
    suspend/resume path.

    2. However, cpusets and sched domains are related. We just want to avoid
    altering cpusets alone. So, to keep the sched domains updated, build
    a single sched domain (containing all active cpus) during each of the
    CPU hotplug operations carried out in s/r path, effectively ignoring
    the cpusets' cpus_allowed masks.

    (Since userspace is frozen while doing all this, it will go unnoticed.)

    3. During the last CPU online operation during resume, build the sched
    domains by looking up the (unaltered) cpusets' cpus_allowed masks.
    That will bring back the system to the same original state as it was in
    before suspend.

    Ultimately, this will not only solve the cpuset problem related to suspend/
    resume (i.e., it restores the cpusets to exactly what they were before
    suspend, by not touching them at all) but also speeds up suspend/resume
    because we avoid running cpuset update code for every CPU being
    offlined/onlined.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141611.3692.20155.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     
  • Pull the big VFS changes from Al Viro:
    "This one is *big* and changes quite a few things around VFS. What's in there:

    - the first of two really major architecture changes - death to open
    intents.

    The former is finally there; it was very long in the making, but
    with Miklos getting through a really hard and messy final push in
    fs/namei.c, we finally have it. Unlike his variant, this one
    doesn't introduce struct opendata; what we have instead is
    ->atomic_open() taking preallocated struct file * and passing
    everything via its fields.

    Instead of returning struct file *, it returns -E... on error, 0
    on success and 1 in the "deal with it yourself" case (e.g. symlink
    found on server, etc.).

    See comments before fs/namei.c:atomic_open(). That made a lot of
    goodies finally possible and quite a few are in that pile:
    ->lookup(), ->d_revalidate() and ->create() do not get struct
    nameidata * anymore; ->lookup() and ->d_revalidate() get lookup
    flags instead, ->create() gets "do we want it exclusive" flag.

    With the introduction of new helper (kern_path_locked()) we are rid
    of all struct nameidata instances outside of fs/namei.c; it's still
    visible in namei.h, but not for long. Come the next cycle,
    declaration will move either to fs/internal.h or to fs/namei.c
    itself. [me, miklos, hch]

    - The second major change: behaviour of final fput(). Now we have
    __fput() done without any locks held by caller *and* not from deep
    in call stack.

    That obviously lifts a lot of constraints on the locking in there.
    Moreover, it's legal now to call fput() from atomic contexts (which
    has immediately simplified life for aio.c). We also don't need
    anti-recursion logics in __scm_destroy() anymore.

    There is a price, though - the damn thing has become partially
    asynchronous. For fput() from a normal process we are guaranteed
    that the pending __fput() will be done before the caller returns to
    userland, exits or gets stopped for ptrace.

    For kernel threads and atomic contexts it's done via
    schedule_work(), so theoretically we might need a way to make sure
    it's finished; so far only one such place had been found, but there
    might be more.

    There's flush_delayed_fput() (do all pending __fput()) and there's
    __fput_sync() (fput() analog doing __fput() immediately). I hope
    we won't need them often; see warnings in fs/file_table.c for
    details. [me, based on task_work series from Oleg merged last
    cycle]

    - sync series from Jan

    - large part of "death to sync_supers()" work from Artem; the only
    bits missing here are exofs and ext4 ones. As far as I understand,
    those are going via the exofs and ext4 trees resp.; once they are
    in, we can put ->write_super() to the rest, along with the thread
    calling it.

    - preparatory bits from unionmount series (from dhowells).

    - assorted cleanups and fixes all over the place, as usual.

    This is not the last pile for this cycle; there's at least jlayton's
    ESTALE work and fsfreeze series (the latter - in dire need of fixes,
    so I'm not sure it'll make the cut this cycle). I'll probably throw
    symlink/hardlink restrictions stuff from Kees into the next pile, too.
    Plus there's a lot of misc patches I hadn't thrown into that one -
    it's large enough as it is..."

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits)
    ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
    btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
    switch dentry_open() to struct path, make it grab references itself
    spufs: shift dget/mntget towards dentry_open()
    zoran: don't bother with struct file * in zoran_map
    ecryptfs: don't reinvent the wheels, please - use struct completion
    don't expose I_NEW inodes via dentry->d_inode
    tidy up namei.c a bit
    unobfuscate follow_up() a bit
    ext3: pass custom EOF to generic_file_llseek_size()
    ext4: use core vfs llseek code for dir seeks
    vfs: allow custom EOF in generic_file_llseek code
    vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
    vfs: Remove unnecessary flushing of block devices
    vfs: Make sys_sync writeout also block device inodes
    vfs: Create function for iterating over block devices
    vfs: Reorder operations during sys_sync
    quota: Move quota syncing to ->sync_fs method
    quota: Split dquot_quota_sync() to writeback and cache flushing part
    vfs: Move noop_backing_dev_info check from sync into writeback
    ...

    Linus Torvalds