06 May, 2017

1 commit

  • Whole point of randomization was to hide server uptime, but an attacker
    can simply start a syn flood and TCP generates 'old style' timestamps,
    directly revealing server jiffies value.

    Also, the TSval sent by the server to a particular remote address varies
    depending on whether syncookies are being sent, potentially triggering
    PAWS drops for innocent clients.

    Let's implement proper randomization, including for SYNcookies.

    Also we do not need to export sysctl_tcp_timestamps, since it is not
    used from a module.

    In v2, I added Florian's feedback and contribution, adding tsoff to
    tcp_get_cookie_sock().

    v3 removed one unused variable in tcp_v4_connect() as Florian spotted.

    Fixes: 95a22caee396c ("tcp: randomize tcp timestamp offsets for each connection")
    Signed-off-by: Eric Dumazet
    Reviewed-by: Florian Westphal
    Tested-by: Florian Westphal
    Cc: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
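
    A minimal user-space sketch of the idea described above, not the kernel
    code: the timestamp offset is derived from a secret and the address pair
    only, so the same offset applies whether or not a syncookie is sent. The
    names ts_secret, mix64(), ts_offset() and tsval() are hypothetical, and
    the mixer is a toy stand-in for a cryptographic keyed hash.

```c
#include <stdint.h>

/* Hypothetical per-boot secret; a real system would fill this from a CSPRNG. */
static const uint64_t ts_secret = 0x9e3779b97f4a7c15ULL;

/* Toy 64-bit mixer (splitmix64 finalizer). NOT cryptographically secure;
 * it only stands in for a keyed hash in this illustration. */
static uint64_t mix64(uint64_t x)
{
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* The offset depends only on the secret and the address pair, never on
 * whether the connection was set up through a syncookie. */
static uint32_t ts_offset(uint32_t saddr, uint32_t daddr)
{
    uint64_t key = ((uint64_t)saddr << 32) | daddr;

    return (uint32_t)mix64(key ^ ts_secret);
}

/* Timestamp value placed on the wire: clock ticks plus the hidden offset. */
static uint32_t tsval(uint32_t ticks, uint32_t saddr, uint32_t daddr)
{
    return ticks + ts_offset(saddr, daddr);
}
```

    Because the offset is stable for a given peer, successive TSvals still
    advance with the clock, while the raw tick counter is not exposed.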
     

05 May, 2017

11 commits

  • Pull char/misc driver updates from Greg KH:
    "Here is the big set of new char/misc drivers and features for
    4.12-rc1.

    There's lots of new drivers added this time around, new firmware
    drivers from Google, more auxdisplay drivers, extcon drivers, fpga
    drivers, and a bunch of other driver updates. Nothing major, except if
    you happen to have the hardware for these drivers, and then you will
    be happy :)

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'char-misc-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (136 commits)
    firmware: google memconsole: Fix return value check in platform_memconsole_init()
    firmware: Google VPD: Fix return value check in vpd_platform_init()
    goldfish_pipe: fix build warning about using too much stack.
    goldfish_pipe: An implementation of more parallel pipe
    fpga fr br: update supported version numbers
    fpga: region: release FPGA region reference in error path
    fpga altera-hps2fpga: disable/unprepare clock on error in alt_fpga_bridge_probe()
    mei: drop the TODO from samples
    firmware: Google VPD sysfs driver
    firmware: Google VPD: import lib_vpd source files
    misc: lkdtm: Add volatile to intentional NULL pointer reference
    eeprom: idt_89hpesx: Add OF device ID table
    misc: ds1682: Add OF device ID table
    misc: tsl2550: Add OF device ID table
    w1: Remove unneeded use of assert() and remove w1_log.h
    w1: Use kernel common min() implementation
    uio_mf624: Align memory regions to page size and set correct offsets
    uio_mf624: Refactor memory info initialization
    uio: Allow handling of non page-aligned memory regions
    hangcheck-timer: Fix typo in comment
    ...

    Linus Torvalds
     
  • Pull driver core updates from Greg KH:
    "Very tiny pull request for 4.12-rc1 for the driver core this time
    around.

    There are some documentation fixes, an eventpoll.h fixup to make it
    easier for the libc developers to take our header files directly, and
    some very minor driver core fixes and changes.

    All have been in linux-next for a very long time with no reported
    issues"

    * tag 'driver-core-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    Revert "kref: double kref_put() in my_data_handler()"
    driver core: don't initialize 'parent' in device_add()
    drivers: base: dma-mapping: use nth_page helper
    Documentation/ABI: add information about cpu_capacity
    debugfs: set no_llseek in DEFINE_DEBUGFS_ATTRIBUTE
    eventpoll.h: add missing epoll event masks
    eventpoll.h: fix epoll event masks

    Linus Torvalds
     
  • Pull USB updates from Greg KH:
    "Here is the big USB patchset for 4.12-rc1.

    Lots of good stuff here, after many many many attempts, the kernel
    finally has a working typeC interface, many thanks to Heikki and
    Guenter and others who have taken the time to get this merged. It
    wasn't an easy path for them at all.

    There's also a staging driver that uses this new api, which is why
    it's coming in through this tree.

    Along with that, there's the usual huge number of changes for gadget
    drivers, xhci, and other stuff. Johan also finally refactored pretty
    much every driver that was looking at USB endpoints to do it in a
    common way, which will help prevent any "badly-formed" devices from
    causing problems in drivers. That too wasn't a simple task.

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'usb-4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (263 commits)
    staging: typec: Fairchild FUSB302 Type-c chip driver
    staging: typec: Type-C Port Controller Interface driver (tcpci)
    staging: typec: USB Type-C Port Manager (tcpm)
    usb: host: xhci: remove #ifdef around PM functions
    usb: musb: don't mark of_dev_auxdata as initdata
    usb: misc: legousbtower: Fix buffers on stack
    USB: Revert "cdc-wdm: fix "out-of-sync" due to missing notifications"
    usb: Make sure usb/phy/of gets built-in
    USB: storage: e-mail update in drivers/usb/storage/unusual_devs.h
    usb: host: xhci: print correct command ring address
    usb: host: xhci: delete sp_dma_buffers for scratchpad
    usb: host: xhci: using correct specification chapter reference for DCBAAP
    xhci: switch to pci_alloc_irq_vectors
    usb: host: xhci-plat: set resume_quirk() for R-Car controllers
    usb: host: xhci-plat: add resume_quirk()
    usb: host: xhci-plat: enable clk in resume timing
    usb: host: plat: Enable xHCI plat runtime PM
    USB: serial: ftdi_sio: add device ID for Microsemi/Arrow SF2PLUS Dev Kit
    USB: serial: constify static arrays
    usb: fix some references for /proc/bus/usb
    ...

    Linus Torvalds
     
  • Pull networking fixes from David Miller:

    1) The wireless rate info fix from Johannes Berg.

    2) When a RAW socket is in hdrincl mode, we need to make sure that the
    user provided at least a minimally sized ipv4/ipv6 header. Fix from
    Alexander Potapenko.

    3) We must emit IFLA_PHYS_PORT_NAME netlink attributes using
    nla_put_string() so that it is NULL terminated.

    4) Fix a bug in TCP fastopen handling, wherein child sockets
    erroneously inherit the fastopen_req from the parent, and later can
    end up dereferencing freed memory or doing a double free. From Eric
    Dumazet.

    5) Don't clear out netdev stats at close time in tg3 driver, from
    YueHaibing.

    6) Fix refcount leak in xt_CT, from Gao Feng.

    7) In nft_set_bitmap() don't leak dummy elements, from Liping Zhang.

    8) Fix deadlock due to taking the expectation lock twice, also from
    Liping Zhang.

    9) Make xt_socket work again with ipv6, from Peter Tirsek.

    10) Don't allow IPV6 to be used with IPVS if ipv6.disable=1, from Paolo
    Abeni.

    11) Make the BPF loader more flexible wrt. changes to the bpf MAP entry
    layout. From Jesper Dangaard Brouer.

    12) Fix ethtool reported device name in aquantia driver, from Pavel
    Belous.

    13) Fix build failures due to the compile time size test not working in
    netfilter conntrack. From Geert Uytterhoeven.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
    cfg80211: make RATE_INFO_BW_20 the default
    ipv6: initialize route null entry in addrconf_init()
    qede: Fix possible misconfiguration of advertised autoneg value.
    qed: Fix overriding of supported autoneg value.
    qed*: Fix possible overflow for status block id field.
    rtnetlink: NUL-terminate IFLA_PHYS_PORT_NAME string
    netvsc: make sure napi enabled before vmbus_open
    aquantia: Fix driver name reported by ethtool
    ipv4, ipv6: ensure raw socket message is big enough to hold an IP header
    net/sched: remove redundant null check on head
    tcp: do not inherit fastopen_req from parent
    forcedeth: remove unnecessary carrier status check
    ibmvnic: Move queue restarting in ibmvnic_tx_complete
    ibmvnic: Record SKB RX queue during poll
    ibmvnic: Continue skb processing after skb completion error
    ibmvnic: Check for driver reset first in ibmvnic_xmit
    ibmvnic: Wait for any pending scrqs entries at driver close
    ibmvnic: Clean up tx pools when closing
    ibmvnic: Whitespace correction in release_rx_pools
    ibmvnic: Delete napi's when releasing driver resources
    ...

    Linus Torvalds
     
  • Pull SCSI updates from James Bottomley:
    "This update includes the usual round of major driver updates
    (hisi_sas, ufs, fnic, cxlflash, be2iscsi, ipr, stex). There's also the
    usual amount of cosmetic and spelling stuff"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (155 commits)
    scsi: qla4xxx: fix spelling mistake: "Tempalate" -> "Template"
    scsi: stex: make S6flag static
    scsi: mac_esp: fix to pass correct device identity to free_irq()
    scsi: aacraid: pci_alloc_consistent() failures on ARM64
    scsi: ufs: make ufshcd_get_lists_status() register operation obvious
    scsi: ufs: use MASK_EE_STATUS
    scsi: mac_esp: Replace bogus memory barrier with spinlock
    scsi: fcoe: make fcoe_e_d_tov and fcoe_r_a_tov static
    scsi: sd_zbc: Do not write lock zones for reset
    scsi: sd_zbc: Remove superfluous assignments
    scsi: sd: sd_zbc: Rename sd_zbc_setup_write_cmnd
    scsi: Improve scsi_get_sense_info_fld
    scsi: sd: Cleanup sd_done sense data handling
    scsi: sd: Improve sd_completed_bytes
    scsi: sd: Fix function descriptions
    scsi: mpt3sas: remove redundant wmb
    scsi: mpt: Move scsi_remove_host() out of mptscsih_remove_host()
    scsi: sg: reset 'res_in_use' after unlinking reserved array
    scsi: mvumi: remove code handling zero scsi_sg_count(scmd) case
    scsi: fusion: fix spelling mistake: "Persistancy" -> "Persistency"
    ...

    Linus Torvalds
     
  • Pull GPIO updates from Linus Walleij:
    "This is the bulk of GPIO changes for the v4.12 kernel cycle.

    Core changes:

    - Return NULL from gpiod_get_optional() when GPIOLIB is disabled.
    This was a much discussed change. It affects use cases where people
    write drivers that might or might not be using GPIO resources. I
    have decided that this is the lesser evil right now.

    - Make gpiod_count() behave consistently across different hardware
    descriptions.

    - Fix the syntax around open drain/open source to not infer active
    high/low semantics.

    New drivers:

    - A new single-register fixed-direction framework driver for hardware
    that has lines controlled by a single register that work in only
    one direction (out or in), including IRQ support.

    - Support the Fintek F71889A GPIO SuperIO controller.

    - Support the National NI 169445 MMIO GPIO.

    - Support for the X-Gene derivative of the DWC GPIO controller

    - Support for the Rohm BD9571MWV-M PMIC GPIO controller.

    - Refactor the Gemini GPIO driver to a generic Faraday FTGPIO driver
    and replace both the Gemini and the Moxa ART custom drivers with
    this driver.

    Driver improvements:

    - A whole slew of drivers have their spinlocks changed to raw
    spinlocks as they provide irqchips, and thus we are progressing on
    realtime compliance.

    - Use devm_irq_alloc_descs() in a slew of drivers, getting managed
    resources.

    - Support for the embedded PWM controller inside the MVEBU driver.

    - Debounce, open source and open drain support for the Aspeed driver.

    - Misc smaller fixes like spelling and syntax and whatnot"

    * tag 'gpio-v4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (77 commits)
    gpio: f7188x: Add a missing break
    gpio: omap: return error if requested debounce time is not possible
    gpio: Add ROHM BD9571MWV-M PMIC GPIO driver
    gpio: gpio-wcove: fix GPIO IRQ status mask
    gpio: DT bindings, move tca9554 from pcf857x to pca953x
    gpio: move tca9554 from pcf857x to pca953x
    gpio: arizona: Correct check whether the pin is an input
    gpio: Add XRA1403 DTS binding documentation
    dt-bindings: add exar to vendor prefixes list
    gpio: gpio-wcove: fix irq pending status bit width
    gpio: dwapb: use dwapb_read instead of readl_relaxed
    gpio: aspeed: Add open-source and open-drain support
    gpio: aspeed: Add debounce support
    gpio: aspeed: dt: Add optional clocks property
    gpio: aspeed: dt: Fix description alignment in bindings document
    gpio: mvebu: Add limited PWM support
    gpio: Use unsigned int for interrupt numbers
    gpio: f7188x: Add F71889A GPIO support.
    gpio: core: Decouple open drain/source flag with active low/high
    gpio: arizona: Correct handling for reading input GPIOs
    ...

    Linus Torvalds
     
  • Pull x86 platform-drivers update from Darren Hart:
    "This represents a significantly larger and more complex set of changes
    than those of prior merge windows.

    In particular, we had several changes with dependencies on other
    subsystems which we felt were best managed through merges of immutable
    branches, including one each from input, i2c, and leds. Two patches
    for the watchdog subsystem are included after discussion with Wim and
    Guenter following a collision in linux-next (this should be resolved
    and you should only see these two appear in this pull request). These
    are called out in the "External" section below.

    Summary of changes:
    - significant further cleanup of fujitsu-laptop and hp-wmi
    - new model support for ideapad, asus, silead, and xiaomi
    - new hotkeys for thinkpad and models using intel-vbtn
    - dell keyboard backlight improvements
    - build and dependency improvements
    - intel * ipc fixes, cleanups, and api updates
    - single isolated fixes noted below

    External:
    - watchdog: iTCO_wdt: Add PMC specific noreboot update api
    - watchdog: iTCO_wdt: cleanup set/unset no_reboot_bit functions
    - Merge branch 'ib/4.10-sparse-keymap-managed'
    - Merge branch 'i2c/for-INT33FE'
    - Merge branch 'linux-leds/dell-laptop-changes-for-4.12'

    platform/x86:
    - Add Intel Cherry Trail ACPI INT33FE device driver
    - remove sparse_keymap_free() calls
    - Make SILEAD_DMI depend on TOUCHSCREEN_SILEAD

    asus-wmi:
    - try to set als by default
    - fix cpufv sysfs file permission

    acer-wmi:
    - setup accelerometer when ACPI device was found

    ideapad-laptop:
    - Add IdeaPad V310-15ISK to no_hw_rfkill
    - Add IdeaPad 310-15IKB to no_hw_rfkill

    intel_pmc_ipc:
    - use gcr mem base for S0ix counter read
    - Fix iTCO_wdt GCS memory mapping failure
    - Add pmc gcr read/write/update api's
    - fix gcr offset

    dell-laptop:
    - Add keyboard backlight timeout AC settings
    - Handle return error form dell_get_intensity.
    - Protect kbd_state against races
    - Refactor kbd_led_triggers_store()

    hp-wireless:
    - reuse module_acpi_driver
    - add Xiaomi's hardware id to the supported list

    intel-vbtn:
    - add volume up and down

    INT33FE:
    - add i2c dependency

    hp-wmi:
    - Cleanup exit paths
    - Do not shadow errors in sysfs show functions
    - Use DEVICE_ATTR_(RO|RW) helper macros
    - Refactor dock and tablet state fetchers
    - Cleanup wireless get_(hw|sw)state functions
    - Refactor redundant HPWMI_READ functions
    - Standardize enum usage for constants
    - Cleanup local variable declarations
    - Do not shadow error values
    - Fix detection for dock and tablet mode
    - Fix error value for hp_wmi_tablet_state

    fujitsu-laptop:
    - simplify error handling in acpi_fujitsu_laptop_add()
    - do not log LED registration failures
    - switch to managed LED class devices
    - reorganize LED-related code
    - refactor LED registration
    - select LEDS_CLASS
    - remove redundant fields from struct fujitsu_bl
    - account for backlight power when determining brightness
    - do not log set_lcd_level() failures in bl_update_status()
    - ignore errors when setting backlight power
    - make disable_brightness_adjust a boolean
    - clean up use_alt_lcd_levels handling
    - sync brightness in set_lcd_level()
    - simplify set_lcd_level()
    - merge set_lcd_level_alt() into set_lcd_level()
    - switch to a managed backlight device
    - only handle backlight when appropriate
    - update debug message logged by call_fext_func()
    - rename call_fext_func() arguments
    - simplify call_fext_func()
    - clean up local variables in call_fext_func()
    - remove keycode fields from struct fujitsu_bl
    - model-dependent sparse keymap overrides
    - use a sparse keymap for hotkey event generation
    - switch to a managed hotkey input device
    - refactor hotkey input device setup
    - use a sparse keymap for brightness key events
    - switch to a managed backlight input device
    - refactor backlight input device setup
    - remove pf_device field from struct fujitsu_bl
    - only register platform device if FUJ02E3 is present
    - add and remove platform device in separate functions
    - simplify platform device attribute definitions
    - remove backlight-related attributes from the platform device
    - cleanup error labels in fujitsu_init()
    - only register backlight device if FUJ02B1 is present
    - sync backlight power status in acpi_fujitsu_laptop_add()
    - register backlight device in a separate function
    - simplify brightness key event generation logic
    - decrease indentation in acpi_fujitsu_bl_notify()

    intel-hid:
    - Add missing ->thaw callback
    - do not set parents of input devices explicitly
    - remove redundant set_bit() call
    - use devm_input_allocate_device() for HID events input device
    - make intel_hid_set_enable() take a boolean argument
    - simplify enabling/disabling HID events

    silead_dmi:
    - Add touchscreen info for Surftab Wintron 7.0
    - Abort early if DMI does not match
    - Do not treat all devices as i2c_clients
    - Add entry for Insyde 7W tablets
    - Constify properties arrays

    intel_scu_ipc:
    - Introduce intel_scu_ipc_raw_command()
    - Introduce SCU_DEVICE() macro
    - Remove redundant subarch check
    - Rearrange init sequence
    - Platform data is mandatory

    asus-nb-wmi:
    - Add wapf4 quirk for the X302UA

    dell-*:
    - Call new led hw_changed API on kbd brightness change
    - Add a generic dell-laptop notifier chain

    eeepc-laptop:
    - Skip unknown key messages 0x50 0x51

    thinkpad_acpi:
    - add mapping for new hotkeys
    - guard generic hotkey case"

    * tag 'platform-drivers-x86-v4.12-1' of git://git.infradead.org/linux-platform-drivers-x86: (108 commits)
    platform/x86: Make SILEAD_DMI depend on TOUCHSCREEN_SILEAD
    platform/x86: asus-wmi: try to set als by default
    platform/x86: asus-wmi: fix cpufv sysfs file permission
    platform/x86: acer-wmi: setup accelerometer when ACPI device was found
    platform/x86: ideapad-laptop: Add IdeaPad V310-15ISK to no_hw_rfkill
    platform/x86: intel_pmc_ipc: use gcr mem base for S0ix counter read
    platform/x86: intel_pmc_ipc: Fix iTCO_wdt GCS memory mapping failure
    watchdog: iTCO_wdt: Add PMC specific noreboot update api
    watchdog: iTCO_wdt: cleanup set/unset no_reboot_bit functions
    platform/x86: intel_pmc_ipc: Add pmc gcr read/write/update api's
    platform/x86: intel_pmc_ipc: fix gcr offset
    platform/x86: dell-laptop: Add keyboard backlight timeout AC settings
    platform/x86: dell-laptop: Handle return error form dell_get_intensity.
    platform/x86: hp-wireless: reuse module_acpi_driver
    platform/x86: intel-vbtn: add volume up and down
    platform/x86: INT33FE: add i2c dependency
    platform/x86: hp-wmi: Cleanup exit paths
    platform/x86: hp-wmi: Do not shadow errors in sysfs show functions
    platform/x86: hp-wmi: Use DEVICE_ATTR_(RO|RW) helper macros
    platform/x86: hp-wmi: Refactor dock and tablet state fetchers
    ...

    Linus Torvalds
     
  • Pull xen updates from Juergen Gross:
    "Xen fixes and features for 4.12. The main changes are:

    - enable building the kernel with Xen support but without enabling
    paravirtualized mode (Vitaly Kuznetsov)

    - add a new 9pfs xen frontend driver (Stefano Stabellini)

    - simplify Xen's cpuid handling by making use of cpu capabilities
    (Juergen Gross)

    - add/modify some headers for new Xen paravirtualized devices
    (Oleksandr Andrushchenko)

    - EFI reset_system support under Xen (Julien Grall)

    - and the usual cleanups and corrections"

    * tag 'for-linus-4.12b-rc0b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (57 commits)
    xen: Move xen_have_vector_callback definition to enlighten.c
    xen: Implement EFI reset_system callback
    arm/xen: Consolidate calls to shutdown hypercall in a single helper
    xen: Export xen_reboot
    xen/x86: Call xen_smp_intr_init_pv() on BSP
    xen: Revert commits da72ff5bfcb0 and 72a9b186292d
    xen/pvh: Do not fill kernel's e820 map in init_pvh_bootparams()
    xen/scsifront: use offset_in_page() macro
    xen/arm,arm64: rename __generic_dma_ops to xen_get_dma_ops
    xen/arm,arm64: fix xen_dma_ops after 815dd18 "Consolidate get_dma_ops..."
    xen/9pfs: select CONFIG_XEN_XENBUS_FRONTEND
    x86/cpu: remove hypervisor specific set_cpu_features
    vmware: set cpu capabilities during platform initialization
    x86/xen: use capabilities instead of fake cpuid values for xsave
    x86/xen: use capabilities instead of fake cpuid values for x2apic
    x86/xen: use capabilities instead of fake cpuid values for mwait
    x86/xen: use capabilities instead of fake cpuid values for acpi
    x86/xen: use capabilities instead of fake cpuid values for acc
    x86/xen: use capabilities instead of fake cpuid values for mtrr
    x86/xen: use capabilities instead of fake cpuid values for aperf
    ...

    Linus Torvalds
     
  • Due to the way I did the RX bitrate conversions in mac80211 with
    spatch, going from setting flags to setting the value, many drivers now
    don't set the bandwidth value for 20 MHz, since with the flags it
    wasn't necessary to (there was no 20 MHz flag, only the others.)

    Rather than go through and try to fix up all the drivers, instead
    renumber the enum so that 20 MHz, which is the typical bandwidth,
    actually has the value 0, making those drivers all work again.

    If VHT was used with a driver not reporting it, e.g. iwlmvm,
    this manifested in hitting the bandwidth warning in
    cfg80211_calculate_bitrate_vht().

    Reported-by: Linus Torvalds
    Tested-by: Jens Axboe
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
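
    The effect of the renumbering can be sketched in plain C. This is an
    illustration only: the enum, struct and function names below are
    hypothetical stand-ins, not the cfg80211 definitions. A driver that
    zero-fills its status and never sets the bandwidth now reports 20 MHz.

```c
#include <string.h>

/* Hypothetical bandwidth enum: the typical 20 MHz case has the value 0. */
enum bw { BW_20 = 0, BW_5, BW_10, BW_40, BW_80, BW_160 };

/* Hypothetical receive status filled in by a driver. */
struct rate_status {
    enum bw bw;
    unsigned int mcs;
};

/* A driver that predates the value-based API: it zero-fills the status
 * and never touches the bandwidth field. */
static void driver_fill_status(struct rate_status *st)
{
    memset(st, 0, sizeof(*st));
    st->mcs = 7;
}

/* Map the enum to a channel width in MHz; -1 for an unknown value. */
static int bw_mhz(enum bw bw)
{
    switch (bw) {
    case BW_5:   return 5;
    case BW_10:  return 10;
    case BW_20:  return 20;
    case BW_40:  return 40;
    case BW_80:  return 80;
    case BW_160: return 160;
    }
    return -1;
}
```

    With any other member at value 0, the same untouched field would claim
    a bandwidth the driver never intended to report.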
     
  • Andrey reported a crash on init_net.ipv6.ip6_null_entry->rt6i_idev
    since it is always NULL.

    This is clearly wrong, we have code to initialize it to loopback_dev,
    unfortunately the order is still not correct.

    loopback_dev is registered very early during boot, we lose a chance
    to re-initialize it in notifier. addrconf_init() is called after
    ip6_route_init(), which means we have no chance to correct it.

    Fix it by moving this initialization explicitly after
    ipv6_add_dev(init_net.loopback_dev) in addrconf_init().

    Reported-by: Andrey Konovalov
    Signed-off-by: Cong Wang
    Tested-by: Andrey Konovalov
    Signed-off-by: David S. Miller

    WANG Cong
     
  • The value of the status block id can be more than 256 in 100G mode, so
    its data type needs to be updated from u8 to u16.

    Signed-off-by: Sudarsana Reddy Kalluru
    Signed-off-by: Yuval Mintz
    Signed-off-by: David S. Miller

    sudarsana.kalluru@cavium.com
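
    A small sketch of the overflow being fixed, using hypothetical helper
    names rather than the qed driver's own code: an id above 255 wraps when
    stored in 8 bits but survives in 16 bits.

```c
#include <stdint.h>

/* Store an id in an 8-bit field: values above 255 wrap modulo 256. */
static uint8_t store_sb_id_u8(unsigned int id)
{
    return (uint8_t)id;
}

/* Store the same id in a 16-bit field: values up to 65535 are preserved. */
static uint16_t store_sb_id_u16(unsigned int id)
{
    return (uint16_t)id;
}
```

    For example, an id of 300 becomes 44 in the 8-bit field and stays 300
    in the 16-bit one.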
     

04 May, 2017

28 commits

  • Pull modules updates from Jessica Yu:

    - Minor code cleanups

    - Fix section alignment for .init_array

    * tag 'modules-for-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
    kallsyms: Use bounded strnchr() when parsing string
    module: Unify the return value type of try_module_get
    module: set .init_array alignment to 8

    Linus Torvalds
     
  • Pull tracing updates from Steven Rostedt:
    "New features for this release:

    - Pretty much a full rewrite of the processing of function plugins.
    i.e. echo do_IRQ:stacktrace > set_ftrace_filter

    - The rewrite was needed to add plugins to be unique to tracing
    instances. i.e. mkdir instances/foo; cd instances/foo; echo
    do_IRQ:stacktrace > set_ftrace_filter. The old way was written very
    hacky. This removes a lot of those hacks.

    - New "function-fork" tracing option. When set, pids in the
    set_ftrace_pid will have their children added when the processes
    with their pids listed in the set_ftrace_pid file forks.

    - Exposure of "maxactive" for kretprobe in kprobe_events

    - Allow for builtin init functions to be traced by the function
    tracer (via the kernel command line). Module init function tracing
    will come in the next release.

    - Added more selftests, and have selftests also test in an instance"

    * tag 'trace-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (60 commits)
    ring-buffer: Return reader page back into existing ring buffer
    selftests: ftrace: Allow some event trigger tests to run in an instance
    selftests: ftrace: Have some basic tests run in a tracing instance too
    selftests: ftrace: Have event tests also run in an tracing instance
    selftests: ftrace: Make func_event_triggers and func_traceonoff_triggers tests do instances
    selftests: ftrace: Allow some tests to be run in a tracing instance
    tracing/ftrace: Allow for instances to trigger their own stacktrace probes
    tracing/ftrace: Allow for the traceonoff probe be unique to instances
    tracing/ftrace: Enable snapshot function trigger to work with instances
    tracing/ftrace: Allow instances to have their own function probes
    tracing/ftrace: Add a better way to pass data via the probe functions
    ftrace: Dynamically create the probe ftrace_ops for the trace_array
    tracing: Pass the trace_array into ftrace_probe_ops functions
    tracing: Have the trace_array hold the list of registered func probes
    ftrace: If the hash for a probe fails to update then free what was initialized
    ftrace: Have the function probes call their own function
    ftrace: Have each function probe use its own ftrace_ops
    ftrace: Have unregister_ftrace_function_probe_func() return a value
    ftrace: Add helper function ftrace_hash_move_and_update_ops()
    ftrace: Remove data field from ftrace_func_probe structure
    ...

    Linus Torvalds
     
  • Merge misc updates from Andrew Morton:

    - a few misc things

    - most of MM

    - KASAN updates

    * emailed patches from Andrew Morton: (102 commits)
    kasan: separate report parts by empty lines
    kasan: improve double-free report format
    kasan: print page description after stacks
    kasan: improve slab object description
    kasan: change report header
    kasan: simplify address description logic
    kasan: change allocation and freeing stack traces headers
    kasan: unify report headers
    kasan: introduce helper functions for determining bug type
    mm: hwpoison: call shake_page() after try_to_unmap() for mlocked page
    mm: hwpoison: call shake_page() unconditionally
    mm/swapfile.c: fix swap space leak in error path of swap_free_entries()
    mm/gup.c: fix access_ok() argument type
    mm/truncate: avoid pointless cleancache_invalidate_inode() calls.
    mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is empty
    fs/block_dev: always invalidate cleancache in invalidate_bdev()
    fs: fix data invalidation in the cleancache during direct IO
    zram: reduce load operation in page_same_filled
    zram: use zram_free_page instead of open-coded
    zram: introduce zram data accessor
    ...

    Linus Torvalds
     
  • This is a code cleanup patch with no functional changes. Two unused
    function prototypes in swap.h are removed.

    Link: http://lkml.kernel.org/r/20170405071017.23677-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • The memory controller's stat function names are awkwardly long and
    arbitrarily different from the zone and node stat functions.

    The current interface is named:

    mem_cgroup_read_stat()
    mem_cgroup_update_stat()
    mem_cgroup_inc_stat()
    mem_cgroup_dec_stat()
    mem_cgroup_update_page_stat()
    mem_cgroup_inc_page_stat()
    mem_cgroup_dec_page_stat()

    This patch renames it to match the corresponding node stat functions:

    memcg_page_state() [node_page_state()]
    mod_memcg_state() [mod_node_state()]
    inc_memcg_state() [inc_node_state()]
    dec_memcg_state() [dec_node_state()]
    mod_memcg_page_state() [mod_node_page_state()]
    inc_memcg_page_state() [inc_node_page_state()]
    dec_memcg_page_state() [dec_node_page_state()]

    Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The current duplication is a high-maintenance mess, and it's painful to
    add new items or query memcg state from the rest of the VM.

    This increases the size of the stat array marginally, but we should aim
    to track all these stats on a per-cgroup level anyway.

    Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The current duplication is a high-maintenance mess, and it's painful to
    add new items.

    This increases the size of the event array, but we'll eventually want
    most of the VM events tracked on a per-cgroup basis anyway.

    Link: http://lkml.kernel.org/r/20170404220148.28338-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We only ever count single events, drop the @nr parameter. Rename the
    function accordingly. Remove low-information kerneldoc.

    Link: http://lkml.kernel.org/r/20170404220148.28338-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") we noticed bigger IO spikes during changes in cache access
    patterns.

    The patch in question shrunk the inactive list size to leave more room
    for the current workingset in the presence of streaming IO. However,
    workingset transitions that previously happened on the inactive list are
    now pushed out of memory and incur more refaults to complete.

    This patch disables active list protection when refaults are being
    observed. This accelerates workingset transitions, and allows more of
    the new set to establish itself from memory, without eating into the
    ability to protect the established workingset during stable periods.

    The workloads that were measurably affected for us were hit pretty badly
    by it, with refault/majfault rates doubling and tripling during cache
    transitions, and the machines sustaining half-hour periods of 100% IO
    utilization, where they'd previously have sub-minute peaks at 60-90%.

    Stateful services that handle user data tend to be more conservative
    with kernel upgrades. As a result we hit most page cache issues with
    some delay, as was the case here.

    The severity seemed to warrant a stable tag.

    Fixes: 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list")
    Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Fix variable name error in comments. No code changes.

    Link: http://lkml.kernel.org/r/20170403161655.5081-1-haolee.swjtu@gmail.com
    Signed-off-by: Hao Lee
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hao Lee
     
  • It is preferred, and the rest of migrate.h gets it right.

    Link: http://lkml.kernel.org/r/1490336009-8024-1-git-send-email-pushkar.iit@gmail.com
    Signed-off-by: Pushkar Jambhlekar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pushkar Jambhlekar
     
  • On SPARSEMEM systems page poisoning is enabled after buddy is up,
    because of the dependency on page extension init. This causes the pages
    released by free_all_bootmem not to be poisoned. This either delays or
    misses the identification of some issues because the pages have to
    undergo another cycle of alloc-free-alloc for any corruption to be
    detected.

    Enable page poisoning early by getting rid of the PAGE_EXT_DEBUG_POISON
    flag. Since all the free pages will now be poisoned, the flag need not
    be verified before checking the poison during an alloc.

    [vinmenon@codeaurora.org: fix Kconfig]
    Link: http://lkml.kernel.org/r/1490878002-14423-1-git-send-email-vinmenon@codeaurora.org
    Link: http://lkml.kernel.org/r/1490358246-11001-1-git-send-email-vinmenon@codeaurora.org
    Signed-off-by: Vinayak Menon
    Acked-by: Laura Abbott
    Tested-by: Laura Abbott
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vinayak Menon
     
  • There is no user for it. Remove it.

    [minchan@kernel.org: use false instead of SWAP_FAIL]
    Link: http://lkml.kernel.org/r/20170316053313.GA19241@bbox
    Link: http://lkml.kernel.org/r/1489555493-14659-11-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • rmap_one's return value controls whether rmap_walk should continue to
    scan other ptes or not, so it is a good candidate for a boolean. Return
    true if the scan should continue. Otherwise, return false to stop
    the scan.

    This patch changes rmap_one's return value to boolean.
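
    The convention can be illustrated with a generic walker. This is a
    user-space sketch under assumed names: walk_mappings_model(),
    rmap_one_fn and stop_at_target() are hypothetical, and the integer
    array stands in for the mappings a real walk would visit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Callback type: return true to keep scanning, false to stop. */
typedef bool (*rmap_one_fn)(int mapping, void *arg);

/* Visits mappings in order; returns how many were visited (illustration only). */
static size_t walk_mappings_model(const int *mappings, size_t nr,
				  rmap_one_fn rmap_one, void *arg)
{
	size_t visited = 0;

	assert(rmap_one != NULL);
	for (size_t i = 0; i < nr; i++) {
		visited++;
		if (!rmap_one(mappings[i], arg))
			break; /* callback asked to stop the scan */
	}
	return visited;
}

/* Example callback: stop once the target mapping has been seen. */
static bool stop_at_target(int mapping, void *arg)
{
	return mapping != *(const int *)arg;
}
```

    A boolean keeps the contract to a single question, "continue or not",
    instead of overloading status codes.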

    Link: http://lkml.kernel.org/r/1489555493-14659-10-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • There is no user of the return value from rmap_walk() and friends so
    this patch makes them void-returning functions.

    Link: http://lkml.kernel.org/r/1489555493-14659-9-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • try_to_unmap() returns SWAP_SUCCESS or SWAP_FAIL so it's suitable for
    boolean return. This patch changes it.

    Link: http://lkml.kernel.org/r/1489555493-14659-8-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • ttu doesn't need to return SWAP_MLOCK. Instead, just return SWAP_FAIL
    because it means the page is not swappable, so it should move to another
    LRU list (active or unevictable). putback friends will move it to the
    right list depending on the page's LRU flag.

    Link: http://lkml.kernel.org/r/1489555493-14659-6-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • try_to_munlock returns SWAP_MLOCK if one of the VMAs mapping the page
    has the VM_LOCKED flag. In that case, the VM sets PG_mlocked on the
    page, unless the page is a pte-mapped THP, which cannot be mlocked either.

    With that, __munlock_isolated_page can use PageMlocked to check whether
    try_to_munlock is successful or not without relying on try_to_munlock's
    retval. It helps to make try_to_unmap/try_to_unmap_one simple with
    upcoming patches.
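
    The idea of checking a page flag instead of a return value can be
    sketched as follows. This is a simplified user-space model, not the
    mlock code: struct page_model, struct vma_model and both functions are
    hypothetical stand-ins, and the THP special case is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vma_model { bool vm_locked; };  /* stands in for VM_LOCKED */
struct page_model { bool mlocked; };   /* stands in for PG_mlocked */

/* Re-marks the page if any VMA mapping it is still locked; returns nothing. */
static void try_to_munlock_model(struct page_model *page,
				 const struct vma_model *vmas, size_t nr_vmas)
{
	for (size_t i = 0; i < nr_vmas; i++) {
		if (vmas[i].vm_locked) {
			page->mlocked = true;
			return;
		}
	}
}

/* Returns true if the page ended up munlocked. */
static bool munlock_isolated_page_model(struct page_model *page,
					const struct vma_model *vmas,
					size_t nr_vmas)
{
	assert(page != NULL);
	page->mlocked = false; /* the munlock request cleared the flag */
	try_to_munlock_model(page, vmas, nr_vmas);
	return !page->mlocked; /* decided from the flag, not a retval */
}
```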

    [minchan@kernel.org: remove PG_Mlocked VM_BUG_ON check]
    Link: http://lkml.kernel.org/r/20170411025615.GA6545@bbox
    Link: http://lkml.kernel.org/r/1489555493-14659-5-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • If we find that a lazyfree page is dirty, try_to_unmap_one can just
    call SetPageSwapBacked there, as for a PG_mlocked page, and return
    SWAP_FAIL. That is natural because the page is not swappable right
    now, so vmscan can activate it. There is no point in introducing a new
    return value SWAP_DIRTY in try_to_unmap at the moment.

    Link: http://lkml.kernel.org/r/1489555493-14659-3-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Hillf Danton
    Acked-by: Kirill A. Shutemov
    Cc: Anshuman Khandual
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Since commit 3ad38ceb2769 ("x86/mm: Remove CONFIG_DEBUG_NX_TEST"),
    nothing is using the exported rodata_test_data variable, so drop the
    export.

    This additionally updates the pr_fmt to avoid redundant strings and
    adjusts some whitespace.

    Link: http://lkml.kernel.org/r/20170307005313.GA85809@beast
    Signed-off-by: Kees Cook
    Cc: Jinbum Park
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Now that we have the memalloc_nofs_{save,restore} API, we can mark the
    whole transaction context as implicitly GFP_NOFS. All allocations will
    automatically inherit GFP_NOFS this way. This means that we do not have
    to mark any of those requests with GFP_NOFS and moreover all the
    ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded
    GFP_KERNEL allocations deep inside the vmalloc will be NOFS now.

    [akpm@linux-foundation.org: tweak comments]
    Link: http://lkml.kernel.org/r/20170306131408.9828-7-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Jan Kara
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: David Sterba
    Cc: Brian Foster
    Cc: Darrick J. Wong
    Cc: Nikolay Borisov
    Cc: Peter Zijlstra
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • GFP_NOFS context is used for the following 5 reasons currently:

    - to prevent deadlocks when the lock held by the allocation
    context would be needed during the memory reclaim

    - to prevent stack overflows during the reclaim because the
    allocation is performed from a deep context already

    - to prevent lockups when the allocation context depends on other
    reclaimers to make forward progress indirectly

    - just in case because this would be safe from the fs POV

    - silence lockdep false positives

    Unfortunately overuse of this allocation context brings some problems to
    the MM. Memory reclaim is much weaker (especially during heavy FS
    metadata workloads), OOM killer cannot be invoked because the MM layer
    doesn't have enough information about how much memory is freeable by the
    FS layer.

    In many cases it is far from clear why the weaker context is even used
    and so it might be used unnecessarily. We would like to get rid of
    those as much as possible. One way to do that is to use the flag in
    scopes rather than isolated cases. Such a scope is declared when really
    necessary, tracked per task and all the allocation requests from within
    the context will simply inherit the GFP_NOFS semantic.

    Not only is this easier to understand and maintain, because there are
    far fewer problematic contexts than specific allocation requests, it
    also helps code paths where the FS layer interacts with other layers
    (e.g. crypto, security modules, MM etc...) and there is no easy way to
    convey the allocation context between the layers.

    Introduce the memalloc_nofs_{save,restore} API to control the scope of
    the GFP_NOFS allocation context. This basically copies the
    memalloc_noio_{save,restore} API we have for the other restricted
    allocation context, GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists
    and it is just an alias for PF_FSTRANS, which was xfs specific until
    recently. There are no PF_FSTRANS users anymore, so let's just drop it.

    PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
    implicitly, the same way PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
    is renamed to current_gfp_context because it now cares about both
    PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
    their semantic. kmem_flags_convert() doesn't need to evaluate the flag
    anymore.
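
    The scope semantics can be sketched in user space. This is a minimal
    model, not the kernel API: the *_model functions take an explicit
    struct task_model instead of using the current task, and all flag
    values are illustrative. That the NOIO scope also clears the FS bit is
    an assumption of this sketch (NOIO being the stricter context).

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int gfp_t;

/* Illustrative bit values only. */
#define GFP_IO_MODEL     0x1u
#define GFP_FS_MODEL     0x2u
#define GFP_KERNEL_MODEL (GFP_IO_MODEL | GFP_FS_MODEL | 0x4u)
#define PF_NOIO_MODEL    0x1u
#define PF_NOFS_MODEL    0x2u

struct task_model { unsigned int flags; };

/* Enter a NOFS scope; returns the previous NOFS state for nesting. */
static unsigned int nofs_save_model(struct task_model *t)
{
	unsigned int old = t->flags & PF_NOFS_MODEL;

	t->flags |= PF_NOFS_MODEL;
	return old;
}

/* Leave a NOFS scope, restoring whatever state the matching save saw. */
static void nofs_restore_model(struct task_model *t, unsigned int old)
{
	t->flags = (t->flags & ~PF_NOFS_MODEL) | old;
}

/* Applies the task's scope to a requested allocation mask. */
static gfp_t gfp_context_model(const struct task_model *t, gfp_t flags)
{
	assert(t != NULL);
	if (t->flags & PF_NOIO_MODEL)
		flags &= ~(GFP_IO_MODEL | GFP_FS_MODEL); /* assumption: NOIO implies NOFS */
	else if (t->flags & PF_NOFS_MODEL)
		flags &= ~GFP_FS_MODEL;
	return flags;
}
```

    Because save returns the previous state, nested scopes compose: only
    the outermost restore re-enables FS reclaim for the task.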

    This patch shouldn't introduce any functional changes.

    Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
    usage as much as possible and only use properly documented
    memalloc_nofs_{save,restore} checkpoints where they are appropriate.

    [akpm@linux-foundation.org: fix comment typo, reflow comment]
    Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: David Sterba
    Cc: Jan Kara
    Cc: Brian Foster
    Cc: Darrick J. Wong
    Cc: Nikolay Borisov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • xfs defined PF_FSTRANS to declare a scoped GFP_NOFS semantic quite
    some time ago. We would like to make this concept more generic and use
    it for other filesystems as well. Let's start by giving the flag a more
    generic name PF_MEMALLOC_NOFS which is in line with the existing
    PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
    contexts. Replace all PF_FSTRANS usage from the xfs code in the first
    step before we introduce a full API for it as xfs uses the flag directly
    anyway.

    This patch doesn't introduce any functional change.

    Link: http://lkml.kernel.org/r/20170306131408.9828-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Acked-by: Vlastimil Babka
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: David Sterba
    Cc: Jan Kara
    Cc: Nikolay Borisov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The current implementation of the reclaim lockup detection can lead to
    false positives. These do happen in practice, and the usual response is
    to tweak the code to silence lockdep by using GFP_NOFS even though the
    context can use __GFP_FS just fine.

    See

    http://lkml.kernel.org/r/20160512080321.GA18496@dastard

    as an example.

    =================================
    [ INFO: inconsistent lock state ]
    4.5.0-rc2+ #4 Tainted: G O
    ---------------------------------
    inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
    kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

    (&xfs_nondir_ilock_class){++++-+}, at: xfs_ilock+0x177/0x200 [xfs]

    {RECLAIM_FS-ON-R} state was registered at:
    mark_held_locks+0x79/0xa0
    lockdep_trace_alloc+0xb3/0x100
    kmem_cache_alloc+0x33/0x230
    kmem_zone_alloc+0x81/0x120 [xfs]
    xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
    __xfs_refcount_find_shared+0x75/0x580 [xfs]
    xfs_refcount_find_shared+0x84/0xb0 [xfs]
    xfs_getbmap+0x608/0x8c0 [xfs]
    xfs_vn_fiemap+0xab/0xc0 [xfs]
    do_vfs_ioctl+0x498/0x670
    SyS_ioctl+0x79/0x90
    entry_SYSCALL_64_fastpath+0x12/0x6f

    CPU0
    ----
    lock(&xfs_nondir_ilock_class);

    lock(&xfs_nondir_ilock_class);

    *** DEADLOCK ***

    3 locks held by kswapd0/543:

    stack backtrace:
    CPU: 0 PID: 543 Comm: kswapd0 Tainted: G O 4.5.0-rc2+ #4
    Call Trace:
    lock_acquire+0xd8/0x1e0
    down_write_nested+0x5e/0xc0
    xfs_ilock+0x177/0x200 [xfs]
    xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
    xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
    evict+0xc5/0x190
    dispose_list+0x39/0x60
    prune_icache_sb+0x4b/0x60
    super_cache_scan+0x14f/0x1a0
    shrink_slab.part.63.constprop.79+0x1e9/0x4e0
    shrink_zone+0x15e/0x170
    kswapd+0x4f1/0xa80
    kthread+0xf2/0x110
    ret_from_fork+0x3f/0x70

    To quote Dave:
    "Ignoring whether reflink should be doing anything or not, that's a
    "xfs_refcountbt_init_cursor() gets called both outside and inside
    transactions" lockdep false positive case. The problem here is lockdep
    has seen this allocation from within a transaction, hence a GFP_NOFS
    allocation, and now it's seeing it in a GFP_KERNEL context. Also note
    that we have an active reference to this inode.

    So, because the reclaim annotations overload the interrupt level
    detections and it's seen the inode ilock been taken in reclaim
    ("interrupt") context, this triggers a reclaim context warning where
    it thinks it is unsafe to do this allocation in GFP_KERNEL context
    holding the inode ilock..."

    This sounds like a fundamental problem of the reclaim lock detection.
    It is really impossible to annotate such a special usecase IMHO unless
    the reclaim lockup detection is reworked completely. Until then it is
    much better to provide an "I know what I am doing" flag and
    mark problematic places. This would prevent abuse of the GFP_NOFS flag,
    which has a runtime effect even on configurations which have lockdep
    disabled.

    Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
    skip the current allocation request.
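
    How such a flag short-circuits the tracking can be sketched as below.
    This is a simplified user-space model with illustrative bit values;
    should_track_alloc_model() is a hypothetical name and it ignores the
    other conditions the real lockdep hook evaluates.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Illustrative bit values only. */
#define GFP_FS_MODEL        0x1u
#define GFP_NOLOCKDEP_MODEL 0x2u

/* Returns true if the reclaim annotation should record this allocation. */
static bool should_track_alloc_model(gfp_t gfp_mask)
{
	/* Allocations that cannot recurse into the FS are not interesting. */
	if (!(gfp_mask & GFP_FS_MODEL))
		return false;

	/* The caller vouches for this site: skip the tracking. */
	if (gfp_mask & GFP_NOLOCKDEP_MODEL)
		return false;

	return true;
}
```

    Unlike dropping the FS bit, the opt-out bit in this model changes only
    what is tracked, not how the allocation itself behaves.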

    While we are at it also make sure that the radix tree doesn't
    accidentally override tags stored in the upper part of the gfp_mask.

    Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Peter Zijlstra
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Vlastimil Babka
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: David Sterba
    Cc: Jan Kara
    Cc: Brian Foster
    Cc: Darrick J. Wong
    Cc: Nikolay Borisov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Introduce two helpers, is_migrate_highatomic() and is_migrate_highatomic_page().

    Simplify the code, no functional changes.
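
    Such helpers might look like the sketch below. The enum values and
    struct page_model are hypothetical stand-ins so the example compiles
    on its own; a real page does not carry its pageblock type in a field
    like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of migrate types, for illustration. */
enum migratetype_model {
	MT_UNMOVABLE,
	MT_MOVABLE,
	MT_RECLAIMABLE,
	MT_HIGHATOMIC,
};

/* Stand-in for a page whose pageblock type can be looked up. */
struct page_model { enum migratetype_model pageblock_mt; };

static inline bool is_migrate_highatomic(enum migratetype_model mt)
{
	return mt == MT_HIGHATOMIC;
}

static inline bool is_migrate_highatomic_page(const struct page_model *page)
{
	assert(page != NULL);
	return page->pageblock_mt == MT_HIGHATOMIC;
}
```

    Static inlines give the same zero-cost check as macros while keeping
    type checking on the argument.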

    [akpm@linux-foundation.org: use static inlines rather than macros, per mhocko]
    Link: http://lkml.kernel.org/r/58B94F15.6060606@huawei.com
    Signed-off-by: Xishi Qiu
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Cgroups currently don't report how much shmem they use, which can be
    useful data to have, in particular since shmem is included in the
    cache/file item while being reclaimed like anonymous memory.

    Add a counter to track shmem pages during charging and uncharging.
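
    The accounting can be sketched with a toy model. struct memcg_model
    and charge_statistics_model() are hypothetical names; the sketch
    assumes shmem pages are the non-anonymous pages that are swap backed,
    and uses a negative page count for uncharging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-cgroup counters, in pages. */
struct memcg_model {
	long rss;   /* anonymous memory */
	long cache; /* cache/file item, which includes shmem */
	long shmem; /* shmem subset of cache */
};

/* nr_pages > 0 charges, nr_pages < 0 uncharges. */
static void charge_statistics_model(struct memcg_model *memcg, bool anon,
				    bool swap_backed, long nr_pages)
{
	assert(memcg != NULL);
	if (anon) {
		memcg->rss += nr_pages;
		return;
	}
	memcg->cache += nr_pages;
	if (swap_backed) /* assumption: non-anon + swap backed means shmem */
		memcg->shmem += nr_pages;
}
```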

    Link: http://lkml.kernel.org/r/20170221164343.32252-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Chris Down
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When memory pressure is high, we free MADV_FREE pages. If the pages are
    not dirty in pte, the pages could be freed immediately. Otherwise we
    can't reclaim them. We put the pages back on the anonymous LRU list (by
    setting the SwapBacked flag) and the pages will be reclaimed in the
    normal swapout way.
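
    The two outcomes can be sketched as a small decision function. This
    is a user-space model only: struct lazyfree_page_model, the enum and
    reclaim_lazyfree_model() are hypothetical names and the LRU lists
    themselves are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum reclaim_result_model {
	RECLAIM_FREED,        /* clean lazyfree page: dropped immediately */
	RECLAIM_PUTBACK_ANON, /* dirty: treated as a normal anonymous page */
};

struct lazyfree_page_model {
	bool pte_dirty;   /* written to after MADV_FREE */
	bool swap_backed; /* stands in for the SwapBacked flag */
};

static enum reclaim_result_model
reclaim_lazyfree_model(struct lazyfree_page_model *page)
{
	assert(page != NULL);
	if (!page->pte_dirty)
		return RECLAIM_FREED;

	/* Dirty again: mark it swap backed so normal swapout handles it. */
	page->swap_backed = true;
	return RECLAIM_PUTBACK_ANON;
}
```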

    We use the normal page reclaim policy. Since MADV_FREE pages are put on
    the inactive file list, such pages and inactive file pages are reclaimed
    according to their age. This is expected, because we don't want to
    reclaim too many MADV_FREE pages before used-once pages.

    Based on Minchan's original patch

    [minchan@kernel.org: clean up lazyfree page handling]
    Link: http://lkml.kernel.org/r/20170303025237.GB3503@bbox
    Link: http://lkml.kernel.org/r/14b8eb1d3f6bf6cc492833f183ac8c304e560484.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Signed-off-by: Minchan Kim
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • madvise()'s MADV_FREE indicates that pages are 'lazyfree'. They are
    still anonymous pages, but they can be freed without pageout. To
    distinguish these from normal anonymous pages, we clear their
    SwapBacked flag.

    MADV_FREE pages can be freed without pageout, so they are pretty much
    like used-once file pages. For such pages, we'd like to reclaim them once
    there is memory pressure. Also it might be unfair to always reclaim
    MADV_FREE pages before used-once file pages, and we definitely want to
    reclaim the pages before other anonymous and file pages.

    To speed up MADV_FREE page reclaim, we put the pages on the
    LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE list
    is tiny nowadays and should be full of used-once file pages. Reclaiming
    MADV_FREE pages will not interfere much with anonymous and active
    file pages. And the inactive file pages and MADV_FREE pages will be
    reclaimed according to their age, so we don't reclaim too many MADV_FREE
    pages either. Putting the MADV_FREE pages on the LRU_INACTIVE_FILE list
    also means we can reclaim the pages without swap support. This idea was
    suggested by Johannes.

    This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE list
    yet, to avoid bisect failure. The next patch will do it.

    The patch is based on Minchan's original patch.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li