15 Dec, 2014

9 commits

  • Pull driver core update from Greg KH:
    "Here's the set of driver core patches for 3.19-rc1.

    They are dominated by the removal of the .owner field in platform
    drivers. They touch a lot of files, but they are "simple" changes,
    just removing a line in a structure.

    Other than that, a few minor driver core and debugfs changes. There
    are some ath9k patches coming in through this tree that have been
    acked by the wireless maintainers as they relied on the debugfs
    changes.

    Everything has been in linux-next for a while"

    * tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (324 commits)
    Revert "ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries"
    fs: debugfs: add forward declaration for struct device type
    firmware class: Deletion of an unnecessary check before the function call "vunmap"
    firmware loader: fix hung task warning dump
    devcoredump: provide a one-way disable function
    device: Add dev_<level>_once variants
    ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries
    ath: use seq_file api for ath9k debugfs files
    debugfs: add helper function to create device related seq_file
    drivers/base: cacheinfo: remove noisy error boot message
    Revert "core: platform: add warning if driver has no owner"
    drivers: base: support cpu cache information interface to userspace via sysfs
    drivers: base: add cpu_device_create to support per-cpu devices
    topology: replace custom attribute macros with standard DEVICE_ATTR*
    cpumask: factor out show_cpumap into separate helper function
    driver core: Fix unbalanced device reference in drivers_probe
    driver core: fix race with userland in device_add()
    sysfs/kernfs: make read requests on pre-alloc files use the buffer.
    sysfs/kernfs: allow attributes to request write buffer be pre-allocated.
    fs: sysfs: return EFBIG on write if offset is larger than file size
    ...

    Linus Torvalds
     
  • Pull tty/serial driver updates from Greg KH:
    "Here's the big tty/serial driver update for 3.19-rc1.

    There are a number of TTY core changes/fixes in here from Peter Hurley
    that have all been tested in linux-next for a long time now. There are
    also the normal serial driver updates as well, full details in the
    changelog below"

    * tag 'tty-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (219 commits)
    serial: pxa: hold port.lock when reporting modem line changes
    tty-hvsi_lib: Deletion of an unnecessary check before the function call "tty_kref_put"
    tty: Deletion of unnecessary checks before two function calls
    n_tty: Fix read_buf race condition, increment read_head after pushing data
    serial: of-serial: add PM suspend/resume support
    Revert "serial: of-serial: add PM suspend/resume support"
    Revert "serial: of-serial: fix up PM ops on no_console_suspend and port type"
    serial: 8250: don't attempt a trylock if in sysrq
    serial: core: Add big-endian iotype
    serial: samsung: use port->fifosize instead of hardcoded values
    serial: samsung: prefer to use fifosize from driver data
    serial: samsung: fix style problems
    serial: samsung: wait for transfer completion before clock disable
    serial: icom: fix error return code
    serial: tegra: clean up tty-flag assignments
    serial: Fix io address assign flow with Fintek PCI-to-UART Product
    serial: mxs-auart: fix tx_empty against shift register
    serial: mxs-auart: fix gpio change detection on interrupt
    serial: mxs-auart: Fix mxs_auart_set_ldisc()
    serial: 8250_dw: Use 64-bit access for OCTEON.
    ...

    Linus Torvalds
     
  • Pull USB updates from Greg KH:
    "Here's the big set of USB and PHY patches for 3.19-rc1.

    The normal churn in the USB gadget area is in here, as well as xhci
    and other individual USB driver updates. The PHY tree is also in
    here, as there were dependencies on the USB tree.

    All of these have been in linux-next"

    * tag 'usb-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (351 commits)
    arm: omap3: twl: remove usb phy init data
    usbip: fix error handling in stub_probe()
    usb: gadget: udc: missing curly braces
    USB: mos7720: delete some unneeded code
    wusb: replace memset by memzero_explicit
    usbip: remove unneeded structure
    usb: xhci: fix comment for PORT_DEV_REMOVE
    xhci: don't use the same variable for stopped and halted rings current TD
    xhci: clear extra bits from slot context when setting max exit latency
    xhci: cleanup finish_td function
    USB: adutux: NULL dereferences on disconnect
    usb: chipidea: fix platform_no_drv_owner.cocci warnings
    usb: chipidea: Fixed a few typos in comments
    Documentation: bindings: add doc for the USB2 ChipIdea USB driver
    usb: chipidea: add a usb2 driver for ci13xxx
    usb: chipidea: fix phy handling
    usb: chipidea: remove duplicate dev_set_drvdata for host_start
    usb: chipidea: parameter 'mode' isn't needed for hw_device_reset
    usb: chipidea: add controller reset API
    usb: chipidea: remove flag CI_HDRC_REQUIRE_TRANSCEIVER
    ...

    Linus Torvalds
     
  • Pull squashfs update from Phillip Lougher:
    "These patches optionally add LZ4 compression support to Squashfs.

    LZ4 is a lightweight compression algorithm which can be used on
    embedded systems to reduce CPU and memory overhead (in comparison to
    the standard zlib compression).

    These patches add the wrapper code to allow Squashfs to use the
    existing LZ4 decompression code, and the necessary configuration
    option"

    * tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next:
    Squashfs: Add LZ4 compression configuration option
    Squashfs: add LZ4 compression support

    Linus Torvalds
     
  • Pull take two of the GPIO updates:
    "Same stuff as last time, now with a fixup patch for the previous
    compile error plus I ran a few extra rounds of compile-testing.

    This is the bulk of GPIO changes for the v3.19 series:

    - A new API that allows setting more than one GPIO at a time. This
    is implemented for the new descriptor-based API only and makes it
    possible to e.g. toggle a clock and data line at the same time, if
    the hardware can do this with a single register write. Both
    consumers and drivers need new calls, and the core will fall back
    to driving individual lines where needed. Implemented for the
    MPC8xxx driver initially

    - Patched the mdio-mux-gpio and the serial mctrl driver that drives
    modems to use the new multiple-setting API to set several signals
    simultaneously

    - Get rid of the global GPIO descriptor array, and instead allocate
    descriptors dynamically for each GPIO on a certain GPIO chip. This
    moves us closer to getting rid of the limitation of using the
    global, static GPIO numberspace

    - New driver and device tree bindings for 74xx ICs

    - New driver and device tree bindings for the VF610 Vybrid

    - Support the RCAR r8a7793 and r8a7794

    - Guidelines for GPIO device tree bindings trying to get things a bit
    more strict with the advent of combined device properties

    - Suspend/resume support for the MVEBU driver

    - A slew of minor fixes and improvements"
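
The multiple-line set described in the first item above can be pictured as one masked register update. What follows is a minimal user-space model in Python, not kernel code: the `FakeGpioChip` class, its `reg` and `writes` attributes, and both method names are hypothetical, and the fallback loop only imitates how a core could drive lines one at a time when a chip has no batched operation.

```python
class FakeGpioChip:
    """Toy model of a GPIO bank backed by a single output register."""

    def __init__(self, supports_multiple=True):
        self.reg = 0        # current value of the (imaginary) output register
        self.writes = 0     # how many register writes have been issued
        self.supports_multiple = supports_multiple

    def _write(self, value):
        self.reg = value
        self.writes += 1

    def set_one(self, offset, level):
        """Drive a single line: one register write per call."""
        bit = 1 << offset
        self._write((self.reg | bit) if level else (self.reg & ~bit))

    def set_multiple(self, mask, bits):
        """Drive every line selected by `mask` to the level found in `bits`."""
        if self.supports_multiple:
            # All selected lines change in the same register write.
            self._write((self.reg & ~mask) | (bits & mask))
            return
        # Fallback: no batched operation, so drive the lines one by one.
        offset = 0
        while mask >> offset:
            if (mask >> offset) & 1:
                self.set_one(offset, (bits >> offset) & 1)
            offset += 1


if __name__ == "__main__":
    chip = FakeGpioChip()
    chip.set_multiple(0b0011, 0b0001)   # e.g. data low, clock high, together
    print(bin(chip.reg), chip.writes)
```

With the batched path, a clock line and a data line end up in their new states after a single write; with the fallback, the same call costs one write per selected line.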

    * tag 'gpio-v3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (33 commits)
    gpio: mcp23s08: fix up compilation error
    gpio: pl061: document gpio-ranges property for bindings file
    gpio: pl061: hook request if gpio-ranges available
    gpio: mcp23s08: Add option to configure IRQ output polarity as active high
    gpio: fix deferred probe detection for legacy API
    serial: mctrl_gpio: use gpiod_set_array function
    mdio-mux-gpio: Use GPIO descriptor interface and new gpiod_set_array function
    gpio: remove const modifier from gpiod_get_direction()
    gpio: remove gpio_descs global array
    gpio: mxs: implement get_direction callback
    gpio: em: Use dynamic allocation of GPIOs
    gpio: Check if base is positive before calling gpio_is_valid()
    gpio: mcp23s08: Add simple IRQ support for SPI devices
    gpio: mcp23s08: request a shared interrupt
    gpio: mcp23s08: Do not free unrequested interrupt
    gpio: rcar: Add r8a7793 and r8a7794 support
    gpio-mpc8xxx: add mpc8xxx_gpio_set_multiple function
    gpiolib: allow simultaneous setting of multiple GPIO outputs
    gpio: mvebu: add suspend/resume support
    gpio: gpio-davinci: remove duplicate check on resource
    ...

    Linus Torvalds
     
  • Pull aio updates from Benjamin LaHaise.

    * git://git.kvack.org/~bcrl/aio-next:
    aio: Skip timer for io_getevents if timeout=0
    aio: Make it possible to remap aio ring

    Linus Torvalds
     
  • Pull i2c updates from Wolfram Sang:
    "For 3.19, the I2C subsystem has to offer special candy this time.
    Right in time for Christmas :)

    - I2C slave framework: finally, a generic mechanism for Linux being
    an I2C slave (if the bus driver supports that). Docs are still
    missing but will come later this cycle, the code is good enough to
    go.
    - I2C muxes represent their topology in sysfs in much more detail.
    This will help users navigate around much more easily.
    - irq population of i2c clients is now done at probe time, not device
    creation time, to have better support for deferred probing.
    - new drivers for Imagination SCB, Amlogic Meson
    - DMA support added for Freescale IMX, Renesas SHMobile
    - slightly bigger driver updates to OMAP, i801, AT91, and rk3x
    (mostly quirk handling, timing updates, and using better kernel
    interfaces)
    - eeprom driver can now write with byte-access (very slow, but OK to
    have)
    - and the bunch of smaller fixes, cleanups, ID updates..."

    * 'i2c/for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (56 commits)
    i2c: sh_mobile: remove unneeded DMA mask
    i2c: rcar: add slave support
    i2c: slave-eeprom: add eeprom simulator driver
    i2c: core changes for slave support
    MAINTAINERS: add I2C dt bindings also to I2C realm
    i2c: designware: Fix falling time bindings doc
    i2c: davinci: switch to use platform_get_irq
    Documentation: i2c: Use PM ops instead of legacy suspend/resume
    i2c: sh_mobile: optimize irq entry
    i2c: pxa: add support for SCCB devices
    omap: i2c: don't check bus state IP rev3.3 and earlier
    i2c: s3c2410: Handle i2c sys_cfg register in i2c driver
    i2c: rk3x: add Kconfig dependency on COMMON_CLK
    i2c: omap: add notes related to i2c multimaster mode
    i2c: omap: don't reset controller if Arbitration Lost detected
    i2c: omap: implement workaround for handling invalid BB-bit values
    i2c: omap: cleanup register definitions
    i2c: rk3x: handle dynamic clock rate changes correctly
    i2c: at91: enable probe deferring on dma channel request
    i2c: at91: remove legacy DMA support
    ...

    Linus Torvalds
     
  • Pull md updates from Neil Brown:
    "Three fixes for md.

    I did have a largish set of locking changes queued, but late testing
    showed they weren't quite as stable as I thought and while I fixed
    what I found, I decided it safer to delay them a release ...
    particularly as I'll be AFK for a few weeks. So expect a larger batch
    next time :-)"

    * tag 'md/3.19' of git://neil.brown.name/md:
    md: Check MD_RECOVERY_RUNNING as well as ->sync_thread.
    md: fix semicolon.cocci warnings
    md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants.

    Linus Torvalds
     
  • Pull x86 fixes from Ingo Molnar:
    "Misc fixes (mainly Andy's TLS fixes), plus a cleanup"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/tls: Disallow unusual TLS segments
    x86/tls: Validate TLS entries to protect espfix
    MAINTAINERS: Add me as x86 VDSO submaintainer
    x86/asm: Unify segment selector defines
    x86/asm: Guard against building the 32/64-bit versions of the asm-offsets*.c file directly
    x86_64, switch_to(): Load TLS descriptors before switching DS and ES
    x86/mm: Use min() instead of min_t() in the e820 printout code
    x86/mm: Fix zone ranges boot printout
    x86/doc: Update documentation after file shuffling

    Linus Torvalds
     

14 Dec, 2014

31 commits

  • Users have no business installing custom code segments into the
    GDT, and segments that are not present but are otherwise valid
    are a historical source of interesting attacks.

    For completeness, block attempts to set the L bit. (Prior to
    this patch, the L bit would have been silently dropped.)

    This is an ABI break. I've checked glibc, musl, and Wine, and
    none of them look like they'll have any trouble.

    Note to stable maintainers: this is a hardening patch that fixes
    no known bugs. Given the possibility of ABI issues, this
    probably shouldn't be backported quickly.
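
The three rejections described above (code segments, not-present segments, and the L bit) can be sketched as a simple predicate. This is an illustrative Python model, not the kernel's implementation: the field names follow `struct user_desc`, but the `UserDesc` class and the `tls_segment_allowed()` function are hypothetical, and the sketch deliberately ignores the special descriptor used to clear a TLS slot.

```python
from dataclasses import dataclass

# Values of the 2-bit `contents` field: 0 is a data segment, 1 is an
# expand-down data segment, 2 and 3 are code segments.
CONTENTS_DATA = 0
CONTENTS_STACK = 1
CONTENTS_CODE = 2


@dataclass
class UserDesc:
    """Only the descriptor fields that the checks below look at."""
    contents: int = CONTENTS_DATA
    seg_not_present: bool = False
    lm: bool = False    # the "L" (long mode) bit


def tls_segment_allowed(desc: UserDesc) -> bool:
    """Return True if the descriptor is an ordinary, present data segment."""
    if desc.contents > CONTENTS_STACK:
        return False    # custom code segment
    if desc.seg_not_present:
        return False    # not present but otherwise valid
    if desc.lm:
        return False    # L bit: rejected instead of silently dropped
    return True


if __name__ == "__main__":
    print(tls_segment_allowed(UserDesc()))
    print(tls_segment_allowed(UserDesc(contents=CONTENTS_CODE)))
```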

    Signed-off-by: Andy Lutomirski
    Acked-by: H. Peter Anvin
    Cc: stable@vger.kernel.org # optional
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: security@kernel.org
    Cc: Willy Tarreau
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     
  • Installing a 16-bit RW data segment into the GDT defeats espfix.
    AFAICT this will not affect glibc, Wine, or dosemu at all.

    Signed-off-by: Andy Lutomirski
    Acked-by: H. Peter Anvin
    Cc: stable@vger.kernel.org
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: security@kernel.org
    Cc: Willy Tarreau
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     
  • Here goes... :)

    Signed-off-by: Andy Lutomirski
    Acked-by: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1042001e502f8e0deb0edfeeac209b68378650cf.1418430292.git.luto@amacapital.net
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     
  • In this case, io_getevents is basically polling. Let's not involve the
    timer at all, because that would hurt performance for application event
    loops.

    In an arbitrary test I've done, the io_getevents syscall's elapsed time
    drops from 50000+ nanoseconds to a few hundred.
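
The idea of skipping the timer for a zero timeout can be modelled in user space. The sketch below is a Python analogy only: `get_events()` and its `ring` argument are hypothetical stand-ins, and `time.sleep()` merely imitates blocking until a completion arrives. The point it illustrates is that the poll case returns before any deadline bookkeeping is set up.

```python
import time
from collections import deque


def get_events(ring, min_nr, nr, timeout):
    """Collect up to `nr` completions from `ring`, a deque of finished events.

    `timeout` is in seconds: 0 means "poll once", None means "no deadline".
    """
    events = []

    def reap():
        # Move completions out of the ring until `nr` have been collected.
        while ring and len(events) < nr:
            events.append(ring.popleft())

    reap()
    if timeout == 0 or len(events) >= min_nr:
        # Fast path: nothing to wait for, so no deadline is ever set up.
        return events

    # Slow path: only now pay for the deadline bookkeeping.
    deadline = None if timeout is None else time.monotonic() + timeout
    while len(events) < min_nr:
        if deadline is not None and time.monotonic() >= deadline:
            break
        time.sleep(0.001)   # stand-in for sleeping until a completion arrives
        reap()
    return events


if __name__ == "__main__":
    print(get_events(deque(["a", "b", "c"]), 1, 2, 0))
```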

    Signed-off-by: Fam Zheng
    Signed-off-by: Benjamin LaHaise

    Fam Zheng
     
  • There are actually two issues this patch addresses. Let me start with
    the one I tried to solve in the beginning.

    So, in the checkpoint-restore project (criu) we try to dump tasks'
    state and restore one back exactly as it was. One of the tasks' state
    bits is rings set up with io_setup() call. There's (almost) no problems
    in dumping them, there's a problem restoring them -- if I dump a task
    with aio ring originally mapped at address A, I want to restore one
    back at exactly the same address A. Unfortunately, the io_setup() does
    not allow for that -- it mmaps the ring at whatever place mm finds
    appropriate (it calls do_mmap_pgoff() with zero address and without
    the MAP_FIXED flag).

    To make restore possible I'm going to mremap() the freshly created ring
    into the address A (under which it was seen before dump). The problem is
    that the ring's virtual address is passed back to the user-space as the
    context ID and this ID is then used as search key by all the other io_foo()
    calls. Reworking this ID to be just some integer doesn't seem to work, as
    this value is already used by libaio as a pointer using which this library
    accesses memory for aio meta-data.

    So, to make restore work we need to make sure that

    a) ring is mapped at desired virtual address
    b) kioctx->user_id matches this value

    Having said that, the patch makes mremap() on aio region update the
    kioctx's user_id and mmap_base values.

    Here appears the 2nd issue I mentioned in the beginning of this mail.
    If (regardless of the C/R dances I do) someone creates an io context
    with io_setup(), then mremap()-s the ring and then destroys the context,
    the kill_ioctx() routine will call munmap() on wrong (old) address.
    This will result in a) the aio ring remaining in memory and b) some other
    vma getting unexpectedly unmapped.

    What do you think?
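
Both issues can be reproduced with a toy model in which contexts are looked up by ring address. This Python sketch is purely illustrative: the `AioContextTable` class and its methods are hypothetical and only borrow the names `user_id` and `mmap_base` from the description above. The `update_ctx` flag switches between the old behaviour (context left untouched by a remap) and the behaviour the patch proposes.

```python
class AioContextTable:
    """Toy stand-in for a process's aio contexts, keyed by ring address."""

    def __init__(self):
        self.by_user_id = {}    # context ID (ring address) -> context record
        self.mapped = set()     # addresses that currently hold a ring mapping

    def io_setup(self, addr):
        """Create a context whose ID is the address the ring was mapped at."""
        self.by_user_id[addr] = {"user_id": addr, "mmap_base": addr}
        self.mapped.add(addr)
        return addr

    def mremap(self, old, new, update_ctx=True):
        """Move the ring; optionally keep the context in sync with it."""
        self.mapped.remove(old)
        self.mapped.add(new)
        if update_ctx:
            ctx = self.by_user_id.pop(old)
            ctx["user_id"] = ctx["mmap_base"] = new
            self.by_user_id[new] = ctx

    def io_destroy(self, ctx_id):
        """Tear down a context and unmap whatever base address it recorded."""
        ctx = self.by_user_id.pop(ctx_id)
        self.mapped.discard(ctx["mmap_base"])


if __name__ == "__main__":
    table = AioContextTable()
    table.io_setup(0x1000)
    table.mremap(0x1000, 0x2000, update_ctx=False)
    table.io_destroy(0x1000)
    print(sorted(hex(a) for a in table.mapped))   # the moved ring is leaked
```

Without the update, the context can only be found under its old address and destroying it leaves the moved ring mapped; with the update, the new address works as the context ID and teardown unmaps the right region.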

    Signed-off-by: Pavel Emelyanov
    Acked-by: Dmitry Monakhov
    Signed-off-by: Benjamin LaHaise

    Pavel Emelyanov
     
  • Pull block layer driver updates from Jens Axboe:

    - NVMe updates:
    - The blk-mq conversion from Matias (and others)

    - A stack of NVMe bug fixes from the nvme tree, mostly from Keith.

    - Various bug fixes from me, fixing issues in both the blk-mq
    conversion and generic bugs.

    - Abort and CPU online fix from Sam.

    - Hot add/remove fix from Indraneel.

    - A couple of drbd fixes from the drbd team (Andreas, Lars, Philipp)

    - With the generic IO stat accounting from 3.19/core, converting md,
    bcache, and rsxx to use those. From Gu Zheng.

    - Boundary check for queue/irq mode for null_blk from Matias. Fixes
    cases where invalid values could be given, causing the device to hang.

    - The xen blkfront pull request, with two bug fixes from Vitaly.

    * 'for-3.19/drivers' of git://git.kernel.dk/linux-block: (56 commits)
    NVMe: fix race condition in nvme_submit_sync_cmd()
    NVMe: fix retry/error logic in nvme_queue_rq()
    NVMe: Fix FS mount issue (hot-remove followed by hot-add)
    NVMe: fix error return checking from blk_mq_alloc_request()
    NVMe: fix freeing of wrong request in abort path
    xen/blkfront: remove redundant flush_op
    xen/blkfront: improve protection against issuing unsupported REQ_FUA
    NVMe: Fix command setup on IO retry
    null_blk: boundary check queue_mode and irqmode
    block/rsxx: use generic io stats accounting functions to simplify io stat accounting
    md: use generic io stats accounting functions to simplify io stat accounting
    drbd: use generic io stats accounting functions to simplify io stat accounting
    md/bcache: use generic io stats accounting functions to simplify io stat accounting
    NVMe: Update module version major number
    NVMe: fail pci initialization if the device doesn't have any BARs
    NVMe: add ->exit_hctx() hook
    NVMe: make setup work for devices that don't do INTx
    NVMe: enable IO stats by default
    NVMe: nvme_submit_async_admin_req() must use atomic rq allocation
    NVMe: replace blk_put_request() with blk_mq_free_request()
    ...

    Linus Torvalds
     
  • Pull block driver core update from Jens Axboe:
    "This is the pull request for the core block IO changes for 3.19. Not
    a huge round this time, mostly lots of little good fixes:

    - Fix a bug in sysfs blktrace interface causing a NULL pointer
    dereference, when enabled/disabled through that API. From Arianna
    Avanzini.

    - Various updates/fixes/improvements for blk-mq:

    - A set of updates from Bart, mostly fixing bugs in the tag
    handling.

    - Cleanup/code consolidation from Christoph.

    - Extend queue_rq API to be able to handle batching issues of IO
    requests. NVMe will utilize this shortly. From me.

    - A few tag and request handling updates from me.

    - Cleanup of the preempt handling for running queues from Paolo.

    - Prevent running of unmapped hardware queues from Ming Lei.

    - Move the kdump memory limiting check to be in the correct
    location, from Shaohua.

    - Initialize all software queues at init time from Takashi. This
    prevents a kobject warning when CPUs are brought online that
    weren't online when a queue was registered.

    - Single writeback fix for I_DIRTY clearing from Tejun. Queued with
    the core IO changes, since it's just a single fix.

    - Version X of the __bio_add_page() segment addition retry from
    Maurizio. Hope the Xth time is the charm.

    - Documentation fixup for IO scheduler merging from Jan.

    - Introduce (and use) generic IO stat accounting helpers for non-rq
    drivers, from Gu Zheng.

    - Kill off artificial limiting of max sectors in a request from
    Christoph"

    * 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
    bio: modify __bio_add_page() to accept pages that don't start a new segment
    blk-mq: Fix uninitialized kobject at CPU hotplugging
    blktrace: don't let the sysfs interface remove trace from running list
    blk-mq: Use all available hardware queues
    blk-mq: Micro-optimize bt_get()
    blk-mq: Fix a race between bt_clear_tag() and bt_get()
    blk-mq: Avoid that __bt_get_word() wraps multiple times
    blk-mq: Fix a use-after-free
    blk-mq: prevent unmapped hw queue from being scheduled
    blk-mq: re-check for available tags after running the hardware queue
    blk-mq: fix hang in bt_get()
    blk-mq: move the kdump check to blk_mq_alloc_tag_set
    blk-mq: cleanup tag free handling
    blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq cpu map
    blk: introduce generic io stat accounting help function
    blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
    genhd: check for int overflow in disk_expand_part_tbl()
    blk-mq: add blk_mq_free_hctx_request()
    blk-mq: export blk_mq_free_request()
    blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
    ...

    Linus Torvalds
     
  • Pull tracing fixlet from Steven Rostedt:
    "Remove unnecessary preempt_disable in printk()"

    * tag 'trace-seq-buf-3.19-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    printk: Do not disable preemption for accessing printk_func

    Linus Torvalds
     
  • Pull tracing fixes from Steven Rostedt:
    "Here's two fixes:

    1) Discovered by Fengguang Wu's tests. I changed the parameters to
    the function graph x86 prepare_ftrace_return call but forgot to
    update the call from entry_32 (i386 version). This patch corrects
    that.

    2) I was tracing some code and found that the sched_switch tracepoint
    was showing tasks in the INTERRUPTIBLE state as RUNNING. This was
    due to the updates to convert preempt_count into a per_cpu
    variable. The tracepoint logic was made to use the task's
    saved_preempt_count which could hold a stale "PREEMPT_ACTIVE",
    instead of using the current preempt_count() call"

    * tag 'trace-fixes-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing/sched: Check preempt_count() for current when reading task->state
    ftrace/x86: Update i386 call to prepare_ftrace_return()

    Linus Torvalds
     
  • Pull audit updates from Paul Moore:
    "Two small patches from the audit next branch; only one of which has
    any real significant code changes, the other is simply a MAINTAINERS
    update for audit.

    The single code patch is pretty small and rather straightforward, it
    changes the audit "version" number reported to userspace from an
    integer to a bitmap which is used to indicate the functionality of the
    running kernel. This really doesn't have much impact on the kernel,
    but it will make life easier for the audit userspace folks.

    Thankfully we were still on a version number which allowed us to do
    this without breaking userspace"
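
From the userspace side, a feature bitmap turns "which version is this?" into a per-feature bit test. The following Python sketch shows the general pattern only: the `AuditFeature` flag names and their values are illustrative assumptions rather than the kernel's definitions, and `supports()` is a hypothetical helper.

```python
from enum import IntFlag


class AuditFeature(IntFlag):
    """Illustrative feature bits; names and values are assumptions."""
    BACKLOG_LIMIT = 0x1
    BACKLOG_WAIT_TIME = 0x2


def supports(status_value: int, feature: AuditFeature) -> bool:
    """Test one feature bit in the value reported by the kernel."""
    return bool(status_value & feature)


if __name__ == "__main__":
    reported = 0x3   # pretend the kernel reported both bits
    for feature in AuditFeature:
        print(feature.name, supports(reported, feature))
```

Each capability is checked on its own, so a tool does not need to know which other features happen to exist in the running kernel.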

    * 'upstream' of git://git.infradead.org/users/pcmoore/audit:
    audit: convert status version to a feature bitmap
    audit: add Paul Moore to the MAINTAINERS entry

    Linus Torvalds
     
  • Pull crypto update from Herbert Xu:
    - The crypto API is now documented :)
    - Disallow arbitrary module loading through crypto API.
    - Allow get request with empty driver name through crypto_user.
    - Allow speed testing of arbitrary hash functions.
    - Add caam support for ctr(aes), gcm(aes) and their derivatives.
    - nx now supports concurrent hashing properly.
    - Add sahara support for SHA1/256.
    - Add ARM64 version of CRC32.
    - Misc fixes.

    * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (77 commits)
    crypto: tcrypt - Allow speed testing of arbitrary hash functions
    crypto: af_alg - add user space interface for AEAD
    crypto: qat - fix problem with coalescing enable logic
    crypto: sahara - add support for SHA1/256
    crypto: sahara - replace tasklets with kthread
    crypto: sahara - add support for i.MX53
    crypto: sahara - fix spinlock initialization
    crypto: arm - replace memset by memzero_explicit
    crypto: powerpc - replace memset by memzero_explicit
    crypto: sha - replace memset by memzero_explicit
    crypto: sparc - replace memset by memzero_explicit
    crypto: algif_skcipher - initialize upon init request
    crypto: algif_skcipher - removed unneeded code
    crypto: algif_skcipher - Fixed blocking recvmsg
    crypto: drbg - use memzero_explicit() for clearing sensitive data
    crypto: drbg - use MODULE_ALIAS_CRYPTO
    crypto: include crypto- module prefix in template
    crypto: user - add MODULE_ALIAS
    crypto: sha-mb - remove a bogus NULL check
    crypto: qat - Fix 64 bytes requests
    ...

    Linus Torvalds
     
  • Merge second patchbomb from Andrew Morton:
    - the rest of MM
    - misc fs fixes
    - add execveat() syscall
    - new ratelimit feature for fault-injection
    - decompressor updates
    - ipc/ updates
    - fallocate feature creep
    - fsnotify cleanups
    - a few other misc things

    * emailed patches from Andrew Morton: (99 commits)
    cgroups: Documentation: fix trivial typos and wrong paragraph numberings
    parisc: percpu: update comments referring to __get_cpu_var
    percpu: update local_ops.txt to reflect this_cpu operations
    percpu: remove __get_cpu_var and __raw_get_cpu_var macros
    fsnotify: remove destroy_list from fsnotify_mark
    fsnotify: unify inode and mount marks handling
    fallocate: create FAN_MODIFY and IN_MODIFY events
    mm/cma: make kmemleak ignore CMA regions
    slub: fix cpuset check in get_any_partial
    slab: fix cpuset check in fallback_alloc
    shmdt: use i_size_read() instead of ->i_size
    ipc/shm.c: fix overly aggressive shmdt() when calls span multiple segments
    ipc/msg: increase MSGMNI, remove scaling
    ipc/sem.c: increase SEMMSL, SEMMNI, SEMOPM
    ipc/sem.c: change memory barrier in sem_lock() to smp_rmb()
    lib/decompress.c: consistency of compress formats for kernel image
    decompress_bunzip2: off by one in get_next_block()
    usr/Kconfig: make initrd compression algorithm selection not expert
    fault-inject: add ratelimit option
    ratelimit: add initialization macro
    ...

    Linus Torvalds
     
  • Signed-off-by: SeongJae Park
    Cc: Jonathan Corbet
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeongJae Park
     
  • __get_cpu_var was removed. Update comments to refer to
    this_cpu_ptr() instead.

    Signed-off-by: Christoph Lameter
    Cc: "James E.J. Bottomley"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Update the documentation to reflect changes due to the availability of
    this_cpu operations.

    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • No user is left in the kernel source tree. Therefore we can drop the
    definitions.

    This is the final merge of the transition away from __get_cpu_var. After
    this patch the kernel will not build if anyone uses __get_cpu_var.

    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • destroy_list is used to track marks which still need waiting for srcu
    period end before they can be freed. However by the time mark is added to
    destroy_list it isn't in group's list of marks anymore and thus we can
    reuse fsnotify_mark->g_list for queueing into destroy_list. This saves
    two pointers for each fsnotify_mark.

    Signed-off-by: Jan Kara
    Cc: Eric Paris
    Cc: Heinrich Schuchardt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • There's a lot of common code in inode and mount marks handling. Factor it
    out to a common helper function.

    Signed-off-by: Jan Kara
    Cc: Eric Paris
    Cc: Heinrich Schuchardt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • The fanotify and the inotify API can be used to monitor changes of the
    file system. System call fallocate() modifies files. Hence it should
    trigger the corresponding fanotify (FAN_MODIFY) and inotify (IN_MODIFY)
    events. The most interesting case is FALLOC_FL_COLLAPSE_RANGE because
    this value allows to create arbitrary file content from random data.

    This patch adds the missing call to fsnotify_modify().

    The FAN_MODIFY and IN_MODIFY events will be created when fallocate()
    succeeds. They will even be created if the file length remains unchanged,
    e.g. when calling fallocate() with flag FALLOC_FL_KEEP_SIZE.

    This logic was primarily chosen to keep the coding simple.

    It resembles the logic of the write() system call.

    When we call write() we always create a FAN_MODIFY event, even in the case
    of overwriting with identical data.

    Events FAN_MODIFY and IN_MODIFY do not provide any guarantee that data was
    actually changed.

    Furthermore, even if the file size remains unchanged, fallocate() may
    influence whether a subsequent write() will succeed and hence the
    fallocate() call may be considered a modification.

    The fallocate(2) man page teaches: After a successful call, subsequent
    writes into the range specified by offset and len are guaranteed not to
    fail because of lack of disk space.

    So calling fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) may result in
    different outcomes of a subsequent write depending on the values of offset
    and len.

    Signed-off-by: Heinrich Schuchardt
    Reviewed-by: Jan Kara
    Cc: Jan Kara
    Cc: Alexander Viro
    Cc: Eric Paris
    Cc: John McCutchan
    Cc: Robert Love
    Cc: Michael Kerrisk
    Cc: Theodore Ts'o
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heinrich Schuchardt
     
  • kmemleak will add allocations as objects to a pool. The memory allocated
    for each object in this pool is periodically searched for pointers to
    other allocated objects. This only works for memory that is mapped into
    the kernel's virtual address space, which happens not to be the case for
    most CMA regions.

    Furthermore, CMA regions are typically used to store data transferred to
    or from a device and therefore don't contain pointers to other objects.

    Without this, the kernel crashes on the first execution of the
    scan_gray_list() because it tries to access highmem. Perhaps a more
    appropriate fix would be to reject any object that can't map to a kernel
    virtual address?

    [akpm@linux-foundation.org: add comment]
    [akpm@linux-foundation.org: fix comment, per Catalin]
    [sfr@canb.auug.org.au: include linux/io.h for phys_to_virt()]
    Signed-off-by: Thierry Reding
    Cc: Michal Nazarewicz
    Cc: Marek Szyprowski
    Cc: Joonsoo Kim
    Cc: "Aneesh Kumar K.V"
    Cc: Catalin Marinas
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thierry Reding
     
  • If we fail to allocate from the current node's stock, we look for free
    objects on other nodes before calling the page allocator (see
    get_any_partial). While checking other nodes we respect cpuset
    constraints by calling cpuset_zone_allowed, and we enforce the hardwall
    check. As a result, we will fall back to the page allocator even if there
    are some pages cached on other nodes, when the current cpuset doesn't have
    those nodes set. However, the page allocator uses the softwall check for
    kernel allocations, so it may allocate from one of the other nodes in this
    case.

    Therefore we should use softwall cpuset check in get_any_partial to
    conform with the cpuset check in the page allocator.

    Signed-off-by: Vladimir Davydov
    Acked-by: Zefan Li
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
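The difference between the two checks can be sketched with a small user-space model. This is an illustration under stated assumptions, not the kernel's code: `struct cpuset_model`, `node_allowed_hardwall()` and `node_allowed_softwall()` are hypothetical names, nodes are reduced to bits in a mask, and the softwall rule is simplified to "climb to the nearest ancestor that has mem_hardwall set, or to the root".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a cpuset. */
struct cpuset_model {
    unsigned long mems_allowed;          /* bit n set => node n allowed */
    bool mem_hardwall;                   /* models cpuset.mem_hardwall */
    const struct cpuset_model *parent;   /* NULL for the root cpuset */
};

/* Hardwall check: only the task's own cpuset counts. */
bool node_allowed_hardwall(const struct cpuset_model *cs, int node)
{
    return (cs->mems_allowed >> node) & 1UL;
}

/* Softwall check (simplified): if the task's own cpuset does not allow
 * the node, the request may escape to the nearest hardwalled ancestor. */
bool node_allowed_softwall(const struct cpuset_model *cs, int node)
{
    if (node_allowed_hardwall(cs, node))
        return true;
    while (!cs->mem_hardwall && cs->parent != NULL)
        cs = cs->parent;
    return (cs->mems_allowed >> node) & 1UL;
}
```

In this model, a task confined to node 0 under a root that allows nodes 0 and 1 fails the hardwall check for node 1 but passes the softwall check, unless its own cpuset sets mem_hardwall.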
     
  • fallback_alloc is called on kmalloc if the preferred node doesn't have
    free or partial slabs and there are no pages on the node's free list
    (GFP_THISNODE allocations fail). Before invoking the reclaimer it tries
    to locate a free or partial slab on other allowed nodes' lists. While
    iterating over the preferred node's zonelist it skips those zones for
    which the hardwall cpuset check returns false. That means that for a task
    bound to a specific node using cpusets, fallback_alloc will always ignore
    free slabs on other nodes and go directly to the reclaimer, which, however,
    may allocate from other nodes if cpuset.mem_hardwall is unset (default).
    As a result, lists of free slabs may grow without bound on other nodes,
    which is bad, because inactive slabs are only evicted by cache_reap at a
    very slow rate and cannot be dropped forcefully.

    To reproduce the issue, run a process that will walk over a directory tree
    with lots of files inside a cpuset bound to a node that constantly
    experiences memory pressure. Look at num_slabs vs active_slabs growth as
    reported by /proc/slabinfo.

    To avoid this we should use softwall cpuset check in fallback_alloc.

    Signed-off-by: Vladimir Davydov
    Acked-by: Zefan Li
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Andrew Morton noted

    http://lkml.kernel.org/r/20141104142027.a7a0d010772d84560b445f59@linux-foundation.org

    that shmdt() uses inode->i_size without holding i_mutex. There is one
    more such use in shm.c, in shm_destroy(). This converts both users
    over to i_size_read().

    Signed-off-by: Dave Hansen
    Cc: Manfred Spraul
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • This is a highly-contrived scenario. But, a single shmdt() call can be
    induced into unmapping memory from multiple shm segments. Example code
    is here:

    http://www.sr71.net/~dave/intel/shmfun.c

    The fix is pretty simple: Record the 'struct file' for the first VMA we
    encounter and then stick to it. Decline to unmap anything not from the
    same file and thus the same segment.

    I found this by inspection and the odds of anyone hitting this in practice
    are pretty darn small.

    Lightly tested, but it's a pretty small patch.

    Signed-off-by: Dave Hansen
    Cc: Manfred Spraul
    Reviewed-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
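The "record the first file and stick to it" rule described above can be sketched in user space. Everything here is a hypothetical model rather than the kernel's shmdt(): `struct shm_file_model`, `struct vma_model` and `shmdt_model()` are invented names, the VMA list is assumed to be a sorted array, and the function only counts the mappings it would unmap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a shm segment's file and for a VMA. */
struct shm_file_model { unsigned long size; };
struct vma_model {
    unsigned long start, end;
    const struct shm_file_model *file;   /* segment backing this mapping */
};

/* Count the VMAs a detach at addr would unmap. The first VMA starting at
 * addr fixes the file; later VMAs inside the segment's range are unmapped
 * only if they are backed by that same file. */
size_t shmdt_model(const struct vma_model *vmas, size_t n, unsigned long addr)
{
    const struct shm_file_model *file = NULL;
    unsigned long range_end = 0;
    size_t unmapped = 0;

    for (size_t i = 0; i < n; i++) {
        const struct vma_model *v = &vmas[i];

        if (file == NULL) {
            if (v->start == addr && v->file != NULL) {
                file = v->file;               /* remember the first file */
                range_end = addr + file->size;
                unmapped++;
            }
            continue;
        }
        if (v->start >= range_end)
            break;                            /* past the segment's range */
        if (v->file == file)                  /* decline other segments */
            unmapped++;
    }
    return unmapped;
}
```

With a three-page segment A mapped at 0x1000 and a one-page segment B sitting in a hole at 0x2000, detaching at 0x1000 in this model touches only the two mappings that belong to A.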
     
  • SysV can be abused to allocate locked kernel memory. For most systems, a
    small limit doesn't make sense, see the discussion with regards to SHMMAX.

    Therefore: increase MSGMNI to the maximum supported.

    And: If we ignore the risk of locking too much memory, then an automatic
    scaling of MSGMNI doesn't make sense. Therefore the logic can be removed.

    The code preserves auto_msgmni to avoid breaking any user space applications
    that expect that the value exists.

    Notes:
    1) If an administrator must limit the memory allocations, then he can set
    MSGMNI as necessary.

    Or he can disable sysv entirely (as e.g. done by Android).

    2) MSGMAX and MSGMNB are intentionally not increased, as these values are used
    to control latency vs. throughput:
    If MSGMNB is large, then msgsnd() just returns and more messages can be queued
    before a task switch to a task that calls msgrcv() is forced.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • a)

    SysV can be abused to allocate locked kernel memory. For most systems, a
    small limit doesn't make sense, see the discussion with regards to SHMMAX.

    Therefore: Increase the sysv sem limits so that all known applications
    will work with these defaults.

    b)

    With regards to the maximum supported:
    Some of the specified hard limits are not correct anymore, therefore the
    patch updates the documentation.

    - SEMMNI must stay below IPCMNI, which is 32768.
    As for SHMMAX: Stay a bit below this limit.

    - SEMMSL was limited to 8k, to ensure that the kmalloc for the kernel array
    was limited to 16 kB (order=2)

    This doesn't apply anymore:
    - the allocation size isn't sizeof(short)*nsems anymore.
    - ipc_alloc falls back to vmalloc

    - SEMOPM should stay below 1000, to limit the kmalloc in semtimedop() to an
    order=1 allocation.
    Therefore: Leave it at 500 (order=0 allocation).

    Note:
    If an administrator must limit the memory allocations, then he can set the
    values as necessary.

    Or he can disable sysv entirely (as e.g. done by Android).

    Signed-off-by: Manfred Spraul
    Cc: Davidlohr Bueso
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
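The SEMOPM reasoning above is simple arithmetic on the size of the operations array. The sketch below assumes 4096-byte pages and a three-field, 6-byte operation record that mirrors the usual layout of struct sembuf; `struct sembuf_model`, `MODEL_PAGE_SIZE` and `alloc_order()` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL   /* assumption: 4 kB pages */

/* Assumed mirror of struct sembuf: three 16-bit fields. */
struct sembuf_model {
    unsigned short sem_num;
    short sem_op;
    short sem_flg;
};

/* Smallest order such that (MODEL_PAGE_SIZE << order) >= bytes. */
unsigned int alloc_order(size_t bytes)
{
    unsigned int order = 0;
    size_t span = MODEL_PAGE_SIZE;

    while (span < bytes) {
        span <<= 1;
        order++;
    }
    return order;
}
```

Under these assumptions, 500 operations need 3000 bytes and fit in a single page (order 0), while 1000 operations need 6000 bytes and require two pages (order 1).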
     
  • When I fixed bugs in the sem_lock() logic, I was more conservative than
    necessary. Therefore it is safe to replace the smp_mb() with smp_rmb().
    And: With smp_rmb(), semop() syscalls are up to 10% faster.

    The race we must protect against is:

    sem->lock is free
    sma->complex_count = 0
    sma->sem_perm.lock held by thread B

    thread A:

    A: spin_lock(&sem->lock)

    B: sma->complex_count++; (now 1)
    B: spin_unlock(&sma->sem_perm.lock);

    A: spin_is_locked(&sma->sem_perm.lock);
    A: XXXXX memory barrier
    A: if (sma->complex_count == 0)

    Thread A must read the increased complex_count value, i.e. the read must
    not be reordered with the read of sem_perm.lock done by spin_is_locked().

    Since it's about ordering of reads, smp_rmb() is sufficient.

    [akpm@linux-foundation.org: update sem_lock() comment, from Davidlohr]
    Signed-off-by: Manfred Spraul
    Reviewed-by: Davidlohr Bueso
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
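The read ordering that thread A needs can be sketched with C11 atomics in user space. This is not the kernel's sem_lock(): `struct sem_array_model` and `simple_op_fast_path_ok()` are hypothetical names, the spinlock is reduced to a flag, and an acquire fence stands in for smp_rmb() (it is somewhat stronger, since it also orders later stores).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the two fields thread A inspects. */
struct sem_array_model {
    atomic_bool perm_locked;    /* stands in for sma->sem_perm.lock */
    atomic_int complex_count;   /* stands in for sma->complex_count */
};

/* Thread A's check after it has taken the per-semaphore lock: the read of
 * complex_count must not be reordered before the read of the lock state. */
bool simple_op_fast_path_ok(struct sem_array_model *sma)
{
    if (atomic_load_explicit(&sma->perm_locked, memory_order_relaxed))
        return false;                           /* global lock is held */

    atomic_thread_fence(memory_order_acquire);  /* plays the smp_rmb() role */

    return atomic_load_explicit(&sma->complex_count,
                                memory_order_relaxed) == 0;
}
```

The fence sits exactly where the commit message marks "XXXXX memory barrier": between observing the lock as free and reading complex_count.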
     
  • The magic number of each compression format for the kernel image is
    defined by two bytes. These numbers are written in hexadecimal, except
    that the magic number for gunzip is written in octal. The formats should
    be consistent for readability. Therefore, the magic numbers for gunzip
    are now also defined in hexadecimal.

    Signed-off-by: Haesung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haesung Kim
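To see that the change is purely notational, the octal and hexadecimal spellings of the two gzip magic bytes can be compared directly. The helper `is_gzip_magic()` and the enum names below are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The same two bytes, spelled in octal and in hexadecimal. */
enum {
    GZIP_MAGIC0_OCT = 037,  GZIP_MAGIC1_OCT = 0213,
    GZIP_MAGIC0_HEX = 0x1f, GZIP_MAGIC1_HEX = 0x8b
};

/* Check whether a buffer starts with the two-byte gzip magic number. */
bool is_gzip_magic(const unsigned char *buf, size_t len)
{
    return len >= 2 &&
           buf[0] == GZIP_MAGIC0_HEX &&
           buf[1] == GZIP_MAGIC1_HEX;
}
```

Both spellings denote the byte values 31 and 139, so detection behaves identically with either notation.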
     
  • "origPtr" is used as an offset into the bd->dbuf[] array. That array is
    allocated in start_bunzip() and has "bd->dbufSize" number of elements so
    the test here should be >= instead of >.

    Later we check "origPtr" again before using it as an offset so I don't
    know if this bug can be triggered in real life.

    Fixes: bc22c17e12c1 ('bzip2/lzma: library support for gzip, bzip2 and lzma decompression')
    Signed-off-by: Dan Carpenter
    Cc: Alain Knaff
    Cc: Yinghai Lu
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
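The off-by-one can be illustrated with a stand-alone bounds check. This is a sketch, not the decompressor's code: `check_orig_ptr()` and `MODEL_DATA_ERROR` are hypothetical names chosen for the example.

```c
#include <assert.h>

#define MODEL_DATA_ERROR (-1)   /* hypothetical error code for this sketch */

/* Validate an index into an array that holds dbufSize elements. */
int check_orig_ptr(unsigned int origPtr, unsigned int dbufSize)
{
    /* Valid indices are 0 .. dbufSize - 1, so an index equal to
     * dbufSize must be rejected too; a plain '>' would let it through. */
    if (origPtr >= dbufSize)
        return MODEL_DATA_ERROR;
    return 0;
}
```

With '>' instead of '>=', the boundary value origPtr == dbufSize would pass the test and point one element past the end of the array.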
     
  • The kernel has support for (nearly) every compression algorithm known to
    man, each to handle some particular microscopic niche.

    Unfortunately all of these always get compiled in if you want to support
    INITRDs, and can only be disabled when CONFIG_EXPERT is set.

    I don't see why I need to set EXPERT just to properly configure the initrd
    compression algorithms, and not always include every possible algorithm.

    Usually the initrd is just compressed with gzip anyways, at least that's
    true on all distributions I use.

    Remove the dependencies for initrd compression on CONFIG_EXPERT.

    Make the various options just default y, which should be good enough to
    not break any previous configuration.

    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Current debug levels are not optimal. Especially if one wants to provoke
    a large number of faults (broken device simulator), then any verbose level
    will produce a huge number of identical log messages. Let's add a
    ratelimit parameter for that purpose.

    Signed-off-by: Dmitry Monakhov
    Acked-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Monakhov
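The idea of rate-limiting identical messages can be sketched as an interval-and-burst counter. This is a minimal user-space model and not the patch's implementation: `struct ratelimit_model` and `ratelimit_ok()` are hypothetical names, and time is passed in explicitly as an abstract tick count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical limiter: at most 'burst' messages per 'interval' ticks. */
struct ratelimit_model {
    long interval;   /* window length in ticks; <= 0 disables limiting */
    int burst;       /* messages allowed per window */
    long begin;      /* start of the current window */
    int printed;     /* messages emitted in the current window */
    int missed;      /* messages suppressed so far */
};

/* Return true if a message may be logged at time 'now'. */
bool ratelimit_ok(struct ratelimit_model *rs, long now)
{
    if (rs->interval <= 0)
        return true;

    if (now - rs->begin >= rs->interval) {   /* start a new window */
        rs->begin = now;
        rs->printed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return true;
    }
    rs->missed++;                            /* suppress and count */
    return false;
}
```

A fault-injection run that triggers thousands of identical failures would then emit only the first few messages in each window while still counting how many were dropped.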