13 Jan, 2012

7 commits

  • While implementing cmpxchg_double() on s390 I realized that we don't set
    CONFIG_CMPXCHG_LOCAL despite the fact that we have support for it.

    However setting that option will increase the size of struct page by
    eight bytes on 64 bit, which we certainly do not want. Also, it doesn't
    make sense that a present cpu feature should increase the size of struct
    page.

    Besides that, it looks like the dependency on CMPXCHG_LOCAL is wrong and
    that it should depend on CMPXCHG_DOUBLE instead.

    This patch:

    If an architecture supports CMPXCHG_LOCAL this shouldn't result
    automatically in larger struct pages if the SLUB allocator is used.
    Instead introduce a new config option "HAVE_ALIGNED_STRUCT_PAGE" which
    can be selected if a double word aligned struct page is required. Also
    update x86 Kconfig so that it should work as before.
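
    The size effect described above can be illustrated in userspace. This is
    a sketch, not kernel code: page_plain and page_aligned are hypothetical
    seven-word stand-ins for struct page, and the double word alignment is
    modelled with C11 alignas.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: seven machine words. */
struct page_plain {
    unsigned long words[7];
};

/* Same payload, but forced to double word alignment. */
struct page_aligned {
    alignas(2 * sizeof(unsigned long)) unsigned long words[7];
};

static_assert(alignof(struct page_aligned) == 2 * sizeof(unsigned long),
              "page_aligned must be double word aligned");

/* Bytes of padding that the alignment requirement adds per object. */
size_t alignment_cost(void)
{
    return sizeof(struct page_aligned) - sizeof(struct page_plain);
}
```

    With an odd number of words the aligned variant is padded by one word,
    which on a typical 64 bit build is the eight byte increase mentioned
    above.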

    Signed-off-by: Heiko Carstens
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • The uses have been renamed so delete the unused macro.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Use the more commonly used __noreturn instead of ATTRIB_NORETURN.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Joe Perches
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Tony Luck
    Cc: Fenghua Yu
    Acked-by: Geert Uytterhoeven
    Acked-by: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • It's a very old and now unused prototype marking so just delete it.

    Neaten panic pointer argument style to keep checkpatch quiet.

    Signed-off-by: Joe Perches
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Tony Luck
    Cc: Fenghua Yu
    Acked-by: Geert Uytterhoeven
    Acked-by: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • The only use in kernel.h is gone so remove the macro.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Use __printf macro.
    Convert NORET_AND to ATTRIB_NORET.
    Use the normal kernel style for pointer arguments.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (526 commits)
    ASoC: twl6040 - Add method to query optimum PDM_DL1 gain
    ALSA: hda - Fix the lost power-setup of seconary pins after PM resume
    ALSA: usb-audio: add Yamaha MOX6/MOX8 support
    ALSA: virtuoso: add S/PDIF input support for all Xonars
    ALSA: ice1724 - Support for ooAoo SQ210a
    ALSA: ice1724 - Allow card info based on model only
    ALSA: ice1724 - Create capture pcm only for ADC-enabled configurations
    ALSA: hdspm - Provide unique driver id based on card serial
    ASoC: Dynamically allocate the rtd device for a non-empty release()
    ASoC: Fix recursive dependency due to select ATMEL_SSC in SND_ATMEL_SOC_SSC
    ALSA: hda - Fix the detection of "Loopback Mixing" control for VIA codecs
    ALSA: hda - Return the error from get_wcaps_type() for invalid NIDs
    ALSA: hda - Use auto-parser for HP laptops with cx20459 codec
    ALSA: asihpi - Fix potential Oops in snd_asihpi_cmode_info()
    ALSA: hdsp - Fix potential Oops in snd_hdsp_info_pref_sync_ref()
    ALSA: hda/cirrus - support for iMac12,2 model
    ASoC: cx20442: add bias control over a platform provided regulator
    ALSA: usb-audio - Avoid flood of frame-active debug messages
    ALSA: snd-usb-us122l: Delete calls to preempt_disable
    mfd: Put WM8994 into cache only mode when suspending
    ...

    Fix up trivial conflicts in:
    - arch/arm/mach-s3c64xx/mach-crag6410.c:
    renamed speyside_wm8962 to tobermory, added littlemill right
    next to it
    - drivers/base/regmap/{regcache.c,regmap.c}:
    duplicate diff that had already come in with other changes in
    the regmap tree

    Linus Torvalds
     

12 Jan, 2012

10 commits

  • Takashi Iwai
     
  • Takashi Iwai
     
  • SH/R-Mobile updates for 3.3 merge window.

    * tag 'rmobile-for-linus' of git://github.com/pmundt/linux-sh: (32 commits)
    arm: mach-shmobile: add a resource name for shdma
    ARM: mach-shmobile: r8a7779 SMP support V3
    ARM: mach-shmobile: Add kota2 defconfig.
    ARM: mach-shmobile: Add marzen defconfig.
    ARM: mach-shmobile: r8a7779 power domain support V2
    ARM: mach-shmobile: Fix up marzen build for recent GIC changes.
    ARM: mach-shmobile: r8a7779 PFC function support
    ARM: mach-shmobile: Flush caches in platform_cpu_die()
    ARM: mach-shmobile: Allow SoC specific CPU kill code
    ARM: mach-shmobile: Fix headsmp.S code to use CPUINIT
    ARM: mach-shmobile: clock-r8a7779: clkz/clkzs support
    ARM: mach-shmobile: clock-r8a7779: add DIV4 clock support
    ARM: mach-shmobile: Marzen LAN89218 support
    ARM: mach-shmobile: Marzen SCIF2/SCIF4 support
    ARM: mach-shmobile: r8a7779 PFC GPIO-only support V2
    ARM: mach-shmobile: r8a7779 and Marzen base support V2
    sh: pfc: Unlock register support
    sh: pfc: Variable bitfield width config register support
    sh: pfc: Add config_reg_helper() function
    sh: pfc: Convert index to field and value pair
    ...

    Linus Torvalds
     
  • SuperH updates for 3.3 merge window.

    * tag 'sh-for-linus' of git://github.com/pmundt/linux-sh: (38 commits)
    sh: magicpanelr2: Update for parse_mtd_partitions() fallout.
    sh: mach-rsk: Update for parse_mtd_partitions() fallout.
    sh: sh2a: Improve cache flush/invalidate functions
    sh: also without PM_RUNTIME pm_runtime.o must be built
    sh: add a resource name for shdma
    sh: Remove redundant try_to_freeze() invocations.
    sh: Ensure IRQs are enabled across do_notify_resume().
    sh: Fix up store queue code for subsys_interface changes.
    sh: clkfwk: sh_clk_init_parent() should be called after clk_register()
    sh: add platform_device for renesas_usbhs in board-sh7757lcr
    sh: modify clock-sh7757 for renesas_usbhs
    sh: pfc: ioremap() support
    sh: use ioread32/iowrite32 and mapped_reg for div6
    sh: use ioread32/iowrite32 and mapped_reg for div4
    sh: use ioread32/iowrite32 and mapped_reg for mstp32
    sh: extend clock struct with mapped_reg member
    sh: clkfwk: clock-sh73a0: all div6_clks use SH_CLK_DIV6_EXT()
    sh: clkfwk: clock-sh7724: all div6_clks use SH_CLK_DIV6_EXT()
    sh: clock-sh7723: add CLKDEV_ICK_ID for cleanup
    serial: sh-sci: Handle GPIO function requests.
    ...

    Linus Torvalds
     
  • Paul Mundt
     
  • * 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, reboot: Fix typo in nmi reboot path
    x86, NMI: Add to_cpumask() to silence compile warning
    x86, NMI: NMI selftest depends on the local apic
    x86: Add stack top margin for stack overflow checking
    x86, NMI: NMI-selftest should handle the UP case properly
    x86: Fix the 32-bit stackoverflow-debug build
    x86, NMI: Add knob to disable using NMI IPIs to stop cpus
    x86, NMI: Add NMI IPI selftest
    x86, reboot: Use NMI instead of REBOOT_VECTOR to stop cpus
    x86: Clean up the range of stack overflow checking
    x86: Panic on detection of stack overflow
    x86: Check stack overflow in detail

    Linus Torvalds
     
  • * 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, efi: Break up large initrd reads
    x86, efi: EFI boot stub support
    efi: Add EFI file I/O data types
    efi.h: Add boottime->locate_handle search types
    efi.h: Add graphics protocol guids
    efi.h: Add allocation types for boottime->allocate_pages()
    efi.h: Add efi_image_loaded_t
    efi.h: Add struct definition for boot time services
    x86: Don't use magic strings for EFI loader signature
    x86: Add missing bzImage fields to struct setup_header

    Linus Torvalds
     
  • * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/numa: Add constraints check for nid parameters
    mm, x86: Remove debug_pagealloc_enabled
    x86/mm: Initialize high mem before free_all_bootmem()
    arch/x86/kernel/e820.c: quiet sparse noise about plain integer as NULL pointer
    arch/x86/kernel/e820.c: Eliminate bubble sort from sanitize_e820_map()
    x86: Fix mmap random address range
    x86, mm: Unify zone_sizes_init()
    x86, mm: Prepare zone_sizes_init() for unification
    x86, mm: Use max_low_pfn for ZONE_NORMAL on 64-bit
    x86, mm: Wrap ZONE_DMA32 with CONFIG_ZONE_DMA32
    x86, mm: Use max_pfn instead of highend_pfn
    x86, mm: Move zone init from paging_init() on 64-bit
    x86, mm: Use MAX_DMA_PFN for ZONE_DMA on 32-bit

    Linus Torvalds
     
  • * git://git.infradead.org/battery-2.6: (68 commits)
    power_supply: Mark da9052 driver as broken
    power_supply: Drop usage of nowarn variant of sysfs_create_link()
    s3c_adc_battery: Average over more than one adc sample
    power_supply: Add DA9052 battery driver
    isp1704_charger: Fix missing check
    jz4740-battery: Fix signedness bug
    power_supply: Assume mains power by default
    sbs-battery: Fix devicetree match table
    ARM: rx51: Add bq27200 i2c board info
    sbs-battery: Change power supply name
    devicetree-bindings: Propagate bq20z75->sbs rename to dt bindings
    devicetree-bindings: Add vendor entry for Smart Battery Systems
    sbs-battery: Rename internals to new name
    bq20z75: Rename to sbs-battery
    wm97xx_battery: Use DEFINE_MUTEX() for work_lock
    max8997_charger: Remove duplicate module.h
    lp8727_charger: Some minor fixes for the header
    lp8727_charger: Add header file
    power_supply: Convert drivers/power/* to use module_platform_driver()
    power_supply: Add "unknown" in power supply type
    ...

    Linus Torvalds
     
  • * 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci: (80 commits)
    x86/PCI: Expand the x86_msi_ops to have a restore MSIs.
    PCI: Increase resource array mask bit size in pcim_iomap_regions()
    PCI: DEVICE_COUNT_RESOURCE should be equal to PCI_NUM_RESOURCES
    PCI: pci_ids: add device ids for STA2X11 device (aka ConneXT)
    PNP: work around Dell 1536/1546 BIOS MMCONFIG bug that breaks USB
    x86/PCI: amd: factor out MMCONFIG discovery
    PCI: Enable ATS at the device state restore
    PCI: msi: fix imbalanced refcount of msi irq sysfs objects
    PCI: kconfig: English typo in pci/pcie/Kconfig
    PCI/PM/Runtime: make PCI traces quieter
    PCI: remove pci_create_bus()
    xtensa/PCI: convert to pci_scan_root_bus() for correct root bus resources
    x86/PCI: convert to pci_create_root_bus() and pci_scan_root_bus()
    x86/PCI: use pci_scan_bus() instead of pci_scan_bus_parented()
    x86/PCI: read Broadcom CNB20LE host bridge info before PCI scan
    sparc32, leon/PCI: convert to pci_scan_root_bus() for correct root bus resources
    sparc/PCI: convert to pci_create_root_bus()
    sh/PCI: convert to pci_scan_root_bus() for correct root bus resources
    powerpc/PCI: convert to pci_create_root_bus()
    powerpc/PCI: split PHB part out of pcibios_map_io_space()
    ...

    Fix up conflicts in drivers/pci/msi.c and include/linux/pci_regs.h due
    to the same patches being applied in other branches.

    Linus Torvalds
     

11 Jan, 2012

23 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (54 commits)
    crypto: gf128mul - remove leftover "(EXPERIMENTAL)" in Kconfig
    crypto: serpent-sse2 - remove unneeded LRW/XTS #ifdefs
    crypto: serpent-sse2 - select LRW and XTS
    crypto: twofish-x86_64-3way - remove unneeded LRW/XTS #ifdefs
    crypto: twofish-x86_64-3way - select LRW and XTS
    crypto: xts - remove dependency on EXPERIMENTAL
    crypto: lrw - remove dependency on EXPERIMENTAL
    crypto: picoxcell - fix boolean and / or confusion
    crypto: caam - remove DECO access initialization code
    crypto: caam - fix polarity of "propagate error" logic
    crypto: caam - more desc.h cleanups
    crypto: caam - desc.h - convert spaces to tabs
    crypto: talitos - convert talitos_error to struct device
    crypto: talitos - remove NO_IRQ references
    crypto: talitos - fix bad kfree
    crypto: convert drivers/crypto/* to use module_platform_driver()
    char: hw_random: convert drivers/char/hw_random/* to use module_platform_driver()
    crypto: serpent-sse2 - should select CRYPTO_CRYPTD
    crypto: serpent - rename serpent.c to serpent_generic.c
    crypto: serpent - cleanup checkpatch errors and warnings
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://selinuxproject.org/~jmorris/linux-security: (32 commits)
    ima: fix invalid memory reference
    ima: free duplicate measurement memory
    security: update security_file_mmap() docs
    selinux: Casting (void *) value returned by kmalloc is useless
    apparmor: fix module parameter handling
    Security: tomoyo: add .gitignore file
    tomoyo: add missing rcu_dereference()
    apparmor: add missing rcu_dereference()
    evm: prevent racing during tfm allocation
    evm: key must be set once during initialization
    mpi/mpi-mpow: NULL dereference on allocation failure
    digsig: build dependency fix
    KEYS: Give key types their own lockdep class for key->sem
    TPM: fix transmit_cmd error logic
    TPM: NSC and TIS drivers X86 dependency fix
    TPM: Export wait_for_stat for other vendor specific drivers
    TPM: Use vendor specific function for status probe
    tpm_tis: add delay after aborting command
    tpm_tis: Check return code from getting timeouts/durations
    tpm: Introduce function to poll for result of self test
    ...

    Fix up trivial conflict in lib/Makefile due to addition of CONFIG_MPI
    and SIGSIG next to CONFIG_DQL addition.

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    autofs4: deal with autofs4_write/autofs4_write races
    autofs4: catatonic_mode vs. notify_daemon race
    autofs4: autofs4_wait() vs. autofs4_catatonic_mode() race
    hfsplus: creation of hidden dir on mount can fail
    block_dev: Suppress bdev_cache_init() kmemleak warninig
    fix shrink_dcache_parent() livelock
    coda: switch coda_cnode_make() to sane API as well, clean coda_lookup()
    coda: deal correctly with allocation failure from coda_cnode_makectl()
    securityfs: fix object creation races

    Linus Torvalds
     
  • lib: use generic pci_iomap on all architectures

    Many architectures don't want to pull in iomap.c,
    so they ended up duplicating pci_iomap from that file.
    That function isn't trivial, and we are going to modify it
    https://lkml.org/lkml/2011/11/14/183
    so the duplication hurts.

    This reduces the scope of the problem significantly,
    by moving pci_iomap to a separate file and
    referencing that from all architectures.

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    alpha: drop pci_iomap/pci_iounmap from pci-noop.c
    mn10300: switch to GENERIC_PCI_IOMAP
    mn10300: add missing __iomap markers
    frv: switch to GENERIC_PCI_IOMAP
    tile: switch to GENERIC_PCI_IOMAP
    tile: don't panic on iomap
    sparc: switch to GENERIC_PCI_IOMAP
    sh: switch to GENERIC_PCI_IOMAP
    powerpc: switch to GENERIC_PCI_IOMAP
    parisc: switch to GENERIC_PCI_IOMAP
    mips: switch to GENERIC_PCI_IOMAP
    microblaze: switch to GENERIC_PCI_IOMAP
    arm: switch to GENERIC_PCI_IOMAP
    alpha: switch to GENERIC_PCI_IOMAP
    lib: add GENERIC_PCI_IOMAP
    lib: move GENERIC_IOMAP to lib/Kconfig

    Fix up trivial conflicts due to changes nearby in arch/{m68k,score}/Kconfig

    Linus Torvalds
     
  • * tag 'for-linux-3.3-merge-window' of git://linux-c6x.org/git/projects/linux-c6x-upstreaming: (29 commits)
    C6X: replace tick_nohz_stop/restart_sched_tick calls
    C6X: add register_cpu call
    C6X: deal with memblock API changes
    C6X: fix timer64 initialization
    C6X: fix layout of EMIFA registers
    C6X: MAINTAINERS
    C6X: DSCR - Device State Configuration Registers
    C6X: EMIF - External Memory Interface
    C6X: general SoC support
    C6X: library code
    C6X: headers
    C6X: ptrace support
    C6X: loadable module support
    C6X: cache control
    C6X: clocks
    C6X: build infrastructure
    C6X: syscalls
    C6X: interrupt handling
    C6X: time management
    C6X: signal management
    ...

    Linus Torvalds
     
  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
    writeback: balanced_rate cannot exceed write bandwidth
    writeback: do strict bdi dirty_exceeded
    writeback: avoid tiny dirty poll intervals
    writeback: max, min and target dirty pause time
    writeback: dirty ratelimit - think time compensation
    btrfs: fix dirtied pages accounting on sub-page writes
    writeback: fix dirtied pages accounting on redirty
    writeback: fix dirtied pages accounting on sub-page writes
    writeback: charge leaked page dirties to active tasks
    writeback: Include all dirty inodes in background writeback

    Linus Torvalds
     
  • Andrew elucidates:
    - First installment of MM. We have a HUGE number of MM patches this
    time. It's crazy.
    - MAINTAINERS updates
    - backlight updates
    - leds
    - checkpatch updates
    - misc ELF stuff
    - rtc updates
    - reiserfs
    - procfs
    - some misc other bits

    * akpm: (124 commits)
    user namespace: make signal.c respect user namespaces
    workqueue: make alloc_workqueue() take printf fmt and args for name
    procfs: add hidepid= and gid= mount options
    procfs: parse mount options
    procfs: introduce the /proc/<pid>/map_files/ directory
    procfs: make proc_get_link to use dentry instead of inode
    signal: add block_sigmask() for adding sigmask to current->blocked
    sparc: make SA_NOMASK a synonym of SA_NODEFER
    reiserfs: don't lock root inode searching
    reiserfs: don't lock journal_init()
    reiserfs: delay reiserfs lock until journal initialization
    reiserfs: delete comments referring to the BKL
    drivers/rtc/interface.c: fix alarm rollover when day or month is out-of-range
    drivers/rtc/rtc-twl.c: add DT support for RTC inside twl4030/twl6030
    drivers/rtc/: remove redundant spi driver bus initialization
    drivers/rtc/rtc-jz4740.c: make jz4740_rtc_driver static
    drivers/rtc/rtc-mc13xxx.c: make mc13xxx_rtc_idtable static
    rtc: convert drivers/rtc/* to use module_platform_driver()
    drivers/rtc/rtc-wm831x.c: convert to devm_kzalloc()
    drivers/rtc/rtc-wm831x.c: remove unused period IRQ handler
    ...

    Linus Torvalds
     
  • alloc_workqueue() currently expects the passed in @name pointer to remain
    accessible. This is inconvenient and a bit silly given that the whole wq
    is being dynamically allocated. This patch updates alloc_workqueue() and
    friends to take printf format string instead of opaque string and matching
    varargs at the end. The name is allocated together with the wq and
    formatted.

    alloc_ordered_workqueue() is converted to a macro to unify varargs
    handling with alloc_workqueue(), and, while at it, add comment to
    alloc_workqueue().

    None of the current in-kernel users pass in string with '%' as constant
    name and this change shouldn't cause any problem.
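
    The idea of allocating the formatted name together with the object can
    be sketched in userspace C. Everything here is an assumption made for
    illustration: struct wq and wq_alloc() are hypothetical names, not the
    kernel's workqueue API, and malloc() stands in for the kernel allocator.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's workqueue structure. */
struct wq {
    int max_active;
    char name[];    /* formatted name lives in the same allocation */
};

/* Allocate a wq and its printf-formatted name in one block. */
__attribute__((format(printf, 2, 3)))
struct wq *wq_alloc(int max_active, const char *fmt, ...)
{
    va_list args, args2;
    struct wq *wq;
    int len;

    assert(fmt != NULL);

    /* First pass: measure the formatted name. */
    va_start(args, fmt);
    va_copy(args2, args);
    len = vsnprintf(NULL, 0, fmt, args);
    va_end(args);
    if (len < 0) {
        va_end(args2);
        return NULL;
    }

    wq = malloc(sizeof(*wq) + (size_t)len + 1);
    if (wq) {
        wq->max_active = max_active;
        /* Second pass: format into the tail of the allocation. */
        vsnprintf(wq->name, (size_t)len + 1, fmt, args2);
    }
    va_end(args2);
    return wq;
}
```

    A caller could then write wq_alloc(1, "%s-%d", prefix, id) and reuse or
    free the prefix buffer immediately, since the object keeps its own copy
    of the name.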

    [akpm@linux-foundation.org: use __printf]
    Signed-off-by: Tejun Heo
    Suggested-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Add support for mount options to restrict access to /proc/PID/
    directories. The default backward-compatible "relaxed" behaviour is left
    untouched.

    The first mount option is called "hidepid" and its value defines how much
    info about processes we want to be available for non-owners:

    hidepid=0 (default) means the old behavior - anybody may read all
    world-readable /proc/PID/* files.

    hidepid=1 means users may not access any /proc/<pid>/ directories but
    their own. Sensitive files like cmdline, sched*, status are now protected
    against other users. As permission checking is done in proc_pid_permission()
    and the files' permissions are left untouched, programs expecting specific
    file modes are not confused.

    hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
    users. It doesn't mean that it hides whether a process exists (it can be
    learned by other means, e.g. by kill -0 $PID), but it hides the process's
    euid and egid. It complicates an intruder's task of gathering info about
    running processes: whether some daemon runs with elevated privileges,
    whether another user runs some sensitive program, whether other users run
    any program at all, etc.

    gid=XXX defines a group that will be able to gather all processes' info
    (as in hidepid=0 mode). This group should be used instead of putting
    nonroot user in sudoers file or something. However, untrusted users (like
    daemons, etc.) which are not supposed to monitor the tasks in the whole
    system should not be added to the group.

    hidepid=1 or higher is designed to restrict access to procfs files, which
    might reveal some sensitive private information like precise keystrokes
    timings:

    http://www.openwall.com/lists/oss-security/2011/11/05/3

    hidepid=1/2 doesn't break monitoring userspace tools. ps, top, pgrep, and
    conky gracefully handle EPERM/ENOENT and behave as if the current user is
    the only user running processes. pstree shows the process subtree which
    contains "pstree" process.

    Note: the patch doesn't deal with setuid/setgid issues of keeping
    preopened descriptors of procfs files (like
    https://lkml.org/lkml/2011/2/7/368). We rely on the fact that the leaked
    information, like the scheduling counters of setuid apps, doesn't threaten
    anybody's privacy - only the user who started the setuid program may read
    the counters.
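
    As a rough illustration of how the two options combine, the sketch
    below builds the option string and passes it to mount(2) on a remount.
    The helper names proc_opts() and remount_proc() are hypothetical, and
    the remount itself needs root privileges, so only the string building
    is safe to try as an ordinary user.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/types.h>

/* Build e.g. "hidepid=2,gid=1000". Returns 0 on success, -1 on error. */
int proc_opts(char *buf, size_t len, int hidepid, gid_t gid)
{
    int n;

    assert(buf != NULL);
    if (hidepid < 0 || hidepid > 2)
        return -1;  /* only the values 0, 1 and 2 are described above */
    n = snprintf(buf, len, "hidepid=%d,gid=%lu", hidepid, (unsigned long)gid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Remount /proc with the given options; expected to fail without root. */
int remount_proc(int hidepid, gid_t gid)
{
    char opts[64];

    if (proc_opts(opts, sizeof(opts), hidepid, gid))
        return -1;
    return mount("proc", "/proc", "proc", MS_REMOUNT, opts);
}
```

    For example, remount_proc(2, 1000) would hide other users' processes
    from everyone except root and the members of group 1000 (an arbitrary
    example gid).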

    Signed-off-by: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Randy Dunlap
    Cc: "H. Peter Anvin"
    Cc: Greg KH
    Cc: Theodore Tso
    Cc: Alan Cox
    Cc: James Morris
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • This one behaves similarly to the /proc/<pid>/fd/ one - it contains
    symlinks, one for each mapping with a file. The name of a symlink is
    "vma->vm_start-vma->vm_end" and the target is the file. Opening a symlink
    results in a file that points to exactly the same inode as the vma's.

    For example, the ls -l of some arbitrary /proc/<pid>/map_files/:

    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so

    This *helps* the checkpointing process in three ways:

    1. When dumping a task's mappings we know the exact file that is mapped
    by a particular region. We do this by opening the
    /proc/$pid/map_files/$address symlink the way we do with file
    descriptors.

    2. This also helps in determining which anonymous shared mappings are
    shared with each other by comparing their inodes.

    3. When restoring a set of processes, if two of them share a mapping,
    we map the memory in the 1st one and then open its
    /proc/$pid/map_files/$address file and map it in the 2nd task.

    Using /proc/$pid/maps for this is quite inconvenient since it requires
    repeatedly re-reading and reparsing this text file, which slows down
    the restore procedure significantly. Also, as pointed out in (3), it is
    much easier to use the top-level shared mapping in children as
    /proc/$pid/map_files/$address when needed.
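
    A userspace tool consuming this directory mostly needs to convert
    between entry names and address ranges. The following is a minimal
    sketch of that conversion; the function names are hypothetical, and it
    assumes the "start-end" hexadecimal naming shown in the listing above.

```c
#include <assert.h>
#include <stdio.h>

/* Parse "start-end" (hex, as listed under map_files). Returns 0 on success. */
int parse_map_files_name(const char *name, unsigned long *start,
                         unsigned long *end)
{
    int consumed = 0;

    assert(name && start && end);
    if (sscanf(name, "%lx-%lx%n", start, end, &consumed) != 2)
        return -1;
    if (name[consumed] != '\0' || *start >= *end)
        return -1;  /* trailing junk or empty range */
    return 0;
}

/* Build the symlink path for one mapping; returns its length or -1. */
int map_files_path(char *buf, size_t len, int pid,
                   unsigned long start, unsigned long end)
{
    int n = snprintf(buf, len, "/proc/%d/map_files/%lx-%lx", pid, start, end);

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

    The resulting path can be passed to readlink() or open(), provided the
    kernel was built with map_files support and the caller has sufficient
    privileges.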

    [akpm@linux-foundation.org: coding-style fixes]
    [gorcunov@openvz.org: make map_files depend on CHECKPOINT_RESTORE]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Vasiliy Kulikov
    Reviewed-by: "Kirill A. Shutemov"
    Cc: Tejun Heo
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Prepare the ground for the next "map_files" patch which needs a name of a
    link file to analyse.

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Tejun Heo
    Cc: Vasiliy Kulikov
    Cc: "Kirill A. Shutemov"
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Abstract the code sequence for adding a signal handler's sa_mask to
    current->blocked because the sequence is identical for all architectures.
    Furthermore, in the past some architectures actually got this code wrong,
    so introduce a wrapper that all architectures can use.

    Signed-off-by: Matt Fleming
    Signed-off-by: Oleg Nesterov
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Tejun Heo
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Fleming
     
  • TI's TCA6507 is the LED driver in the GTA04 Openmoko motherboard. The
    driver provides full support for brightness levels and hardware blinking.

    This driver can drive each of 7 outputs as an LED or a GPIO output,
    and provides hardware-assist blinking.

    [akpm@linux-foundation.org: fix __mod_i2c_device_table alias]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: NeilBrown
    Cc: Richard Purdie
    Cc: Randy Dunlap
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • mpol_equal() logically returns a boolean. Use a bool type to slightly
    improve readability.

    Signed-off-by: KOSAKI Motohiro
    Cc: Stephen Wilson
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • oom_score_adj is used for guarding processes from the OOM-Killer. One
    problem is that it's inherited at fork(). When a daemon sets oom_score_adj
    and creates children, it's hard to know where the value was set.

    This patch adds three tracepoints useful for debugging:
    - creating new task
    - renaming a task (exec)
    - set oom_score_adj

    To debug, users need to enable some tracepoints. Filtering may be useful,
    for example:

    # EVENT=/sys/kernel/debug/tracing/events/task/
    # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
    # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
    # echo 1 > $EVENT/enable
    # EVENT=/sys/kernel/debug/tracing/events/oom/
    # echo 1 > $EVENT/enable

    output will be like this.
    # grep oom /sys/kernel/debug/tracing/trace
    bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
    bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
    ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
    bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
    grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • migrate was doing an rmap_walk with speculative lock-less access on
    pagetables. That could lead it to not serializing properly against mremap
    PT locks. But a second problem remains in the order of vmas in the
    same_anon_vma list used by the rmap_walk.

    If vma_merge succeeds in copy_vma, the src vma could be placed after the
    dst vma in the same_anon_vma list. That could still lead to migrate
    missing some pte.

    This patch adds an anon_vma_moveto_tail() function to force the dst vma at
    the end of the list before mremap starts to solve the problem.

    If the mremap is very large and there are a lot of parents or children
    sharing the anon_vma root lock, this should still scale better than taking
    the anon_vma root lock around every pte copy practically for the whole
    duration of mremap.

    Update: Hugh noticed special care is needed in the error path where
    move_page_tables goes in the reverse direction; a second
    anon_vma_moveto_tail() call is needed in the error path.

    This program exercises the anon_vma_moveto_tail:

    ===

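    /* Preprocessor lines are missing from this listing; the ones below are assumed. */
    #define _GNU_SOURCE                 /* for mremap() and MREMAP_* */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/time.h>

    /* SIZE is not given in this listing; 5 MB is an assumed value. */
    #define SIZE (5*1024*1024)

    int main()
    {
        static struct timeval oldstamp, newstamp;
        long diffsec;
        char *p, *p2, *p3, *p4;

        /* Three 2 MB-aligned regions: source, reference copy, scratch target. */
        if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);

        memset(p, 0xff, SIZE);
        printf("%p\n", p);
        memset(p2, 0xff, SIZE);
        memset(p3, 0x77, 4096);
        if (memcmp(p, p2, SIZE))
            printf("error\n");

        /* Move the second half of p onto p3, then move it back. */
        p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
        if (p4 != p3)
            perror("mremap"), exit(1);
        p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
        if (p4 != p+SIZE/2)
            perror("mremap"), exit(1);

        /* The contents must have survived both moves. */
        if (memcmp(p, p2, SIZE))
            printf("error\n");
        printf("ok\n");

        return 0;
    }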
    ===

    $ perf probe -a anon_vma_moveto_tail
    Add new event:
    probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

    You can now use it on all perf tools, such as:

    perf record -e probe:anon_vma_moveto_tail -aR sleep 1

    $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
    0x7f2ca2800000
    ok
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
    $ perf report --stdio
    100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

    Signed-off-by: Andrea Arcangeli
    Reported-by: Nai Xia
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Pawel Sikora
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The maximum number of dirty pages that exist in the system at any time is
    determined by a number of pages considered dirtyable and a user-configured
    percentage of those, or an absolute number in bytes.

    This number of dirtyable pages is the sum of memory provided by all the
    zones in the system minus their lowmem reserves and high watermarks, so
    that the system can retain a healthy number of free pages without having
    to reclaim dirty pages.

    But there is a flaw in that we have a zoned page allocator which does not
    care about the global state but rather the state of individual memory
    zones. And right now there is nothing that prevents one zone from filling
    up with dirty pages while other zones are spared, which frequently leads
    to situations where kswapd, in order to restore the watermark of free
    pages, does indeed have to write pages from that zone's LRU list. This
    can interfere so badly with IO from the flusher threads that major
    filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
    already, taking away the VM's only possibility to keep such a zone
    balanced, aside from hoping the flushers will soon clean pages from that
    zone.

    Enter per-zone dirty limits. They are to a zone's dirtyable memory what
    the global limit is to the global amount of dirtyable memory, and try to
    make sure that no single zone receives more than its fair share of the
    globally allowed dirty pages in the first place. As the number of pages
    considered dirtyable excludes the zones' lowmem reserves and high
    watermarks, the maximum number of dirty pages in a zone is such that the
    zone can always be balanced without requiring page cleaning.

    As this is a placement decision in the page allocator and pages are
    dirtied only after the allocation, this patch allows allocators to pass
    __GFP_WRITE when they know in advance that the page will be written to and
    become dirty soon. The page allocator will then attempt to allocate from
    the first zone of the zonelist - which on NUMA is determined by the task's
    NUMA memory policy - that has not exceeded its dirty limit.

    At first glance, it would appear that the diversion to lower zones can
    increase pressure on them, but this is not the case. With a full high
    zone, allocations will be diverted to lower zones eventually, so it is
    more of a shift in timing of the lower zone allocations. Workloads that
    previously could fit their dirty pages completely in the higher zone may
    be forced to allocate from lower zones, but the number of pages that
    "spill over" is itself limited by the lower zones' dirty constraints,
    and thus unlikely to become a problem.

    For now, the problem of unfair dirty page distribution remains for NUMA
    configurations where the zones allowed for allocation are in sum not big
    enough to trigger the global dirty limits, wake up the flusher threads and
    remedy the situation. Because of this, an allocation that could not
    succeed on any of the considered zones is allowed to ignore the dirty
    limits before going into direct reclaim or even failing the allocation,
    until a future patch changes the global dirty throttling and flusher
    thread activation so that they take individual zone states into account.
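
    The placement rule described above can be sketched in userspace C. This
    is only an illustration under stated assumptions: struct zone_model,
    zone_dirty_ok() as written here and pick_zone() are hypothetical names
    for this sketch, not the kernel's data structures or the patch itself.
    It models a zonelist walk that skips zones over their dirty limit for
    write allocations and, if no zone qualifies, retries without the limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone model; not the kernel's struct zone. */
struct zone_model {
    const char *name;
    unsigned long dirtyable; /* pages left after reserves and watermarks */
    unsigned long dirty;     /* pages currently dirty */
    unsigned long free;      /* pages available for allocation */
};

/* The zone limit applies the global percentage to the zone's own pages. */
static unsigned long zone_dirty_limit(const struct zone_model *z,
                                      unsigned int ratio)
{
    return z->dirtyable * ratio / 100;
}

static bool zone_dirty_ok(const struct zone_model *z, unsigned int ratio)
{
    return z->dirty <= zone_dirty_limit(z, ratio);
}

/*
 * Walk the zonelist in preference order. Pass 0 honours the dirty limits
 * for allocations that announce a write; pass 1 ignores them as a last
 * resort before the caller would have to reclaim or fail.
 */
struct zone_model *pick_zone(struct zone_model *zl, size_t n,
                             bool will_write, unsigned int ratio)
{
    assert(ratio <= 100);
    for (int pass = 0; pass < 2; pass++) {
        bool enforce = will_write && pass == 0;

        for (size_t i = 0; i < n; i++) {
            if (zl[i].free == 0)
                continue;
            if (enforce && !zone_dirty_ok(&zl[i], ratio))
                continue;
            return &zl[i];
        }
    }
    return NULL;
}
```

    With a 40% ratio, a preferred zone holding 500 dirty pages out of 1000
    dirtyable ones is skipped for a write allocation, while a read-only
    allocation would still be placed there.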

    Test results

    15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
    40% dirty ratio
    16G USB thumb drive
    10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

                      seconds           nr_vmscan_write
                     (stddev)           min|      median|         max
    xfs
    vanilla:  549.747( 3.492)         0.000|       0.000|       0.000
    patched:  550.996( 3.802)         0.000|       0.000|       0.000

    fuse-ntfs
    vanilla: 1183.094(53.178)     54349.000|   59341.000|   65163.000
    patched:  558.049(17.914)         0.000|       0.000|      43.000

    btrfs
    vanilla:  573.679(14.015)    156657.000|  460178.000|  606926.000
    patched:  563.365(11.368)         0.000|       0.000|    1362.000

    ext4
    vanilla:  561.197(15.782)         0.000| 2725438.000| 4143837.000
    patched:  568.806(17.496)         0.000|       0.000|       0.000

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Tested-by: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Per-zone dirty limits try to distribute page cache pages allocated for
    writing across zones in proportion to the individual zone sizes, to reduce
    the likelihood of reclaim having to write back individual pages from the
    LRU lists in order to make progress.

    This patch:

    The amount of dirtyable pages should not include the full number of free
    pages: there is a number of reserved pages that the page allocator and
    kswapd always try to keep free.

    The closer (reclaimable pages - dirty pages) is to the number of reserved
    pages, the more likely it becomes for reclaim to run into dirty pages:

    +----------+ ---
    |   anon   |  |
    +----------+  |
    |          |  |
    |          |  -- dirty limit new    -- flusher new
    |   file   |  |                     |
    |          |  |                     |
    |          |  -- dirty limit old    -- flusher old
    |          |                        |
    +----------+                       --- reclaim
    | reserved |
    +----------+
    |  kernel  |
    +----------+

    This patch introduces a per-zone dirty reserve that takes both the lowmem
    reserve as well as the high watermark of the zone into account, and a
    global sum of those per-zone values that is subtracted from the global
    amount of dirtyable pages. The lowmem reserve is unavailable to page
    cache allocations and kswapd tries to keep the high watermark free. We
    don't want to end up in a situation where reclaim has to clean pages in
    order to balance zones.

    Not treating reserved pages as dirtyable on a global level is only a
    conceptual fix. In reality, dirty pages are not distributed equally
    across zones and reclaim runs into dirty pages on a regular basis.

    But it is important to get this right before tackling the problem on a
    per-zone level, where the distance between reclaim and the dirty pages is
    mostly much smaller in absolute numbers.
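
    As a rough illustration of the arithmetic, the following userspace C
    sketch computes a per-zone reserve, sums it globally and subtracts it
    from the dirtyable total. The struct and function names are
    hypothetical, and clamping the reserve to the zone size is an
    assumption of this sketch rather than something stated above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-zone inputs, all in pages. */
struct zone_reserve_model {
    unsigned long present;        /* pages managed by the zone */
    unsigned long lowmem_reserve; /* reserve unavailable to page cache */
    unsigned long high_wmark;     /* watermark kswapd tries to keep free */
};

/* Per-zone dirty reserve: lowmem reserve plus high watermark. */
unsigned long zone_dirty_reserve(const struct zone_reserve_model *z)
{
    unsigned long reserve = z->lowmem_reserve + z->high_wmark;

    /* Assumption: a zone cannot reserve more pages than it has. */
    return reserve > z->present ? z->present : reserve;
}

/* Global reserve: the sum of the per-zone values. */
unsigned long global_dirty_reserve(const struct zone_reserve_model *zones,
                                   size_t n)
{
    unsigned long sum = 0;

    assert(zones != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        sum += zone_dirty_reserve(&zones[i]);
    return sum;
}

/* Dirtyable pages: free plus reclaimable, minus the reserve. */
unsigned long dirtyable_pages(unsigned long free, unsigned long reclaimable,
                              unsigned long reserve)
{
    unsigned long total = free + reclaimable;

    return total > reserve ? total - reserve : 0;
}
```

    Subtracting the reserve lowers the dirty limit derived from the
    dirtyable total, which moves the limit away from the point where
    reclaim starts, as the diagram shows.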

    [akpm@linux-foundation.org: fix highmem build]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Calling alloc_pages_exact_node() means the allocation only passes the
    zonelist of a single node into the page allocator. If that node isn't
    online, its zonelist may never have been initialized, causing a strange
    oops whose cause may not be immediately clear.

    I recently debugged an issue where node 0 wasn't online and an allocator
    was passing 0 to alloc_pages_exact_node() and it resulted in a NULL
    pointer on zonelist->_zoneref. If CONFIG_DEBUG_VM is enabled, though, it
    would be nice to catch this a bit earlier.
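
    The kind of early check meant here can be sketched in userspace C,
    with assert() standing in for a debug-only kernel assertion. Everything
    in this block (MODEL_MAX_NODES, online_mask, exact_node_arg_ok(),
    alloc_on_exact_node()) is a hypothetical stand-in, not the code the
    patch adds.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_NODES 8 /* hypothetical upper bound on node ids */

/* Bit n set means node n is online; in this model only node 1 is. */
static unsigned long online_mask = 0x2;

static bool model_node_online(int nid)
{
    return (online_mask >> nid) & 1UL;
}

/* True only for an in-range node id that is currently online. */
bool exact_node_arg_ok(int nid)
{
    return nid >= 0 && nid < MODEL_MAX_NODES && model_node_online(nid);
}

/*
 * Model of an exact-node allocation entry point: validate the node id
 * before touching any per-node state, so a bad caller fails right here
 * instead of dereferencing an uninitialized zonelist later.
 */
int alloc_on_exact_node(int nid)
{
    assert(exact_node_arg_ok(nid));
    return nid; /* a real allocator would return a page from this node */
}
```

    In this model, passing node 0 trips the assertion at the call site,
    which points at the faulty caller rather than at the zonelist walk.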

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • With CONFIG_DEBUG_PAGEALLOC configured, the CPU will generate an exception
    on access (read, write) to an unallocated page, which permits us to catch
    code which corrupts memory. However, the kernel is trying to maximise
    memory usage, hence there are usually few free pages in the system and
    buggy code usually corrupts some crucial data.

    This patch changes the buddy allocator to keep more free/protected pages
    and to interlace free/protected and allocated pages to increase the
    probability of catching corruption.

    When the kernel is compiled with CONFIG_DEBUG_PAGEALLOC,
    debug_guardpage_minorder defines the minimum order used by the page
    allocator to grant a request. The requested size will be returned with
    the remaining pages used as guard pages.

    The default value of debug_guardpage_minorder is zero: no change from
    current behaviour.
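
    One way to picture the effect of debug_guardpage_minorder is a
    userspace model of a buddy split. When a block of order "high" is split
    to satisfy a request of order "low", each leftover half whose order is
    below the minimum order is kept as guard pages instead of going back to
    the free lists. struct split_result and split_block() are hypothetical
    names for this sketch, which only counts pages.

```c
#include <assert.h>

/* Page counts produced by one split; hypothetical bookkeeping only. */
struct split_result {
    unsigned long allocated; /* pages handed to the caller */
    unsigned long guard;     /* leftover pages kept as guard pages */
    unsigned long freed;     /* leftover pages returned to the free lists */
};

/*
 * Split a free block of order `high` down to order `low`. Every halving
 * leaves one buddy of the new, smaller order behind; buddies below
 * `minorder` become guard pages, the others stay free.
 */
struct split_result split_block(unsigned int low, unsigned int high,
                                unsigned int minorder)
{
    struct split_result r = { 1UL << low, 0, 0 };

    assert(low <= high);
    while (high > low) {
        high--;
        if (high < minorder)
            r.guard += 1UL << high;
        else
            r.freed += 1UL << high;
    }
    return r;
}
```

    For example, split_block(0, 3, 2) hands out one page, keeps three
    pages as guards and frees four, while a minimum order of zero frees
    all seven leftover pages, matching the unchanged default behaviour.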

    [akpm@linux-foundation.org: tweak documentation, s/flg/flag/]
    Signed-off-by: Stanislaw Gruszka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: "Rafael J. Wysocki"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislaw Gruszka
     
  • We can place this in definitions that we expect the compiler to remove by
    dead code elimination. If this assertion fails, we get a nice error
    message at build time.

    The GCC function attribute error("message") was added in version 4.3, so
    we define a new macro __linktime_error(message) to expand to this for
    GCC-4.3 and later. This will give us an error diagnostic from the
    compiler on the line that fails. For other compilers
    __linktime_error(message) expands to nothing, and we have to be content
    with a link time error, but at least we will still get a build error.

    BUILD_BUG() expands to the undefined function __build_bug_failed() and
    will fail at link time if the compiler ever emits code for it. On GCC-4.3
    and later, __attribute__((error(message))) is used so that the failure will be noted
    at compile time instead.
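
    A minimal standalone sketch of the idea follows. The MODEL_ and model_
    names are hypothetical, chosen so as not to suggest they are the
    kernel's definitions, and the compiler version test is an assumption of
    this sketch. The call sits in a branch whose condition is a false
    compile-time constant, so the compiler is expected to drop it and the
    program builds; if the branch could ever be taken, the build would
    fail.

```c
#include <assert.h> /* for the runtime check in the usage note */
#include <limits.h>

/* Use the GCC error attribute when available, otherwise nothing. */
#if defined(__GNUC__) && !defined(__clang__) && \
    (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))
# define model_linktime_error(msg) __attribute__((error(msg)))
#else
# define model_linktime_error(msg)
#endif

/*
 * Deliberately never defined. Any call that survives dead code
 * elimination is an error: at compile time with the attribute,
 * otherwise as an undefined reference at link time.
 */
extern void model_build_bug_failed(void)
    model_linktime_error("MODEL_BUILD_BUG failed");

#define MODEL_BUILD_BUG() model_build_bug_failed()

/* Width of long in bits; the impossible branch documents an assumption. */
unsigned int long_bits(void)
{
    if (sizeof(long) < sizeof(int))
        MODEL_BUILD_BUG(); /* constant-false branch, so no call is emitted */
    return (unsigned int)(sizeof(long) * CHAR_BIT);
}
```

    A caller can still combine this with an ordinary runtime check such as
    assert(long_bits() >= 32); the difference is that the build-time
    variant costs nothing at run time.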

    Signed-off-by: David Daney
    Acked-by: David Rientjes
    Cc: DM
    Cc: Ralf Baechle
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Daney
     
  • Colin Cross reported:

    Under the following conditions, __alloc_pages_slowpath can loop forever:
    gfp_mask & __GFP_WAIT is true
    gfp_mask & __GFP_FS is false
    reclaim and compaction make no progress
    order
    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Minchan Kim
    Cc: Pekka Enberg
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Rename mm_page_free_direct into mm_page_free and mm_pagevec_free into
    mm_page_free_batched

    Since v2.6.33-5426-gc475dab the kernel triggers mm_page_free_direct for
    all freed pages, not only for directly freed ones. So, let's name it
    properly. For pages freed via page-list we also trigger the
    mm_page_free_batched event.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov