25 May, 2011

40 commits

  • This function makes a deep copy of the platform data to allow it to live
    in init memory. For a kernel that supports several machines and so
    includes the definition for several leds-gpio devices this saves quite
    some memory because all but one definition can be free'd after boot.

    As the function is used by arch code it must be builtin and so cannot go
    into leds-gpio.c.

    [akpm@linux-foundation.org: s/CONFIG_LED_REGISTER_GPIO/CONFIG_LEDS_REGISTER_GPIO/]
    Signed-off-by: Uwe Kleine-König
    Cc: Russell King
    Acked-by: Richard Purdie
    Cc: Fabio Estevam
    Cc: Sascha Hauer
    Tested-by: H Hartley Sweeten
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
  • Add regulator support to the lm3530 driver. The lm3530 driver needs to
    get proper regulator during device probe and enable it before accessing
    the device. Also it disables the regulator in case of brightness ==
    LED_OFF, and puts it back during driver removal.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Shreshtha Kumar Sahu
    Cc: Lee Jones
    Cc: Shreshtha Kumar Sahu
    Cc: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shreshtha Kumar Sahu
     
  • The H1940 machine now uses leds-gpio and leds-h1940 has no users anymore.

    Signed-off-by: Vasily Khoruzhick
    Cc: "Arnaud Patard (Rtp)"
    Cc: Ben Dooks
    Cc: Richard Purdie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Khoruzhick
     
  • The pca953x family differs only in the number of leds and the register
    layout. Add chipinfo so the driver can be used with the whole pca953x
    family. Rename the driver to pca953x, but leave the files and
    platformflags named pca9532.

    Tested with pca9530 and pca9533

    Tested-by: Juergen Kilb
    Signed-off-by: Jan Weitzel
    Acked-by: Joachim Eastwood
    Tested-by: Joachim Eastwood
    Cc: Wolfram Sang
    Cc: H Hartley Sweeten
    Cc: Richard Purdie
    Cc: Grant Likely
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Weitzel
     
  • Allow unused leds on pca9532 to be used as gpio. The board I am working
    on now has no less than 6 pca9532 chips. One chip is used only for leds,
    one has 14 leds and 2 gpio and the rest of the chips are gpio only.

    There is also one board in mainline which could use this capability;
    arch/arm/mach-iop32x/n2100.c
    { .type = PCA9532_TYPE_NONE }, /* power OFF gpio */
    { .type = PCA9532_TYPE_NONE }, /* reset gpio */

    This patch defines a new pin type, PCA9532_TYPE_GPIO, and registers a
    gpiochip if any pin has this type set. The gpiochip registers all chip
    pins but filters them on gpio_request.

    [randy.dunlap@oracle.com: fix build when GPIOLIB is not enabled]
    Signed-off-by: Joachim Eastwood
    Reviewed-by: Wolfram Sang
    Reviewed-by: H Hartley Sweeten
    Cc: Richard Purdie
    Cc: Grant Likely
    Signed-off-by: Randy Dunlap
    Cc: Jan Weitzel
    Cc: Juergen Kilb
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joachim Eastwood
     
  • By setting initial values blink_delay_on and blink_delay_off in a
    led_classdev struct, this change starts the blinking when the led is
    initialized.

    With this patch, you can initialize blink_delay_on and blink_delay_off in
    led_classdev with default_trigger set to "timer", and the led will start
    up blinking. The current ledtrig-timer implementation ignores any initial
    blink_delay_on/blink_delay_off settings, and requires setting
    blink_delay_on/blink_delay_off (typically from userspace) before the led
    blinks.

    Signed-off-by: Esben Haabendal
    Cc: Richard Purdie
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Esben Haabendal
     
  • Tobias's email bounces and he hasn't submitted or acked a patch in git
    history.

    Signed-off-by: Joe Perches
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • This tree hasn't been updated since June 2008.

    Signed-off-by: Lucian Adrian Grijincu
    Acked-by: Chris Wright
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lucian Adrian Grijincu
     
  • On larger systems, because of the numerous ACPI, Bootmem and EFI messages,
    the static log buffer overflows before the larger one specified by the
    log_buf_len param is allocated. Minimize the overflow by allocating the
    new log buffer as soon as possible.

    On kernels without memblock, a later call to setup_log_buf from
    kernel/init.c is the fallback.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix CONFIG_PRINTK=n build]
    Signed-off-by: Mike Travis
    Cc: Yinghai Lu
    Cc: "H. Peter Anvin"
    Cc: Jack Steiner
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Travis
     
  • On larger systems, information in the kernel log is lost because there is
    so much early text printed, that it overflows the static log buffer before
    the log_buf_len kernel parameter can be processed, and a bigger log buffer
    allocated.

    Distros are reluctant to increase memory usage by increasing the size of
    the static log buffer, so minimize the problem by allocating the new log
    buffer as early as possible.

    This patch:

    Add an error return if CONFIG_HAVE_MEMBLOCK is not set instead of having
    to add #ifdef CONFIG_HAVE_MEMBLOCK around blocks of code calling that
    function.

    Signed-off-by: Mike Travis
    Cc: Yinghai Lu
    Cc: "H. Peter Anvin"
    Cc: Jack Steiner
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • Otherwise, the warning at the top of vsnprintf() gets triggered by
    kvasprintf()'s first invocation (with NULL buffer and zero size) of
    vsnprintf().

    Signed-off-by: Jan Beulich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
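
For context, the two-pass pattern the message refers to can be sketched in userspace C. This is only an illustrative analogue, not the kernel's kvasprintf(): the names sketch_vasprintf() and sketch_asprintf() are hypothetical, and malloc() stands in for the kernel allocator. The first vsnprintf() call passes a NULL buffer and zero size purely to learn the required length, which is the invocation the commit says must not trip the warning.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace analogue: size the output first, then format it. */
static char *sketch_vasprintf(const char *fmt, va_list ap)
{
    va_list aq;
    int len;
    char *buf;

    assert(fmt != NULL);

    /* First pass: NULL buffer and zero size, only the length is wanted. */
    va_copy(aq, ap);
    len = vsnprintf(NULL, 0, fmt, aq);
    va_end(aq);
    if (len < 0)
        return NULL;

    buf = malloc((size_t)len + 1);
    if (buf == NULL)
        return NULL;

    /* Second pass: format into the buffer that is now large enough. */
    vsnprintf(buf, (size_t)len + 1, fmt, ap);
    return buf;
}

/* Variadic convenience wrapper around sketch_vasprintf(). */
static char *sketch_asprintf(const char *fmt, ...)
{
    va_list ap;
    char *s;

    va_start(ap, fmt);
    s = sketch_vasprintf(fmt, ap);
    va_end(ap);
    return s;
}
```

A caller would use sketch_asprintf("%s-%d", "cpu", 42) and free() the returned string when done.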
     
  • sparse can't parse the warning and error attributes, so they should be
    hidden from sparse.

    Signed-off-by: KOSAKI Motohiro
    Cc: Arjan van de Ven
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • commit c5e631cf65f ("ARRAY_SIZE: check for type") added __must_be_array().
    But sparse can't parse this gcc extension.

    Now make C=2 produces a lot of sparse errors like the following:

    kernel/futex.c:2699:25: error: No right hand side of '+'-expression

    Because __must_be_array() is used for ARRAY_SIZE() macro and it is
    used very widely.

    This patch fixes it.

    Signed-off-by: KOSAKI Motohiro
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
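
A rough sketch of the idea, assuming a gcc-compatible compiler: the array check relies on gcc extensions, so it is replaced by a constant when sparse (which defines __CHECKER__) is doing the parsing. All SKETCH_ names are hypothetical and the definitions are simplified; they are not copied from the kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/* True when two expressions have compatible types (gcc extension). */
#define SKETCH_SAME_TYPE(a, b) \
    __builtin_types_compatible_p(__typeof__(a), __typeof__(b))

#ifdef __CHECKER__
/* sparse cannot parse the check below, so hide it from sparse. */
#define SKETCH_MUST_BE_ARRAY(a) 0
#else
/*
 * For a pointer, (a) and &(a)[0] have the same type, the bit-field width
 * becomes -1, and the build fails. For a real array the width is 0 and
 * the expression evaluates to 0.
 */
#define SKETCH_MUST_BE_ARRAY(a) \
    (sizeof(struct { int : -!!SKETCH_SAME_TYPE((a), &(a)[0]); }))
#endif

#define SKETCH_ARRAY_SIZE(a) \
    (sizeof(a) / sizeof((a)[0]) + SKETCH_MUST_BE_ARRAY(a))

/* Example use: count the elements of a static table. */
static size_t sketch_table_len(void)
{
    static const int primes[] = { 2, 3, 5, 7, 11 };

    return SKETCH_ARRAY_SIZE(primes);
}
```

Passing a pointer instead of an array to SKETCH_ARRAY_SIZE() would fail to compile under gcc, while sparse only ever sees the plain division.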
     
  • BUILD_BUG_ON() causes a syntax error to detect coding errors. So it
    causes sparse to detect an error too. This reduces sparse's usefulness.

    This patch makes a dummy BUILD_BUG_ON() definition for sparse.

    Signed-off-by: KOSAKI Motohiro
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
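
A minimal sketch of what such a dummy definition can look like. sparse defines __CHECKER__ while it runs; the SKETCH_BUILD_BUG_ON name and the exact expansion are assumptions made for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#ifdef __CHECKER__
/* Dummy version for sparse: expands to nothing that could be rejected. */
#define SKETCH_BUILD_BUG_ON(cond) ((void)0)
#else
/* A true condition yields a negative array size, which breaks the build. */
#define SKETCH_BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))
#endif

/* Compiles only if the asserted layout assumption holds. */
static int sketch_check_layout(void)
{
    SKETCH_BUILD_BUG_ON(sizeof(uint32_t) * CHAR_BIT != 32);
    return 1;
}
```

The compiler still enforces the condition in normal builds; only the checker is spared the deliberate syntax error.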
     
  • A fix to the TSC (Time Stamp Counter) based bogoMIPS calculation used on
    secondary CPUs which has two faults:

    1: Not handling wrapping of the lower 32 bits of the TSC counter on
    32bit kernel - perhaps TSC is not reset by a warm reset?

    2: TSC and Jiffies are not incrementing together properly. Either
    jiffies increment too quickly or the Time Stamp Counter isn't
    incremented during an SMI but the real time clock is and jiffies are
    incremented.

    Case 1 can result in a factor of 16 too large a value which makes udelay()
    values too small and can cause mysterious driver errors. Case 2 appears
    to give smaller 10-15% errors after averaging but enough to cause
    occasional failures on my own board.

    I have tested this code on my own branch and attach a patch suitable for
    current kernel code. See below for examples of the failures and how the
    fix handles these situations now.

    I reported this issue earlier here:
    Intermittent problem with BogoMIPs calculation on Intel AP CPUs -
    http://marc.info/?l=linux-kernel&m=129947246316875&w=4

    I suspect this issue has been seen by others but as it is intermittent and
    bogoMIPS for secondary CPUs are no longer printed out it might have been
    difficult to identify this as the cause. Perhaps these unresolved issues,
    although quite old, might be relevant as possibly this fault has been
    around for a while. In particular Case 1 may only be relevant to 32bit
    kernels on newer HW (most people run 64bit kernels?). Case 2 is less
    dramatic since the earlier fix in this area and also intermittent.

    Re: bogomips discrepancy on Intel Core2 Quad CPU -
    http://marc.info/?l=linux-kernel&m=118929277524298&w=4
    slow system and bogus bogomips -
    http://marc.info/?l=linux-kernel&m=116791286716107&w=4
    Re: Re: [RFC-PATCH] clocksource: update lpj if clocksource has -
    http://marc.info/?l=linux-kernel&m=128952775819467&w=4

    This issue is masked a little by commit feae3203d711db0a ("timers, init:
    Limit the number of per cpu calibration bootup messages") which only
    prints out the first bogoMIPS value making it much harder to notice other
    values differing. Perhaps it should be changed to only suppress them when
    they are similar values?

    Here are some outputs showing faults occurring and the new code handling
    them properly. See my earlier message for examples of the original
    failure.

    Case 1: A Time Stamp Counter wrap:
    ...
    Calibrating delay loop (skipped), value calculated using timer
    frequency.. 6332.70 BogoMIPS (lpj=31663540)
    ....
    calibrate_delay_direct() timer_rate_max=31666493
    timer_rate_min=31666151 pre_start=4170369255 pre_end=4202035539
    calibrate_delay_direct() timer_rate_max=2425955274
    timer_rate_min=2425954941 pre_start=4265368533 pre_end=2396356387
    calibrate_delay_direct() ignoring timer_rate as we had a TSC wrap
    around start=4265368581 >=post_end=2396356511
    calibrate_delay_direct() timer_rate_max=31666274
    timer_rate_min=31665942 pre_start=2440373374 pre_end=2472039515
    calibrate_delay_direct() timer_rate_max=31666492
    timer_rate_min=31666160 pre_start=2535372139 pre_end=2567038422
    calibrate_delay_direct() timer_rate_max=31666455
    timer_rate_min=31666207 pre_start=2630371084 pre_end=2662037415
    Calibrating delay using timer specific routine.. 6333.28 BogoMIPS (lpj=31666428)
    Total of 2 processors activated (12665.99 BogoMIPS).
    ....

    Case 2: Something (presumably the SMM interrupt?) causing a
    very low increase in the TSC counter for the DELAY_CALIBRATION_TICKS
    increase in jiffies
    ...
    Calibrating delay loop (skipped), value calculated using timer
    frequency.. 6333.25 BogoMIPS (lpj=31666270)
    ...
    calibrate_delay_direct() timer_rate_max=31666483
    timer_rate_min=31666074 pre_start=4199536526 pre_end=4231202809
    calibrate_delay_direct() timer_rate_max=864348 timer_rate_min=864016
    pre_start=2405343672 pre_end=2406207897
    calibrate_delay_direct() timer_rate_max=31666483
    timer_rate_min=31666179 pre_start=2469540464 pre_end=2501206823
    calibrate_delay_direct() timer_rate_max=31666511
    timer_rate_min=31666122 pre_start=2564539400 pre_end=2596205712
    calibrate_delay_direct() timer_rate_max=31666084
    timer_rate_min=31665685 pre_start=2659538782 pre_end=2691204657
    calibrate_delay_direct() dropping min bogoMips estimate 1 = 864348
    Calibrating delay using timer specific routine.. 6333.27 BogoMIPS (lpj=31666390)
    Total of 2 processors activated (12666.53 BogoMIPS).
    ...

    After 70 boots I saw 2 variations
    Reviewed-by: Phil Carmody
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Worsley
     
  • af4f136056c9 ("security: move LSM xattrnames to xattr.h") moved the
    XATTR_CAPS_SUFFIX define from capability.h to xattr.h. This makes sense
    except it was previously exported to userspace but xattr.h does not export
    it to userspace. This patch exports these headers to userspace to fix the
    ABI regression.

    There is some slight possibility that this will cause problems in other
    applications which used these #defines differently (wrongly) and I could
    JUST export the capabilities xattr name that we broke. Does anyone have an
    idea how exposing these headers could cause a problem?

    Below is what is being exposed to userspace, included here since it isn't
    clear exactly what is going to be made available from the patch.

    /* Namespaces */
    #define XATTR_OS2_PREFIX "os2."
    #define XATTR_OS2_PREFIX_LEN (sizeof (XATTR_OS2_PREFIX) - 1)

    #define XATTR_SECURITY_PREFIX "security."
    #define XATTR_SECURITY_PREFIX_LEN (sizeof (XATTR_SECURITY_PREFIX) - 1)

    #define XATTR_SYSTEM_PREFIX "system."
    #define XATTR_SYSTEM_PREFIX_LEN (sizeof (XATTR_SYSTEM_PREFIX) - 1)

    #define XATTR_TRUSTED_PREFIX "trusted."
    #define XATTR_TRUSTED_PREFIX_LEN (sizeof (XATTR_TRUSTED_PREFIX) - 1)

    #define XATTR_USER_PREFIX "user."
    #define XATTR_USER_PREFIX_LEN (sizeof (XATTR_USER_PREFIX) - 1)

    /* Security namespace */
    #define XATTR_SELINUX_SUFFIX "selinux"
    #define XATTR_NAME_SELINUX XATTR_SECURITY_PREFIX XATTR_SELINUX_SUFFIX

    #define XATTR_SMACK_SUFFIX "SMACK64"
    #define XATTR_SMACK_IPIN "SMACK64IPIN"
    #define XATTR_SMACK_IPOUT "SMACK64IPOUT"
    #define XATTR_NAME_SMACK XATTR_SECURITY_PREFIX XATTR_SMACK_SUFFIX
    #define XATTR_NAME_SMACKIPIN XATTR_SECURITY_PREFIX XATTR_SMACK_IPIN
    #define XATTR_NAME_SMACKIPOUT XATTR_SECURITY_PREFIX XATTR_SMACK_IPOUT

    #define XATTR_CAPS_SUFFIX "capability"
    #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX

    Reported-by: Ozan Çaglayan
    Signed-off-by: Eric Paris
    Cc: Mimi Zohar
    Cc: Serge Hallyn
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Paris
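
As a small illustration of how userspace might consume these defines: adjacent string literals concatenate, so XATTR_NAME_CAPS expands to "security.capability", and the _LEN macros give the prefix length without the terminating NUL. The sketch below carries local copies of the relevant defines so it builds without kernel headers; the helper name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Local copies of the listed defines, used only if no header provided them. */
#ifndef XATTR_SECURITY_PREFIX
#define XATTR_SECURITY_PREFIX "security."
#define XATTR_SECURITY_PREFIX_LEN (sizeof(XATTR_SECURITY_PREFIX) - 1)
#endif
#ifndef XATTR_CAPS_SUFFIX
#define XATTR_CAPS_SUFFIX "capability"
#define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
#endif

/* Hypothetical helper: is this attribute name in the security namespace? */
static bool sketch_is_security_xattr(const char *name)
{
    assert(name != NULL);
    return strncmp(name, XATTR_SECURITY_PREFIX,
                   XATTR_SECURITY_PREFIX_LEN) == 0;
}
```

A tool that reads file capabilities would pass XATTR_NAME_CAPS as the attribute name to its xattr lookup.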
     
  • Manually adjusting the smp_affinity for IRQ's becomes unwieldy when the
    cpu count is large.

    Setting smp affinity to cpus 256 to 263 would be:

    echo 000000ff,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000 > smp_affinity

    instead of:

    echo 256-263 > smp_affinity_list

    Think about what it looks like for cpus around say, 4088 to 4095.

    We already have many alternate "list" interfaces:

    /sys/devices/system/cpu/cpuX/indexY/shared_cpu_list
    /sys/devices/system/cpu/cpuX/topology/thread_siblings_list
    /sys/devices/system/cpu/cpuX/topology/core_siblings_list
    /sys/devices/system/node/nodeX/cpulist
    /sys/devices/pci***/***/local_cpulist

    Add a companion interface, smp_affinity_list to use cpu lists instead of
    cpu maps. This conforms to other companion interfaces where both a map
    and a list interface exists.

    This required adding a bitmap_parselist_user() function in a manner
    similar to the bitmap_parse_user() function.

    [akpm@linux-foundation.org: make __bitmap_parselist() static]
    Signed-off-by: Mike Travis
    Cc: Thomas Gleixner
    Cc: Jack Steiner
    Cc: Lee Schermerhorn
    Cc: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Travis
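
To make the relationship between the two formats concrete, here is a userspace sketch that parses a cpu list into 32-bit words and renders the equivalent comma-separated hex mask. It is not the kernel's bitmap_parselist(); the function names, the 512-cpu limit and the choice to start printing at the highest non-zero word are assumptions for the example. CPU 256 is bit 0 of word 8, so eight all-zero words follow the 000000ff group.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_NR_CPUS 512                      /* assumed limit */
#define SKETCH_WORDS   (SKETCH_NR_CPUS / 32)

/* Parse lists such as "0,4-7,256-263". Returns 0 on success, -1 on bad input. */
static int sketch_parselist(const char *s, uint32_t *mask)
{
    memset(mask, 0, SKETCH_WORDS * sizeof(*mask));

    while (*s != '\0') {
        char *end;
        unsigned long lo, hi, cpu;

        if (*s < '0' || *s > '9')
            return -1;
        lo = strtoul(s, &end, 10);
        hi = lo;
        if (*end == '-') {                      /* range "lo-hi" */
            s = end + 1;
            if (*s < '0' || *s > '9')
                return -1;
            hi = strtoul(s, &end, 10);
        }
        if (lo > hi || hi >= SKETCH_NR_CPUS)
            return -1;
        for (cpu = lo; cpu <= hi; cpu++)
            mask[cpu / 32] |= (uint32_t)1 << (cpu % 32);

        if (*end == ',')
            end++;
        else if (*end != '\0')
            return -1;
        s = end;
    }
    return 0;
}

/* Render 32-bit hex groups, most significant non-zero word first. */
static void sketch_format_mask(const uint32_t *mask, char *out, size_t outlen)
{
    int i, top = SKETCH_WORDS - 1;
    size_t used = 0;

    assert(outlen > 0);
    out[0] = '\0';
    while (top > 0 && mask[top] == 0)
        top--;
    for (i = top; i >= 0 && used < outlen; i--)
        used += (size_t)snprintf(out + used, outlen - used, "%s%08x",
                                 i == top ? "" : ",", (unsigned int)mask[i]);
}
```

Parsing "256-263" and formatting the result reproduces the long mask form, which is the tedium the list interface avoids.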
     
  • There is no CONFIG_WORKQUEUE_DEBUGFS any more, so this code is dead.

    Signed-off-by: WANG Cong
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amerigo Wang
     
  • The presence of a writeq() implementation on 32-bit x86 that splits the
    64-bit write into two 32-bit writes turns out to break the mpt2sas driver
    (and in general is risky for drivers as was discussed in
    ). To fix this,
    revert 2c5643b1c5c7 ("x86: provide readq()/writeq() on 32-bit too") and
    follow-on cleanups.

    This unfortunately leads to pushing non-atomic definitions of readq() and
    writeq() to various x86-only drivers that in the meantime started using the
    definitions in the x86 version of . However as discussed
    exhaustively, this is actually the right thing to do, because the right
    way to split a 64-bit transaction is hardware dependent and therefore
    belongs in the hardware driver (eg mpt2sas needs a spinlock to make sure
    no other accesses occur in between the two halves of the access).

    Build tested on 32- and 64-bit x86 allmodconfig.

    Link: http://lkml.kernel.org/r/x86-32-writeq-is-broken@mdm.bga.com
    Acked-by: Hitoshi Mitake
    Cc: Kashyap Desai
    Cc: Len Brown
    Cc: Ravi Anand
    Cc: Vikas Chaudhary
    Cc: Matthew Garrett
    Cc: Jason Uhlenkott
    Acked-by: James Bottomley
    Acked-by: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland Dreier
     
  • This constant hasn't been used since before the git era (2.6.12) and thus
    can be dropped.

    Signed-off-by: Stephen Boyd
    Cc: Russell King
    Cc: Richard Weinberger
    Cc: Hirokazu Takata
    Cc: Kyle McMartin
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Boyd
     
  • The macro to_class_dev() uses the deprecated structure class_device, and
    the c2port_device has no member named class in the definition of the macro
    to_c2port_device.

    Signed-off-by: Wanlong Gao
    Cc: Rodolfo Giometti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanlong Gao
     
  • Apps are increasingly using more than 1024 file descriptors. See
    discussion in several distro bug trackers, e.g. BugLink:
    http://bugs.launchpad.net/bugs/663090
    https://issues.rpath.com/browse/RPL-2054

    You don't want to raise the default soft limit, since that might break
    apps that use select(), but it's safe to raise the default hard limit;
    that way, apps that know they need lots of file descriptors can raise
    their soft limit without needing root, and without user intervention.

    Ubuntu is doing this with a kernel change because they have a policy of
    not changing kernel defaults in userland.

    While 4096 might not be enough for *all* apps, it seems to be plenty for
    the apps I've seen lately that are unhappy with 1024.

    Signed-off-by: Tim Gardner
    Cc: Dan Kegel
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Gardner
     
  • os_dump_core() emits SIGTERM to terminate all UML processes. Kernel
    threads have to exit on SIGTERM instead of calling last_ditch_exit().
    Multiple calls to last_ditch_exit() can cause a crash.

    Signed-off-by: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • Fix build failures on UML.

    Signed-off-by: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • Print a short info about fatal segfaults like other archs do.

    Signed-off-by: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • The ucast transport is similar to the mcast transport (and, in fact,
    shares most of its code), only it uses UDP unicast to move packets.

    Obviously this is only useful for point-to-point connections between
    virtual ethernet devices.

    Signed-off-by: Nolan Leake
    Signed-off-by: Richard Weinberger
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nolan Leake
     
  • User Mode Linux can also benefit from earlyprintk. UML's earlyprintk
    writes kernel messages directly to stdout.

    Signed-off-by: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • The UML kernel ignores SIGHUP anyway. This handler is in vain.

    Signed-off-by: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • UML_LIB_PATH is hardcoded to /usr/lib/uml/, but on 64bit systems it
    needs to be /usr/lib64/uml/.

    Signed-off-by: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • Adapt to the new API.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Cc: Mikael Starvik
    Cc: Jesper Nilsson
    Cc: Thiago Farina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Adapt to the new API.

    We plan to remove old cpumask APIs later. Thus this patch converts them
    into the new one.

    Signed-off-by: KOSAKI Motohiro
    Cc: David Howells
    Cc: Koichi Yasutake
    Cc: Hugh Dickins
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Allow people to use gpiolib on Alpha if they want to, mostly for build
    coverage. The header is a straight copy of that for Microblaze, which in
    turn was taken from PowerPC.

    [akpm@linux-foundation.org: define GENERIC_GPIO]
    Signed-off-by: Mark Brown
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Acked-by: Grant Likely
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Brown
     
  • We plan to remove cpu_xx() old APIs. Thus convert them. This patch has
    no functional change.

    Signed-off-by: KOSAKI Motohiro
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently on nommu arch mmap(), mremap() and munmap() don't do
    page_align(), which isn't consistent with mmu arch and causes some issues.

    First, some drivers' mmap() functions depend on vma->vm_end - vma->vm_start
    being page aligned, which is true on mmu arch but not on nommu, e.g. the
    uvc camera driver.

    Second, munmap() may return an -EINVAL[split file] error in cases where
    end is not page aligned (as passed in from userspace) but vma->vm_end is
    aligned due to a split or the driver's mmap() ops.

    Add page alignment to fix those issues.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Bob Liu
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Greg Ungerer
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • The zone->lru_lock is heavily contended in workloads where activate_page()
    is frequently used. We could do batch activate_page() to reduce the lock
    contention. The batched pages will be added into zone list when the pool
    is full or page reclaim is trying to drain them.

    For example, in a 4 socket 64 CPU system, create a sparse file and 64
    processes that share a mapping of the file. Each process reads the
    whole file and then exits. The process exit will do unmap_vmas() and cause
    a lot of activate_page() calls. In such a workload, we saw about 58% total
    time reduction with the patch below. Other workloads with a lot of
    activate_page() calls also benefit a lot.

    Andrew Morton suggested activate_page() and putback_lru_pages() should
    follow the same path to active pages, but this is hard to implement (see
    commit 7a608572a282a ("Revert "mm: batch activate_page() to reduce lock
    contention")). On the other hand, do we really need putback_lru_pages()
    to follow the same path? I tested several FIO/FFSB benchmark (about 20
    scripts for each benchmark) in 3 machines here from 2 sockets to 4
    sockets. My test doesn't show anything significant with/without below
    patch (there is slight difference but mostly some noise which we found
    even without below patch before). Below patch basically returns to the
    same as my first post.

    I tested some microbenchmarks:
    case-anon-cow-rand-mt 0.58%
    case-anon-cow-rand -3.30%
    case-anon-cow-seq-mt -0.51%
    case-anon-cow-seq -5.68%
    case-anon-r-rand-mt 0.23%
    case-anon-r-rand 0.81%
    case-anon-r-seq-mt -0.71%
    case-anon-r-seq -1.99%
    case-anon-rx-rand-mt 2.11%
    case-anon-rx-seq-mt 3.46%
    case-anon-w-rand-mt -0.03%
    case-anon-w-rand -0.50%
    case-anon-w-seq-mt -1.08%
    case-anon-w-seq -0.12%
    case-anon-wx-rand-mt -5.02%
    case-anon-wx-seq-mt -1.43%
    case-fork 1.65%
    case-fork-sleep -0.07%
    case-fork-withmem 1.39%
    case-hugetlb -0.59%
    case-lru-file-mmap-read-mt -0.54%
    case-lru-file-mmap-read 0.61%
    case-lru-file-mmap-read-rand -2.24%
    case-lru-file-readonce -0.64%
    case-lru-file-readtwice -11.69%
    case-lru-memcg -1.35%
    case-mmap-pread-rand-mt 1.88%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq-mt 0.89%
    case-mmap-pread-seq -69.72%
    case-mmap-xread-rand-mt 0.71%
    case-mmap-xread-seq-mt 0.38%

    The most significant are:
    case-lru-file-readtwice -11.69%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq -69.72%

    which use activate_page a lot. The others are basically noise because
    each run differs slightly.

    In UP case, 'size mm/swap.o'
    before the two patches:
    text data bss dec hex filename
    6466 896 4 7366 1cc6 mm/swap.o
    after the two patches:
    text data bss dec hex filename
    6343 896 4 7243 1c4b mm/swap.o

    Signed-off-by: Shaohua Li
    Cc: KOSAKI Motohiro
    Cc: Hiroyuki Kamezawa
    Cc: Andi Kleen
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
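
The batching idea can be modelled in a few lines of userspace C. This is a toy model under stated assumptions: the structures are hypothetical, a counter stands in for taking zone->lru_lock, and the pool size of 14 is assumed for the example. Pages are queued in a local pool and the lock is taken once per full pool, or once when the pool is explicitly drained.

```c
#include <assert.h>

#define SKETCH_BATCH 14                 /* assumed pool size */

/* Counts how often the (imaginary) lru_lock would be taken. */
struct sketch_zone {
    unsigned long lock_acquisitions;
    unsigned long activated;
};

/* Local pool of pages waiting to be activated. */
struct sketch_batch {
    int nr;
    int pages[SKETCH_BATCH];
};

/* Move all queued pages to the zone under a single lock acquisition. */
static void sketch_drain(struct sketch_zone *z, struct sketch_batch *b)
{
    if (b->nr == 0)
        return;
    z->lock_acquisitions++;             /* one lock round trip per batch */
    z->activated += (unsigned long)b->nr;
    b->nr = 0;
}

/* Queue one page; drain automatically when the pool fills up. */
static void sketch_activate_page(struct sketch_zone *z,
                                 struct sketch_batch *b, int page)
{
    assert(b->nr < SKETCH_BATCH);
    b->pages[b->nr++] = page;
    if (b->nr == SKETCH_BATCH)
        sketch_drain(z, b);
}
```

Activating 100 pages this way takes the lock 8 times (7 full pools plus a final drain) instead of 100 times.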
     
  • The copy_to_user_page() function is supposed to flush the icache on the
    memory that was written, but the current asm-generic version lacks that
    logic. While normally it isn't a big deal as the asm-generic version of
    icache flushing is a stub, it is a deal for ports that want to use the
    asm-generic version as a baseline and then overlay its own specific parts
    (like icache flushing).

    Signed-off-by: Mike Frysinger
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • I believe I found a problem in __alloc_pages_slowpath, which allows a
    process to get stuck endlessly looping, even when lots of memory is
    available.

    Running an I/O and memory intensive stress-test I see a 0-order page
    allocation with __GFP_IO and __GFP_WAIT, running on a system with very
    little free memory. Right about the same time that the stress-test gets
    killed by the OOM-killer, the utility trying to allocate memory gets stuck
    in __alloc_pages_slowpath even though most of the system's memory was freed
    by the oom-kill of the stress-test.

    The utility ends up looping from the rebalance label down through the
    wait_iff_congested continuously. Because order=0,
    __alloc_pages_direct_compact skips the call to get_page_from_freelist.
    Because all of the reclaimable memory on the system has already been
    reclaimed, __alloc_pages_direct_reclaim skips the call to
    get_page_from_freelist. Since there is no __GFP_FS flag, the block with
    __alloc_pages_may_oom is skipped. The loop hits the wait_iff_congested,
    then jumps back to rebalance without ever trying to
    get_page_from_freelist. This loop repeats infinitely.

    The test case is pretty pathological. Running a mix of I/O stress-tests
    that do a lot of fork() and consume all of the system memory, I can pretty
    reliably hit this on 600 nodes, in about 12 hours. 32GB/node.

    Signed-off-by: Andrew Barry
    Signed-off-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Barry
     
  • Add SECTION_ALIGN_UP() and SECTION_ALIGN_DOWN() macros which align the given
    pfn to the upper and lower section boundary respectively.

    Required for the latest memory hotplug support for the Xen balloon driver.

    Signed-off-by: Daniel Kiper
    Reviewed-by: Konrad Rzeszutek Wilk
    David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Kiper
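
A sketch of how such macros behave, assuming 2^15 pages per section for the example. The SKETCH_ names and the helper function are hypothetical; the real section size depends on the architecture and configuration.

```c
#include <assert.h>

#define SKETCH_PFN_SECTION_SHIFT 15     /* assumed: 32768 pages per section */
#define SKETCH_PAGES_PER_SECTION (1UL << SKETCH_PFN_SECTION_SHIFT)
#define SKETCH_PAGE_SECTION_MASK (~(SKETCH_PAGES_PER_SECTION - 1))

/* Round a pfn up or down to the nearest section boundary. */
#define SKETCH_SECTION_ALIGN_UP(pfn) \
    (((pfn) + SKETCH_PAGES_PER_SECTION - 1) & SKETCH_PAGE_SECTION_MASK)
#define SKETCH_SECTION_ALIGN_DOWN(pfn) \
    ((pfn) & SKETCH_PAGE_SECTION_MASK)

/* Number of sections touched by the pfn range [start_pfn, end_pfn). */
static unsigned long sketch_sections_spanned(unsigned long start_pfn,
                                             unsigned long end_pfn)
{
    assert(start_pfn <= end_pfn);
    return (SKETCH_SECTION_ALIGN_UP(end_pfn) -
            SKETCH_SECTION_ALIGN_DOWN(start_pfn)) / SKETCH_PAGES_PER_SECTION;
}
```

A hotplug path that has to add or remove whole sections can use the rounded bounds to cover every section a pfn range touches.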
     
  • The noswapaccount parameter has been deprecated since 2.6.38 without any
    complaints from users so we can remove it. swapaccount=0|1 can be used
    instead.

    As we are removing the parameter we can also clean up swapaccount because
    it doesn't have to accept an empty string anymore (to match noswapaccount)
    and so we can push = into __setup macro rather than checking "=1" resp.
    "=0" strings

    Signed-off-by: Michal Hocko
    Cc: Hiroyuki Kamezawa
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • In show_numa_map() we collect statistics into a numa_maps structure.
    Since the number of NUMA nodes can be very large, this structure is not a
    candidate for stack allocation.

    Instead of going thru a kmalloc()+kfree() cycle each time show_numa_map()
    is invoked, perform the allocation just once when /proc/pid/numa_maps is
    opened.

    Performing the allocation when numa_maps is opened, and thus before a
    reference to the target task's mm is taken, eliminates a potential
    stalemate condition in the oom-killer as originally described by Hugh
    Dickins:

    ... imagine what happens if the system is out of memory, and the mm
    we're looking at is selected for killing by the OOM killer: while
    we wait in __get_free_page for more memory, no memory is freed
    from the selected mm because it cannot reach exit_mmap while we hold
    that reference.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson