12 Jul, 2012

1 commit

  • Commit fd3142a59af2012a7c5dc72ec97a4935ff1c5fc6 broke
    slob because part of a change intended for a later patch slipped
    into it.

    Fengguang Wu writes:

    The commit crashes the kernel w/o any dmesg output (the attached one is
    created by the script as a summary for that run). This is very
    reproducible in kvm for the attached config.

    Reported-by: Fengguang Wu
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

11 Jul, 2012

1 commit

  • kmemcheck_alloc_shadow() requires irqs to be enabled, so wait to disable
    them until after it is called for __GFP_WAIT allocations.

    This fixes a warning for such allocations:

    WARNING: at kernel/lockdep.c:2739 lockdep_trace_alloc+0x14e/0x1c0()

    Acked-by: Fengguang Wu
    Acked-by: Steven Rostedt
    Tested-by: Fengguang Wu
    Signed-off-by: David Rientjes
    Signed-off-by: Pekka Enberg

    David Rientjes
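
    A minimal sketch of the allocate-then-disable ordering described above,
    assuming a simplified allocation path (the function name and surrounding
    bookkeeping are illustrative, not the actual SLUB code):

        static struct page *allocate_slab_sketch(gfp_t flags, int order, int node)
        {
                struct page *page;
                unsigned long irqflags;

                /* May sleep for __GFP_WAIT, so irqs must still be enabled here. */
                page = alloc_pages_node(node, flags, order);
                if (page && kmemcheck_enabled)
                        kmemcheck_alloc_shadow(page, order, flags, node);

                /* Only now disable interrupts for the per-cpu bookkeeping. */
                local_irq_save(irqflags);
                /* ... per-cpu slab accounting that must run with irqs off ... */
                local_irq_restore(irqflags);

                return page;
        }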
     

09 Jul, 2012

5 commits

  • Move the mutex handling into the common kmem_cache_create()
    function.

    Then we can also move more checks out of SLAB's kmem_cache_create()
    into the common code.

    Reviewed-by: Glauber Costa
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
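
    A rough sketch of the resulting common entry point, assuming the shape
    described above (simplified; error handling and the exact
    __kmem_cache_create() signature of that series are omitted):

        /* slab_common-style wrapper: common locking and checks live here,
         * the allocator-specific work happens in __kmem_cache_create(). */
        struct kmem_cache *kmem_cache_create(const char *name, size_t size,
                                             size_t align, unsigned long flags,
                                             void (*ctor)(void *))
        {
                struct kmem_cache *s;

                get_online_cpus();
                mutex_lock(&slab_mutex);

                /* common sanity checks for all allocators can go here */
                s = __kmem_cache_create(name, size, align, flags, ctor);

                mutex_unlock(&slab_mutex);
                put_online_cpus();
                return s;
        }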
     
  • Use the mutex definition from SLAB and make it the common way to take a sleeping lock.

    This has the effect of using a mutex instead of a rw semaphore for SLUB.

    SLOB gains the use of a mutex for kmem_cache_create serialization.
    Not needed now but SLOB may acquire some more features later (like slabinfo
    / sysfs support) through the expansion of the common code that will
    need this.

    Reviewed-by: Glauber Costa
    Reviewed-by: Joonsoo Kim
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
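
    In sketch form, the change amounts to replacing SLUB's rw semaphore with
    the mutex SLAB already used (identifiers follow the description above):

        /* before, in slub: a rw semaphore serialized cache creation */
        static DECLARE_RWSEM(slub_lock);
        down_write(&slub_lock);
        /* ... create or destroy a cache ... */
        up_write(&slub_lock);

        /* after: one mutex shared by all allocators */
        DEFINE_MUTEX(slab_mutex);
        mutex_lock(&slab_mutex);
        /* ... create or destroy a cache ... */
        mutex_unlock(&slab_mutex);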
     
  • All allocators have some sort of support for the bootstrap status.

    Setup a common definition for the boot states and make all slab
    allocators use that definition.

    Reviewed-by: Glauber Costa
    Reviewed-by: Joonsoo Kim
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
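
    A simplified sketch of what such a shared definition looks like (the real
    enum of that series has a couple of extra SLAB-specific intermediate
    states):

        /* shared bootstrap state, mm/slab.h style */
        enum slab_state {
                DOWN,           /* no slab functionality yet */
                PARTIAL,        /* only the boot-time caches are usable */
                UP,             /* slab caches are generally usable */
                FULL,           /* everything, including sysfs, is wired up */
        };

        extern enum slab_state slab_state;      /* shared by sl[aou]b */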
     
  • Kmem_cache_create() does a variety of sanity checks but those
    vary depending on the allocator. Use the strictest tests and put them into
    a slab_common file. Make the tests conditional on CONFIG_DEBUG_VM.

    This patch has the effect of adding sanity checks for SLUB and SLOB
    under CONFIG_DEBUG_VM and removes the checks in SLAB for !CONFIG_DEBUG_VM.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
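
    Gathered into the common kmem_cache_create() shown in the earlier sketch,
    the strictest variant of those checks looks roughly like this fragment
    (a sketch; the exact limits and message wording may differ):

        #ifdef CONFIG_DEBUG_VM
                if (!name || in_interrupt() || size < sizeof(void *) ||
                    size > KMALLOC_MAX_SIZE) {
                        printk(KERN_ERR
                               "kmem_cache_create(%s) integrity check failed\n", name);
                        return NULL;
                }
        #endif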
     
  • If list_for_each_entry(), etc., complete a traversal of the list, the
    iterator variable ends up pointing to an address at an offset from the
    list head, and not to a meaningful structure. Thus this value should not
    be used after the end of the iteration. The patch replaces s->name with
    al->name, which is referenced nearby.

    This problem was found using Coccinelle (http://coccinelle.lip6.fr/).

    Signed-off-by: Julia Lawall
    Signed-off-by: Pekka Enberg

    Julia Lawall
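
    A generic illustration of the pattern (not the slub code itself; the
    struct and the loop body are made up for the example):

        struct item {
                struct list_head link;
                const char *name;
        };

        static void report_items(struct list_head *items)
        {
                struct item *it;

                list_for_each_entry(it, items, link)
                        pr_info("item %s\n", it->name);

                /*
                 * WRONG: if the loop ran to completion, 'it' now points at
                 * container_of(items, struct item, link), i.e. at an offset
                 * from the list head, not at a real item.
                 */
                pr_info("last: %s\n", it->name);
        }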
     

03 Jul, 2012

1 commit


02 Jul, 2012

4 commits

  • During kmem_cache_init_late(), slab transitions to the LATE state
    and, after some more work, to the FULL state, its last state.

    This is quite different from slub, which only transitions to
    its last state (previously SYSFS) in a (late) initcall, after a lot
    more of the kernel is ready.

    This means that in slab we have no way of taking actions dependent
    on the initialization of other pieces of the kernel that are supposed
    to start well after kmem_cache_init_late(), such as cgroups
    initialization.

    To make this behavior more consistent, this patch only transitions
    to the UP state in kmem_cache_init_late(). In my analysis,
    setup_cpu_cache() should be happy to test for >= UP instead of
    == FULL. It has also passed the tests I've run.

    We then only mark the FULL state after the reap timers are in place,
    meaning that no further setup is expected.

    Signed-off-by: Glauber Costa
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Pekka Enberg

    Glauber Costa
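
    In sketch form, the check described above becomes the following
    (simplified; slab's internal state variable and setup_cpu_cache()'s real
    signature and body differ):

        static void setup_cpu_cache_sketch(struct kmem_cache *cachep, gfp_t gfp)
        {
                if (slab_state >= UP) {         /* was: state == FULL */
                        /* normal path: allocate the per-cpu arrays */
                } else {
                        /* early-boot fallback */
                }
        }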
     
  • Commit 8c138b, which currently sits only in Pekka's tree and linux-next,
    tries to replace obj_size(cachep) with cachep->object_size, but has a typo
    in kmem_cache_free(): it uses "size" instead of "object_size", which causes
    some regressions.

    Reported-and-tested-by: Fengguang Wu
    Signed-off-by: Feng Tang
    Cc: Christoph Lameter
    Acked-by: Glauber Costa
    Signed-off-by: Pekka Enberg

    Feng Tang
     
  • Commit 3b0efdf ("mm, sl[aou]b: Extract common fields from struct
    kmem_cache") renamed the kmem_cache structure's "next" field to "list"
    but forgot to update one instance in leaks_show().

    Signed-off-by: Thierry Reding
    Signed-off-by: Pekka Enberg

    Thierry Reding
     
  • Using a name consistent with slub saves us an accessor function.
    In both caches, this field represents the same thing. We would
    like to use it from the mem_cgroup code.

    Signed-off-by: Glauber Costa
    Acked-by: Christoph Lameter
    CC: Pekka Enberg
    Signed-off-by: Pekka Enberg

    Glauber Costa
     

20 Jun, 2012

3 commits

  • The current implementation of unfreeze_partials() is complicated, but
    the benefit from it is insignificant. In addition, the amount of code in
    the do {} while loop hurts the fail rate of cmpxchg_double_slab().
    The current implementation, which tests the status of the cpu partial
    slab and acquires list_lock inside the do {} while loop, can skip taking
    list_lock and gain a little benefit when the front of the cpu partial
    slab is to be discarded, but this is a rare case. And when add_partial
    has been performed and cmpxchg_double_slab() then fails, remove_partial
    has to be called to undo it, case by case.

    These are disadvantages of the current implementation, so refactor
    unfreeze_partials().

    Minimizing the code in the do {} while loop reduces the fail rate of
    cmpxchg_double_slab(). Below is the output of 'slabinfo -r kmalloc-256'
    when './perf stat -r 33 hackbench 50 process 4000 > /dev/null' is run.

    ** before **
    Cmpxchg_double Looping
    ------------------------
    Locked Cmpxchg Double redos 182685
    Unlocked Cmpxchg Double redos 0

    ** after **
    Cmpxchg_double Looping
    ------------------------
    Locked Cmpxchg Double redos 177995
    Unlocked Cmpxchg Double redos 1

    We can see that the cmpxchg_double_slab() fail rate is slightly improved.

    Below is the output of './perf stat -r 30 hackbench 50 process 4000 > /dev/null'.

    ** before **
    Performance counter stats for './hackbench 50 process 4000' (30 runs):

    108517.190463 task-clock # 7.926 CPUs utilized ( +- 0.24% )
    2,919,550 context-switches # 0.027 M/sec ( +- 3.07% )
    100,774 CPU-migrations # 0.929 K/sec ( +- 4.72% )
    124,201 page-faults # 0.001 M/sec ( +- 0.15% )
    401,500,234,387 cycles # 3.700 GHz ( +- 0.24% )
    stalled-cycles-frontend
    stalled-cycles-backend
    250,576,913,354 instructions # 0.62 insns per cycle ( +- 0.13% )
    45,934,956,860 branches # 423.297 M/sec ( +- 0.14% )
    188,219,787 branch-misses # 0.41% of all branches ( +- 0.56% )

    13.691837307 seconds time elapsed ( +- 0.24% )

    ** after **
    Performance counter stats for './hackbench 50 process 4000' (30 runs):

    107784.479767 task-clock # 7.928 CPUs utilized ( +- 0.22% )
    2,834,781 context-switches # 0.026 M/sec ( +- 2.33% )
    93,083 CPU-migrations # 0.864 K/sec ( +- 3.45% )
    123,967 page-faults # 0.001 M/sec ( +- 0.15% )
    398,781,421,836 cycles # 3.700 GHz ( +- 0.22% )
    stalled-cycles-frontend
    stalled-cycles-backend
    250,189,160,419 instructions # 0.63 insns per cycle ( +- 0.09% )
    45,855,370,128 branches # 425.436 M/sec ( +- 0.10% )
    169,881,248 branch-misses # 0.37% of all branches ( +- 0.43% )

    13.596272341 seconds time elapsed ( +- 0.22% )

    No regression is found; rather, we see a slightly better result.

    Acked-by: Christoph Lameter
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Pekka Enberg

    Joonsoo Kim
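
    The underlying principle can be illustrated with a generic
    compare-and-exchange retry loop (this is not the slub code; it only shows
    why keeping the loop body small lowers the retry/fail rate):

        static void add_delta(atomic_long_t *counter, long delta)
        {
                long old, new;

                do {
                        old = atomic_long_read(counter);
                        new = old + delta;      /* keep only cheap work in here */
                } while (atomic_long_cmpxchg(counter, old, new) != old);

                /* Expensive work (taking list_lock, add_partial/discard_slab
                 * in unfreeze_partials()) is better done outside the loop. */
        }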
     
  • get_freelist() and unfreeze_partials() are only called with interrupts
    disabled, so __cmpxchg_double_slab() is suitable.

    Acked-by: Christoph Lameter
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Pekka Enberg

    Joonsoo Kim
     
  • slab_node() could access current->mempolicy from interrupt context.
    However there's a race condition during exit where the mempolicy
    is first freed and then the pointer zeroed.

    Using this from interrupts seems bogus anyway. The interrupt
    will interrupt a random process and therefore get a random
    mempolicy. Many times this will be the idle task's, which no one can
    change.

    Just disable this here and always use the local policy for slab
    allocations from interrupts. I also cleaned up the callers of
    slab_node(), which always passed the same argument.

    I believe the original mempolicy code did that in fact,
    so it's likely a regression.

    v2: send version with correct logic
    v3: simplify. fix typo.
    Reported-by: Arun Sharma
    Cc: penberg@kernel.org
    Cc: cl@linux.com
    Signed-off-by: Andi Kleen
    [tdmackey@twitter.com: Rework control flow based on feedback from
    cl@linux.com, fix logic, and cleanup current task_struct reference]
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Acked-by: KOSAKI Motohiro
    Signed-off-by: David Mackey
    Signed-off-by: Pekka Enberg

    Andi Kleen
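
    A sketch of the guard described above (the real slab_node()/mempolicy
    interplay is more involved; the function name and final node choice here
    are only placeholders):

        static int slab_node_sketch(void)
        {
                /* Never dereference current->mempolicy from interrupt context:
                 * the interrupted task (often idle) has an arbitrary policy,
                 * and during exit the policy can be freed under us. */
                if (in_interrupt() || !current->mempolicy)
                        return numa_node_id();

                /* Placeholder: the real code consults the task's policy
                 * (MPOL_BIND, MPOL_INTERLEAVE, ...) here. */
                return numa_node_id();
        }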
     

14 Jun, 2012

7 commits

  • The size of the slab object is frequently needed. Since we now
    have a size field directly in the kmem_cache structure, there is no
    longer any need for the obj_size macro/function.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Define a struct that describes common fields used in all slab allocators.
    A slab allocator either uses the common definition (like SLOB) or is
    required to provide members of kmem_cache with the definition given.

    After that it will be possible to share code that
    only operates on those fields of kmem_cache.

    The patch basically takes the slob definition of kmem_cache and
    uses the field names for the other allocators.

    It also standardizes the names used for basic object lengths in
    allocators:

    object_size  The struct size specified at kmem_cache_create(); basically
                 the payload expected to be used by the subsystem.

    size         The size of memory allocated for each object. This size
                 is larger than object_size and includes padding, alignment
                 and extra metadata for each object (e.g. for debugging
                 and rcu).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
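
    A simplified sketch of the shared layout and the two standardized lengths
    (member set trimmed down; the real structs carry allocator-specific
    fields as well):

        struct kmem_cache {
                unsigned int object_size; /* payload size given to kmem_cache_create() */
                unsigned int size;        /* object_size + padding, alignment, metadata */
                unsigned int align;
                unsigned long flags;
                const char *name;
                int refcount;
                void (*ctor)(void *);
                struct list_head list;    /* linkage on the global list of caches */
        };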
     
  • Those are rather trivial now and it's better to see inline what is
    really going on.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Add fields to the page struct so that it is properly documented that
    slab overlays the lru fields.

    This cleans up some casts in slab.

    Reviewed-by: Glauber Costa
    Reviewed-by: Joonsoo Kim
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
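
    The documentation added boils down to making the overlay explicit with a
    union, roughly like this sketch (field names follow the slub usage; the
    real mm_types.h layout has more variants):

        struct page {
                /* ... */
                union {
                        struct list_head lru;   /* page cache / anon LRU usage */
                        struct {                /* slub per-cpu partial pages */
                                struct page *next;
                                int pages;      /* slabs left on the partial list */
                                int pobjects;   /* approximate free objects */
                        };
                };
                /* ... */
        };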
     
  • Those have become so simple that they are no longer needed.

    Reviewed-by: Joonsoo Kim
    Acked-by: David Rientjes
    Signed-off-by: Christoph Lameter

    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Reviewed-by: Joonsoo Kim
    Acked-by: David Rientjes
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Define the fields used by slob in mm_types.h and use struct page instead
    of struct slob_page in slob. This cleans up numerous typecasts in slob.c
    and makes readers aware of slob's use of page struct fields.

    [Also cleans up some bitrot in slob.c. The page struct field layout
    in slob.c is an old layout and does not match the one in mm_types.h]

    Reviewed-by: Glauber Costa
    Acked-by: David Rientjes
    Reviewed-by: Joonsoo Kim
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

04 Jun, 2012

1 commit


03 Jun, 2012

13 commits

  • Linus Torvalds
     
  • Pull device-mapper updates from Alasdair G Kergon:
    "Improve multipath's retrying mechanism in some defined circumstances
    and provide a simple reserve/release mechanism for userspace tools to
    access thin provisioning metadata while the pool is in use."

    * tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
    dm thin: provide userspace access to pool metadata
    dm thin: use slab mempools
    dm mpath: allow ioctls to trigger pg init
    dm mpath: delay retry of bypassed pg
    dm mpath: reduce size of struct multipath

    Linus Torvalds
     
  • This patch implements two new messages that can be sent to the thin
    pool target allowing it to take a snapshot of the _metadata_. This
    read-only snapshot can be accessed by userland, concurrently with the
    live target.

    Only one metadata snapshot can be held at a time. The pool's status
    line will give the block location for the current msnap.

    Since version 0.1.5 of the userland thin provisioning tools, the
    thin_dump program displays the msnap as follows:

    thin_dump -m

    Available here: https://github.com/jthornber/thin-provisioning-tools

    Now that userland can access the metadata we can do various things
    that have traditionally been kernel side tasks:

    i) Incremental backups.

    By using metadata snapshots we can work out what blocks have
    changed over time. Combined with data snapshots we can ensure
    the data doesn't change while we back it up.

    A short proof of concept script can be found here:

    https://github.com/jthornber/thinp-test-suite/blob/master/incremental_backup_example.rb

    ii) Migration of thin devices from one pool to another.

    iii) Merging snapshots back into an external origin.

    iv) Asynchronous replication.

    Signed-off-by: Joe Thornber
    Signed-off-by: Alasdair G Kergon

    Joe Thornber
     
  • Use dedicated caches prefixed with a "dm_" name rather than relying on
    kmalloc mempools backed by generic slab caches so the memory usage of
    thin provisioning (and any leaks) can be accounted for independently.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
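
    The pattern looks roughly like this (cache, pool and constant names are
    illustrative; the actual dm-thin code differs in detail):

        /* a dedicated, named cache ... */
        static struct kmem_cache *dm_thin_new_mapping_cache;

        static int init_mapping_pool(struct pool *pool)
        {
                dm_thin_new_mapping_cache = KMEM_CACHE(dm_thin_new_mapping, 0);
                if (!dm_thin_new_mapping_cache)
                        return -ENOMEM;

                /* ... backs the mempool, instead of a kmalloc-based mempool,
                 * so slabinfo accounts the usage under a "dm_" name */
                pool->mapping_pool =
                        mempool_create_slab_pool(MAPPING_POOL_SIZE,
                                                 dm_thin_new_mapping_cache);
                return pool->mapping_pool ? 0 : -ENOMEM;
        }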
     
  • After the failure of a group of paths, any alternative paths that
    need initialising do not become available until further I/O is sent to
    the device. Until this has happened, ioctls return -EAGAIN.

    With this patch, new paths are made available in response to an ioctl
    too. The processing of the ioctl gets delayed until this has happened.

    Instead of returning an error, we submit a work item to kmultipathd
    (that will potentially activate the new path) and retry in ten
    milliseconds.

    Note that the patch doesn't retry an ioctl if the ioctl itself fails due
    to a path failure. Such retries should be handled intelligently by the
    code that generated the ioctl in the first place, noting that some SCSI
    commands should not be retried because they are not idempotent (XOR write
    commands). For commands that could be retried, there is a danger that
    if the device rejected the SCSI command, the path could be erroneously
    marked as failed, and the request would be retried on another path which
    might fail too. It can be determined if the failure happens on the
    device or on the SCSI controller, but there is no guarantee that all
    SCSI drivers set these flags correctly.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • If I/O needs retrying and only bypassed priority groups are available,
    set the pg_init_delay_retry flag to wait before retrying.

    If, for example, the reason for the bypass is that the controller is
    getting reset or there is a firmware upgrade happening, retrying right
    away would cause a flood of log messages and retries for what could be a
    few seconds or even several minutes.

    Signed-off-by: Mike Christie
    Acked-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Christie
     
  • Move multipath structure's 'lock' and 'queue_size' members to eliminate
    two 4-byte holes. Also use a bit within a single unsigned int for each
    existing flag (saves 8 bytes). This allows future flags to be added
    without each consuming an unsigned int.

    Signed-off-by: Mike Snitzer
    Acked-by: Hannes Reinecke
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
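
    In sketch form (member names are indicative; the real struct has many
    more fields):

        struct multipath {
                struct list_head list;
                struct dm_target *ti;

                spinlock_t lock;        /* placed to avoid a 4-byte alignment hole */

                /* one bit per flag instead of one unsigned int per flag */
                unsigned queue_io:1;
                unsigned queue_if_no_path:1;
                unsigned saved_queue_if_no_path:1;

                unsigned queue_size;
                /* ... */
        };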
     
  • Pull networking updates from David Miller:

    1) Make syn floods consume significantly less resources by

    a) Not pre-COW'ing routing metrics for SYN/ACKs
    b) Mirroring the device queue mapping of the SYN for the SYN/ACK
    reply.

    Both from Eric Dumazet.

    2) Fix calculation errors in Byte Queue Limiting, from Hiroaki SHIMODA.

    3) Validate the length requested when building a paged SKB for a
    socket, so we don't overrun the page vector accidentally. From Jason
    Wang.

    4) When netlabel is disabled, we abort all IP option processing when we
    see a CIPSO option. This isn't the right thing to do, we should
    simply skip over it and continue processing the remaining options
    (if any). Fix from Paul Moore.

    5) SRIOV fixes for the mellanox driver from Jack Morgenstein and Marcel
    Apfelbaum.

    6) 8139cp enables the receiver before the ring address is properly
    programmed, which potentially lets the device crap over random
    memory. Fix from Jason Wang.

    7) e1000/e1000e fixes for i217 RST handling, and an improper buffer
    address reference in jumbo RX frame processing from Bruce Allan and
    Sebastian Andrzej Siewior, respectively.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
    fec_mpc52xx: fix timestamp filtering
    mcs7830: Implement link state detection
    e1000e: fix Rapid Start Technology support for i217
    e1000: look into the page instead of skb->data for e1000_tbi_adjust_stats()
    r8169: call netif_napi_del at errpaths and at driver unload
    tcp: reflect SYN queue_mapping into SYNACK packets
    tcp: do not create inetpeer on SYNACK message
    8139cp/8139too: terminate the eeprom access with the right opmode
    8139cp: set ring address before enabling receiver
    cipso: handle CIPSO options correctly when NetLabel is disabled
    net: sock: validate data_len before allocating skb in sock_alloc_send_pskb()
    bql: Avoid possible inconsistent calculation.
    bql: Avoid unneeded limit decrement.
    bql: Fix POSDIFF() to integer overflow aware.
    net/mlx4_core: Fix obscure mlx4_cmd_box parameter in QUERY_DEV_CAP
    net/mlx4_core: Check port out-of-range before using in mlx4_slave_cap
    net/mlx4_core: Fixes for VF / Guest startup flow
    net/mlx4_en: Fix improper use of "port" parameter in mlx4_en_event
    net/mlx4_core: Fix number of EQs used in ICM initialisation
    net/mlx4_core: Fix the slave_id out-of-range test in mlx4_eq_int

    Linus Torvalds
     
  • Pull straggler x86 fixes from Peter Anvin:
    "Three groups of patches:

    - EFI boot stub documentation and the ability to print error messages;
    - Removal of PTRACE_ARCH_PRCTL for x32 (obsolete interface which
    should never have been ported, and the port is broken and
    potentially dangerous.)
    - ftrace stack corruption fixes. I'm not super-happy about the
    technical implementation, but it is probably the least invasive in
    the short term. In the future I would like a single method for
    nesting the debug stack, however."

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, x32, ptrace: Remove PTRACE_ARCH_PRCTL for x32
    x86, efi: Add EFI boot stub documentation
    x86, efi; Add EFI boot stub console support
    x86, efi: Only close open files in error path
    ftrace/x86: Do not change stacks in DEBUG when calling lockdep
    x86: Allow nesting of the debug stack IDT setting
    x86: Reset the debug_stack update counter
    ftrace: Use breakpoint method to update ftrace caller
    ftrace: Synchronize variable setting with breakpoints

    Linus Torvalds
     
  • This reverts the tty layer change to use per-tty locking, because it's
    not correct yet, and fixing it will require some more deep surgery.

    The main revert is d29f3ef39be4 ("tty_lock: Localise the lock"), but
    there are several smaller commits that built upon it, they also get
    reverted here. The list of reverted commits is:

    fde86d310886 - tty: add lockdep annotations
    8f6576ad476b - tty: fix ldisc lock inversion trace
    d3ca8b64b97e - pty: Fix lock inversion
    b1d679afd766 - tty: drop the pty lock during hangup
    abcefe5fc357 - tty/amiserial: Add missing argument for tty_unlock()
    fd11b42e3598 - cris: fix missing tty arg in wait_event_interruptible_tty call
    d29f3ef39be4 - tty_lock: Localise the lock

    The revert had a trivial conflict in the 68360serial.c staging driver
    that got removed in the meantime.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • skb_defer_rx_timestamp was called with a freshly allocated skb but must
    be called with rskb instead.

    Signed-off-by: Stephan Gatzka
    Cc: stable
    Acked-by: Richard Cochran
    Signed-off-by: David S. Miller

    Stephan Gatzka
     
  • Add .status callback that detects link state changes.
    Tested with MCS7832CV-AA chip (9710:7830, identified as rev.C by the driver).
    Fixes https://bugzilla.kernel.org/show_bug.cgi?id=28532

    Signed-off-by: Ondrej Zary
    Signed-off-by: David S. Miller

    Ondrej Zary
     
  • Pull vfs fix and a fix from the signal changes for frv from Al Viro.

    The __kernel_nlink_t for powerpc got scrogged because 64-bit powerpc
    actually depended on the default "unsigned long", while 32-bit powerpc
    had an explicit override to "unsigned short". Al didn't notice, and
    made both of them unsigned short.

    The frv signal fix is fallout from simplifying the do_notify_resume()
    code, and leaving an extra parenthesis.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    powerpc: Fix size of st_nlink on 64bit

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    frv: Remove bogus closing parenthesis

    Linus Torvalds
     

02 Jun, 2012

4 commits