12 Jul, 2012
1 commit
-
Commit fd3142a59af2012a7c5dc72ec97a4935ff1c5fc6 broke
slob since a piece of a change for a later patch slipped into
it.Fengguang Wu writes:
The commit crashes the kernel w/o any dmesg output (the attached one is
created by the script as a summary for that run). This is very
reproducible in kvm for the attached config.Reported-by: Fengguang Wu
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg
11 Jul, 2012
1 commit
-
kmemcheck_alloc_shadow() requires irqs to be enabled, so wait to disable
them until after its called for __GFP_WAIT allocations.This fixes a warning for such allocations:
WARNING: at kernel/lockdep.c:2739 lockdep_trace_alloc+0x14e/0x1c0()
Acked-by: Fengguang Wu
Acked-by: Steven Rostedt
Tested-by: Fengguang Wu
Signed-off-by: David Rientjes
Signed-off-by: Pekka Enberg
09 Jul, 2012
5 commits
-
Move the mutex handling into the common kmem_cache_create()
function.Then we can also move more checks out of SLAB's kmem_cache_create()
into the common code.Reviewed-by: Glauber Costa
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg -
Use the mutex definition from SLAB and make it the common way to take a sleeping lock.
This has the effect of using a mutex instead of a rw semaphore for SLUB.
SLOB gains the use of a mutex for kmem_cache_create serialization.
Not needed now but SLOB may acquire some more features later (like slabinfo
/ sysfs support) through the expansion of the common code that will
need this.Reviewed-by: Glauber Costa
Reviewed-by: Joonsoo Kim
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg -
All allocators have some sort of support for the bootstrap status.
Setup a common definition for the boot states and make all slab
allocators use that definition.Reviewed-by: Glauber Costa
Reviewed-by: Joonsoo Kim
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg -
Kmem_cache_create() does a variety of sanity checks but those
vary depending on the allocator. Use the strictest tests and put them into
a slab_common file. Make the tests conditional on CONFIG_DEBUG_VM.This patch has the effect of adding sanity checks for SLUB and SLOB
under CONFIG_DEBUG_VM and removes the checks in SLAB for !CONFIG_DEBUG_VM.Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg -
If list_for_each_entry, etc complete a traversal of the list, the iterator
variable ends up pointing to an address at an offset from the list head,
and not a meaningful structure. Thus this value should not be used after
the end of the iterator. The patch replaces s->name by al->name, which is
referenced nearby.This problem was found using Coccinelle (http://coccinelle.lip6.fr/).
Signed-off-by: Julia Lawall
Signed-off-by: Pekka Enberg
03 Jul, 2012
1 commit
-
In function slab_stats(), if total_free is equal zero, it will error.
So fix it.Acked-by: Christoph Lameter
Signed-off-by: majianpeng
Signed-off-by: Pekka Enberg
02 Jul, 2012
4 commits
-
During kmem_cache_init_late(), we transition to the LATE state,
and after some more work, to the FULL state, its last stateThis is quite different from slub, that will only transition to
its last state (previously SYSFS), in a (late)initcall, after a lot
more of the kernel is ready.This means that in slab, we have no way to taking actions dependent
on the initialization of other pieces of the kernel that are supposed
to start way after kmem_init_late(), such as cgroups initialization.To achieve more consistency in this behavior, that patch only
transitions to the UP state in kmem_init_late. In my analysis,
setup_cpu_cache() should be happy to test for >= UP, instead of
== FULL. It also has passed some tests I've made.We then only mark FULL state after the reap timers are in place,
meaning that no further setup is expected.Signed-off-by: Glauber Costa
Acked-by: Christoph Lameter
Acked-by: David Rientjes
Signed-off-by: Pekka Enberg -
Commit 8c138b only sits in Pekka's and linux-next tree now, which tries
to replace obj_size(cachep) with cachep->object_size, but has a typo in
kmem_cache_free() by using "size" instead of "object_size", which casues
some regressions.Reported-and-tested-by: Fengguang Wu
Signed-off-by: Feng Tang
Cc: Christoph Lameter
Acked-by: Glauber Costa
Signed-off-by: Pekka Enberg -
Commit 3b0efdf ("mm, sl[aou]b: Extract common fields from struct
kmem_cache") renamed the kmem_cache structure's "next" field to "list"
but forgot to update one instance in leaks_show().Signed-off-by: Thierry Reding
Signed-off-by: Pekka Enberg -
A consistent name with slub saves us an acessor function.
In both caches, this field represents the same thing. We would
like to use it from the mem_cgroup code.Signed-off-by: Glauber Costa
Acked-by: Christoph Lameter
CC: Pekka Enberg
Signed-off-by: Pekka Enberg
20 Jun, 2012
3 commits
-
Current implementation of unfreeze_partials() is so complicated,
but benefit from it is insignificant. In addition many code in
do {} while loop have a bad influence to a fail rate of cmpxchg_double_slab.
Under current implementation which test status of cpu partial slab
and acquire list_lock in do {} while loop,
we don't need to acquire a list_lock and gain a little benefit
when front of the cpu partial slab is to be discarded, but this is a rare case.
In case that add_partial is performed and cmpxchg_double_slab is failed,
remove_partial should be called case by case.I think that these are disadvantages of current implementation,
so I do refactoring unfreeze_partials().Minimizing code in do {} while loop introduce a reduced fail rate
of cmpxchg_double_slab. Below is output of 'slabinfo -r kmalloc-256'
when './perf stat -r 33 hackbench 50 process 4000 > /dev/null' is done.** before **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos 182685
Unlocked Cmpxchg Double redos 0** after **
Cmpxchg_double Looping
------------------------
Locked Cmpxchg Double redos 177995
Unlocked Cmpxchg Double redos 1We can see cmpxchg_double_slab fail rate is improved slightly.
Bolow is output of './perf stat -r 30 hackbench 50 process 4000 > /dev/null'.
** before **
Performance counter stats for './hackbench 50 process 4000' (30 runs):108517.190463 task-clock # 7.926 CPUs utilized ( +- 0.24% )
2,919,550 context-switches # 0.027 M/sec ( +- 3.07% )
100,774 CPU-migrations # 0.929 K/sec ( +- 4.72% )
124,201 page-faults # 0.001 M/sec ( +- 0.15% )
401,500,234,387 cycles # 3.700 GHz ( +- 0.24% )
stalled-cycles-frontend
stalled-cycles-backend
250,576,913,354 instructions # 0.62 insns per cycle ( +- 0.13% )
45,934,956,860 branches # 423.297 M/sec ( +- 0.14% )
188,219,787 branch-misses # 0.41% of all branches ( +- 0.56% )13.691837307 seconds time elapsed ( +- 0.24% )
** after **
Performance counter stats for './hackbench 50 process 4000' (30 runs):107784.479767 task-clock # 7.928 CPUs utilized ( +- 0.22% )
2,834,781 context-switches # 0.026 M/sec ( +- 2.33% )
93,083 CPU-migrations # 0.864 K/sec ( +- 3.45% )
123,967 page-faults # 0.001 M/sec ( +- 0.15% )
398,781,421,836 cycles # 3.700 GHz ( +- 0.22% )
stalled-cycles-frontend
stalled-cycles-backend
250,189,160,419 instructions # 0.63 insns per cycle ( +- 0.09% )
45,855,370,128 branches # 425.436 M/sec ( +- 0.10% )
169,881,248 branch-misses # 0.37% of all branches ( +- 0.43% )13.596272341 seconds time elapsed ( +- 0.22% )
No regression is found, but rather we can see slightly better result.
Acked-by: Christoph Lameter
Signed-off-by: Joonsoo Kim
Signed-off-by: Pekka Enberg -
get_freelist(), unfreeze_partials() are only called with interrupt disabled,
so __cmpxchg_double_slab() is suitable.Acked-by: Christoph Lameter
Signed-off-by: Joonsoo Kim
Signed-off-by: Pekka Enberg -
slab_node() could access current->mempolicy from interrupt context.
However there's a race condition during exit where the mempolicy
is first freed and then the pointer zeroed.Using this from interrupts seems bogus anyways. The interrupt
will interrupt a random process and therefore get a random
mempolicy. Many times, this will be idle's, which noone can change.Just disable this here and always use local for slab
from interrupts. I also cleaned up the callers of slab_node a bit
which always passed the same argument.I believe the original mempolicy code did that in fact,
so it's likely a regression.v2: send version with correct logic
v3: simplify. fix typo.
Reported-by: Arun Sharma
Cc: penberg@kernel.org
Cc: cl@linux.com
Signed-off-by: Andi Kleen
[tdmackey@twitter.com: Rework control flow based on feedback from
cl@linux.com, fix logic, and cleanup current task_struct reference]
Acked-by: David Rientjes
Acked-by: Christoph Lameter
Acked-by: KOSAKI Motohiro
Signed-off-by: David Mackey
Signed-off-by: Pekka Enberg
14 Jun, 2012
7 commits
-
The size of the slab object is frequently needed. Since we now
have a size field directly in the kmem_cache structure there is no
need anymore of the obj_size macro/function.Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg -
Define a struct that describes common fields used in all slab allocators.
A slab allocator either uses the common definition (like SLOB) or is
required to provide members of kmem_cache with the definition given.After that it will be possible to share code that
only operates on those fields of kmem_cache.The patch basically takes the slob definition of kmem cache and
uses the field namees for the other allocators.It also standardizes the names used for basic object lengths in
allocators:object_size Struct size specified at kmem_cache_create. Basically
the payload expected to be used by the subsystem.size The size of memory allocator for each object. This size
is larger than object_size and includes padding, alignment
and extra metadata for each object (f.e. for debugging
and rcu).Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg -
Those are rather trivial now and its better to see inline what is
really going on.Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg -
Add fields to the page struct so that it is properly documented that
slab overlays the lru fields.This cleans up some casts in slab.
Reviewed-by: Glauber Costa
Reviewed-by: Joonsoo Kim
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg -
Those have become so simple that they are no longer needed.
Reviewed-by: Joonsoo Kim
Acked-by: David Rientjes
signed-off-by: Christoph LameterSigned-off-by: Pekka Enberg
-
Reviewed-by: Joonsoo Kim
Acked-by: David Rientjes
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg -
Define the fields used by slob in mm_types.h and use struct page instead
of struct slob_page in slob. This cleans up numerous of typecasts in slob.c and
makes readers aware of slob's use of page struct fields.[Also cleans up some bitrot in slob.c. The page struct field layout
in slob.c is an old layout and does not match the one in mm_types.h]Reviewed-by: Glauber Costa
Acked-by: David Rientjes
Reviewed-by: Joonsoo Kim
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg
04 Jun, 2012
1 commit
-
* Fix a merge conflict in mm/slub.c::acquire_slab() due to commit 02d7633
("slub: fix a memory leak in get_partial_node()").Conflicts:
mm/slub.cSigned-off-by: Pekka Enberg
03 Jun, 2012
13 commits
-
Pull device-mapper updates from Alasdair G Kergon:
"Improve multipath's retrying mechanism in some defined circumstances
and provide a simple reserve/release mechanism for userspace tools to
access thin provisioning metadata while the pool is in use."* tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
dm thin: provide userspace access to pool metadata
dm thin: use slab mempools
dm mpath: allow ioctls to trigger pg init
dm mpath: delay retry of bypassed pg
dm mpath: reduce size of struct multipath -
This patch implements two new messages that can be sent to the thin
pool target allowing it to take a snapshot of the _metadata_. This,
read-only snapshot can be accessed by userland, concurrently with the
live target.Only one metadata snapshot can be held at a time. The pool's status
line will give the block location for the current msnap.Since version 0.1.5 of the userland thin provisioning tools, the
thin_dump program displays the msnap as follows:thin_dump -m
Available here: https://github.com/jthornber/thin-provisioning-tools
Now that userland can access the metadata we can do various things
that have traditionally been kernel side tasks:i) Incremental backups.
By using metadata snapshots we can work out what blocks have
changed over time. Combined with data snapshots we can ensure
the data doesn't change while we back it up.A short proof of concept script can be found here:
https://github.com/jthornber/thinp-test-suite/blob/master/incremental_backup_example.rb
ii) Migration of thin devices from one pool to another.
iii) Merging snapshots back into an external origin.
iv) Asyncronous replication.
Signed-off-by: Joe Thornber
Signed-off-by: Alasdair G Kergon -
Use dedicated caches prefixed with a "dm_" name rather than relying on
kmalloc mempools backed by generic slab caches so the memory usage of
thin provisioning (and any leaks) can be accounted for independently.Signed-off-by: Mike Snitzer
Signed-off-by: Alasdair G Kergon -
After the failure of a group of paths, any alternative paths that
need initialising do not become available until further I/O is sent to
the device. Until this has happened, ioctls return -EAGAIN.With this patch, new paths are made available in response to an ioctl
too. The processing of the ioctl gets delayed until this has happened.Instead of returning an error, we submit a work item to kmultipathd
(that will potentially activate the new path) and retry in ten
milliseconds.Note that the patch doesn't retry an ioctl if the ioctl itself fails due
to a path failure. Such retries should be handled intelligently by the
code that generated the ioctl in the first place, noting that some SCSI
commands should not be retried because they are not idempotent (XOR write
commands). For commands that could be retried, there is a danger that
if the device rejected the SCSI command, the path could be errorneously
marked as failed, and the request would be retried on another path which
might fail too. It can be determined if the failure happens on the
device or on the SCSI controller, but there is no guarantee that all
SCSI drivers set these flags correctly.Signed-off-by: Mikulas Patocka
Signed-off-by: Alasdair G Kergon -
If I/O needs retrying and only bypassed priority groups are available,
set the pg_init_delay_retry flag to wait before retrying.If, for example, the reason for the bypass is that the controller is
getting reset or there is a firmware upgrade happening, retrying right
away would cause a flood of log messages and retries for what could be a
few seconds or even several minutes.Signed-off-by: Mike Christie
Acked-by: Mike Snitzer
Signed-off-by: Alasdair G Kergon -
Move multipath structure's 'lock' and 'queue_size' members to eliminate
two 4-byte holes. Also use a bit within a single unsigned int for each
existing flag (saves 8-bytes). This allows future flags to be added
without each consuming an unsigned int.Signed-off-by: Mike Snitzer
Acked-by: Hannes Reinecke
Signed-off-by: Alasdair G Kergon -
Pull networking updates from David Miller:
1) Make syn floods consume significantly less resources by
a) Not pre-COW'ing routing metrics for SYN/ACKs
b) Mirroring the device queue mapping of the SYN for the SYN/ACK
reply.Both from Eric Dumazet.
2) Fix calculation errors in Byte Queue Limiting, from Hiroaki SHIMODA.
3) Validate the length requested when building a paged SKB for a
socket, so we don't overrun the page vector accidently. From Jason
Wang.4) When netlabel is disabled, we abort all IP option processing when we
see a CIPSO option. This isn't the right thing to do, we should
simply skip over it and continue processing the remaining options
(if any). Fix from Paul Moore.5) SRIOV fixes for the mellanox driver from Jack orgenstein and Marcel
Apfelbaum.6) 8139cp enables the receiver before the ring address is properly
programmed, which potentially lets the device crap over random
memory. Fix from Jason Wang.7) e1000/e1000e fixes for i217 RST handling, and an improper buffer
address reference in jumbo RX frame processing from Bruce Allan and
Sebastian Andrzej Siewior, respectively.* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
fec_mpc52xx: fix timestamp filtering
mcs7830: Implement link state detection
e1000e: fix Rapid Start Technology support for i217
e1000: look into the page instead of skb->data for e1000_tbi_adjust_stats()
r8169: call netif_napi_del at errpaths and at driver unload
tcp: reflect SYN queue_mapping into SYNACK packets
tcp: do not create inetpeer on SYNACK message
8139cp/8139too: terminate the eeprom access with the right opmode
8139cp: set ring address before enabling receiver
cipso: handle CIPSO options correctly when NetLabel is disabled
net: sock: validate data_len before allocating skb in sock_alloc_send_pskb()
bql: Avoid possible inconsistent calculation.
bql: Avoid unneeded limit decrement.
bql: Fix POSDIFF() to integer overflow aware.
net/mlx4_core: Fix obscure mlx4_cmd_box parameter in QUERY_DEV_CAP
net/mlx4_core: Check port out-of-range before using in mlx4_slave_cap
net/mlx4_core: Fixes for VF / Guest startup flow
net/mlx4_en: Fix improper use of "port" parameter in mlx4_en_event
net/mlx4_core: Fix number of EQs used in ICM initialisation
net/mlx4_core: Fix the slave_id out-of-range test in mlx4_eq_int -
Pull straggler x86 fixes from Peter Anvin:
"Three groups of patches:- EFI boot stub documentation and the ability to print error messages;
- Removal for PTRACE_ARCH_PRCTL for x32 (obsolete interface which
should never have been ported, and the port is broken and
potentially dangerous.)
- ftrace stack corruption fixes. I'm not super-happy about the
technical implementation, but it is probably the least invasive in
the short term. In the future I would like a single method for
nesting the debug stack, however."* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, x32, ptrace: Remove PTRACE_ARCH_PRCTL for x32
x86, efi: Add EFI boot stub documentation
x86, efi; Add EFI boot stub console support
x86, efi: Only close open files in error path
ftrace/x86: Do not change stacks in DEBUG when calling lockdep
x86: Allow nesting of the debug stack IDT setting
x86: Reset the debug_stack update counter
ftrace: Use breakpoint method to update ftrace caller
ftrace: Synchronize variable setting with breakpoints -
This reverts the tty layer change to use per-tty locking, because it's
not correct yet, and fixing it will require some more deep surgery.The main revert is d29f3ef39be4 ("tty_lock: Localise the lock"), but
there are several smaller commits that built upon it, they also get
reverted here. The list of reverted commits is:fde86d310886 - tty: add lockdep annotations
8f6576ad476b - tty: fix ldisc lock inversion trace
d3ca8b64b97e - pty: Fix lock inversion
b1d679afd766 - tty: drop the pty lock during hangup
abcefe5fc357 - tty/amiserial: Add missing argument for tty_unlock()
fd11b42e3598 - cris: fix missing tty arg in wait_event_interruptible_tty call
d29f3ef39be4 - tty_lock: Localise the lockThe revert had a trivial conflict in the 68360serial.c staging driver
that got removed in the meantime.Signed-off-by: Linus Torvalds
-
skb_defer_rx_timestamp was called with a freshly allocated skb but must
be called with rskb instead.Signed-off-by: Stephan Gatzka
Cc: stable
Acked-by: Richard Cochran
Signed-off-by: David S. Miller -
Add .status callback that detects link state changes.
Tested with MCS7832CV-AA chip (9710:7830, identified as rev.C by the driver).
Fixes https://bugzilla.kernel.org/show_bug.cgi?id=28532Signed-off-by: Ondrej Zary
Signed-off-by: David S. Miller -
Pull vfs fix and a fix from the signal changes for frv from Al Viro.
The __kernel_nlink_t for powerpc got scrogged because 64-bit powerpc
actually depended on the default "unsigned long", while 32-bit powerpc
had an explicit override to "unsigned short". Al didn't notice, and
made both of them be the unsigned short.The frv signal fix is fallout from simplifying the do_notify_resume()
code, and leaving an extra parenthesis.* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
powerpc: Fix size of st_nlink on 64bit* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
frv: Remove bogus closing parenthesis
02 Jun, 2012
4 commits
-
commit e57f93cc53b7 (powerpc: get rid of nlink_t uses, switch to
explicitly-sized type) changed the size of st_nlink on ppc64 from
a long to a short, resulting in boot failures.Signed-off-by: Anton Blanchard
Signed-off-by: Al Viro -
Introduced by commit 6fd84c0831ec78d98736b76dc5e9b849f1dbfc9e
("TIF_RESTORE_SIGMASK can be set only when TIF_SIGPENDING is set")Signed-off-by: Geert Uytterhoeven
Signed-off-by: Al Viro -
The definition of I217_PROXY_CTRL must use the BM_PHY_REG() macro instead
of the PHY_REG() macro for PHY page 800 register 70 since it is for a PHY
register greater than the maximum allowed by the latter macro, and fix a
typo setting the I217_MEMPWR register in e1000_suspend_workarounds_ich8lan.Also for clarity, rename a few defines as bit definitions instead of masks.
Signed-off-by: Bruce Allan
Tested-by: Aaron Brown
Signed-off-by: Jeff Kirsher -
This is another fixup where the data is not transfered into buffer
addressed by skb->data but into a page.Signed-off-by: Sebastian Andrzej Siewior
Tested-by: Aaron Brown
Signed-off-by: Jeff Kirsher