24 Jun, 2005

23 commits

  • Add a new `suid_dumpable' sysctl:

    This value can be used to query and set the core dump mode for setuid
    or otherwise protected/tainted binaries. The modes are

    0 - (default) - traditional behaviour. Any process which has changed
    privilege levels or is execute only will not be dumped

    1 - (debug) - all processes dump core when possible. The core dump is
    owned by the current user and no security is applied. This is intended
    for system debugging situations only. Ptrace is unchecked.

    2 - (suidsafe) - any binary which normally would not be dumped is dumped
    readable by root only. This allows the end user to remove such a dump but
    not access it directly. For security reasons core dumps in this mode will
    not overwrite one another or other files. This mode is appropriate when
    administrators are attempting to debug problems in a normal environment.
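
    As an illustration (not part of the patch itself), an administrator could
    switch to the suidsafe mode at runtime by writing to the new sysctl; this
    sketch assumes the usual /proc/sys/fs/suid_dumpable path:

    #include <stdio.h>

    /* minimal sketch: select mode 2 (suidsafe) via the new sysctl */
    int main(void)
    {
            FILE *f = fopen("/proc/sys/fs/suid_dumpable", "w");

            if (!f)
                    return 1;
            fprintf(f, "2\n");
            return fclose(f) ? 1 : 0;
    }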

    (akpm:

    > > +EXPORT_SYMBOL(suid_dumpable);
    >
    > EXPORT_SYMBOL_GPL?

    No problem to me.

    > > if (current->euid == current->uid && current->egid == current->gid)
    > > current->mm->dumpable = 1;
    >
    > Should this be SUID_DUMP_USER?

    Actually the feedback I had from last time was that the SUID_ defines
    should go because it's clearer to follow the numbers. They can go
    everywhere (and there are lots of places where dumpable is tested/used
    as a bool in untouched code)

    > Maybe this should be renamed to `dump_policy' or something. Doing that
    > would help us catch any code which isn't using the #defines, too.

    Fair comment. The patch was designed to be easy to maintain for Red Hat
    rather than for merging. Changing that field would create a gigantic
    diff because it is used all over the place.

    )

    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • In situations where a kprobes handler calls a routine which has a probe on it,
    kprobes_handler() disarms the new probe forever. This patch removes that
    limitation by temporarily disarming the new probe. When another probe hits
    while handling the old probe, kprobes_handler() saves the previous kprobes
    state and handles the new probe without calling the new probe's registered
    handlers. kprobe_post_handler() then restores the previous kprobes state and
    normal execution continues.

    However, on the x86_64 architecture, re-entrancy is provided only through
    pre_handler(). If a routine with a probe on it is referenced through
    post_handler(), then the probes on that routine are disarmed forever, since
    the exception stack gets changed after the processor single-steps the
    instruction of the new probe.

    This patch includes generic changes to support temporary disarming on
    reentrancy of probes.

    Signed-off-by: Prasanna S Panchamukhi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prasanna S Panchamukhi
     
  • This patch moves the lock/unlock of the arch specific kprobe_flush_task()
    to the non-arch specific kprobe_flush_task().

    Signed-off-by: Hien Nguyen
    Acked-by: Prasanna S Panchamukhi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hien Nguyen
     
  • The architecture independent code of the current kprobes implementation is
    arming and disarming kprobes at registration time. The problem is that the
    code is assuming that arming and disarming is just done by a simple write
    of some magic value to an address. This is problematic for ia64, where our
    instructions look more like structures, and we cannot insert breakpoints
    by just doing something like:

    *p->addr = BREAKPOINT_INSTRUCTION;

    The following patch to 2.6.12-rc4-mm2 adds two new architecture dependent
    functions:

    * void arch_arm_kprobe(struct kprobe *p)
    * void arch_disarm_kprobe(struct kprobe *p)

    and then adds the new functions for each of the architectures that already
    implement kprobes (sparc64/ppc64/i386/x86_64).
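
    For the "simple" architectures the new hooks amount to little more than the
    old inline write; a rough sketch of what the i386 version looks like (the
    icache flush details here are illustrative, not quoted from the patch):

    void arch_arm_kprobe(struct kprobe *p)
    {
            *p->addr = BREAKPOINT_INSTRUCTION;
            flush_icache_range((unsigned long) p->addr,
                               (unsigned long) p->addr + sizeof(kprobe_opcode_t));
    }

    void arch_disarm_kprobe(struct kprobe *p)
    {
            *p->addr = p->opcode;   /* restore the saved original instruction */
            flush_icache_range((unsigned long) p->addr,
                               (unsigned long) p->addr + sizeof(kprobe_opcode_t));
    }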

    I thought arch_[dis]arm_kprobe was the most descriptive of what was really
    happening, but each of the architectures already had a disarm_kprobe()
    function that was really a "disarm and do some other clean-up items as
    needed when you stumble across a recursive kprobe." So... I took the
    liberty of changing the code that was calling disarm_kprobe() to call
    arch_disarm_kprobe(), and then do the cleanup in the block of code dealing
    with the recursive kprobe case.

    So far this patch has been tested on i386, x86_64, and ppc64, but it still
    needs to be tested on sparc64.

    Signed-off-by: Rusty Lynch
    Signed-off-by: Anil S Keshavamurthy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Lynch
     
  • This patch adds function-return probes to kprobes for the i386
    architecture. This enables you to establish a handler to be run when a
    function returns.

    1. API

    Two new functions are added to kprobes:

    int register_kretprobe(struct kretprobe *rp);
    void unregister_kretprobe(struct kretprobe *rp);

    2. Registration and unregistration

    2.1 Register

    To register a function-return probe, the user populates the following
    fields in a kretprobe object and calls register_kretprobe() with the
    kretprobe address as an argument:

    kp.addr - the function's address

    handler - this function is run after the ret instruction executes, but
    before control returns to the return address in the caller.

    maxactive - The maximum number of instances of the probed function that
    can be active concurrently. For example, if the function is non-
    recursive and is called with a spinlock or mutex held, maxactive = 1
    should be enough. If the function is non-recursive and can never
    relinquish the CPU (e.g., via a semaphore or preemption), NR_CPUS should
    be enough. maxactive is used to determine how many kretprobe_instance
    objects to allocate for this particular probed function. If maxactive <= 0,
    it is set to a default value.
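
    A minimal sketch of a module using the new interface (the probed address,
    the register read in the handler, and the printk are illustrative only and
    not taken from the patch):

    #include <linux/kernel.h>
    #include <linux/module.h>
    #include <linux/kprobes.h>

    static unsigned long addr;      /* address of the function to probe,
                                     * e.g. looked up in System.map */
    module_param(addr, ulong, 0);

    static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
    {
            /* on i386 the return value is in eax */
            printk(KERN_INFO "probed function returned %ld\n", regs->eax);
            return 0;
    }

    static struct kretprobe my_rp = {
            .handler   = ret_handler,
            .maxactive = NR_CPUS,   /* non-recursive, may relinquish the CPU */
    };

    static int __init rp_init(void)
    {
            my_rp.kp.addr = (kprobe_opcode_t *) addr;
            return register_kretprobe(&my_rp);
    }

    static void __exit rp_exit(void)
    {
            unregister_kretprobe(&my_rp);
    }

    module_init(rp_init);
    module_exit(rp_exit);
    MODULE_LICENSE("GPL");
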
    Signed-off-by: Prasanna S Panchamukhi
    Signed-off-by: Frederik Deweerdt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hien Nguyen
     
  • Move some code duplicated in both callers into vfs_quota_on_mount

    Signed-off-by: Christoph Hellwig
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Patch to add a check to get_chrdev_list and get_blkdev_list to prevent reads
    of /proc/devices from spilling over the provided page if more than 4096
    bytes of string data are generated from all the registered character and
    block devices in a system.

    Signed-off-by: Neil Horman
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • Looks like locking can be optimised quite a lot. Increase lock widths
    slightly so lo_lock is taken fewer times per request. Also it was quite
    trivial to cover lo_pending with that lock, and remove the atomic
    requirement. This also makes memory ordering explicitly correct, which is
    nice (not that I particularly saw any mem ordering bugs).

    Test was reading 4 250MB files in parallel on ext2-on-tmpfs filesystem (1K
    block size, 4K page size). System is a 2-socket Xeon with HT (4 threads).

    intel:/home/npiggin# umount /dev/loop0 ; mount /dev/loop0 /mnt/loop ; /usr/bin/time ./mtloop.sh

    Before:
    0.24user 5.51system 0:02.84elapsed 202%CPU (0avgtext+0avgdata 0maxresident)k
    0.19user 5.52system 0:02.88elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k
    0.19user 5.57system 0:02.89elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k
    0.22user 5.51system 0:02.90elapsed 197%CPU (0avgtext+0avgdata 0maxresident)k
    0.19user 5.44system 0:02.91elapsed 193%CPU (0avgtext+0avgdata 0maxresident)k

    After:
    0.07user 2.34system 0:01.68elapsed 143%CPU (0avgtext+0avgdata 0maxresident)k
    0.06user 2.37system 0:01.68elapsed 144%CPU (0avgtext+0avgdata 0maxresident)k
    0.06user 2.39system 0:01.68elapsed 145%CPU (0avgtext+0avgdata 0maxresident)k
    0.06user 2.36system 0:01.68elapsed 144%CPU (0avgtext+0avgdata 0maxresident)k
    0.06user 2.42system 0:01.68elapsed 147%CPU (0avgtext+0avgdata 0maxresident)k

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • This patch creates a new kstrdup library function and changes the "local"
    implementations in several places to use this function.

    Most of the changes come from the sound and net subsystems. The sound part
    had already been acknowledged by Takashi Iwai and the net part by David S.
    Miller.

    I left UML alone for now because I would need more time to read the code
    carefully before making changes there.
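
    The conversion is mechanical; an illustrative before/after (variable names
    here are made up, not taken from the patch):

    /* before: open-coded duplication */
    new_name = kmalloc(strlen(name) + 1, GFP_KERNEL);
    if (new_name)
            strcpy(new_name, name);

    /* after: the new library helper */
    new_name = kstrdup(name, GFP_KERNEL);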

    Signed-off-by: Paulo Marques
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paulo Marques
     
  • Based on analysis and a patch from Russ Weight

    There is a race condition that can occur if an inode is allocated and then
    released (using iput) during the ->fill_super functions. The race
    condition is between kswapd and mount.

    For most filesystems this can only happen in an error path when kswapd is
    running concurrently. For isofs, however, the error can occur in a more
    common code path (which is how the bug was found).

    The logic here is "we want final iput() to free inode *now* instead of
    letting it sit in cache if fs is going down or had not quite come up". The
    problem is with kswapd seeing such inodes in the middle of being killed and
    happily taking over.

    The clean solution would be to tell kswapd to leave those inodes alone and
    let our final iput deal with them. I.e. add a new flag
    (I_FORCED_FREEING), set it before write_inode_now() there and make
    prune_icache() leave those alone.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Viro
     
  • This patch splits del_timer_sync() into 2 functions. The new one,
    try_to_del_timer_sync(), returns -1 when it hits an executing timer.

    It can be used in interrupt context, or when the caller holds locks which
    can prevent completion of the timer's handler.

    NOTE. Currently it can't be used in interrupt context in UP case, because
    ->running_timer is used only with CONFIG_SMP.

    Should the need arise, it is possible to kill the #ifdef CONFIG_SMP in
    set_running_timer(); it is cheap.
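
    A sketch of the intended calling pattern (the timer and lock names are made
    up for illustration):

    /* delete a timer whose handler may need a lock we are holding */
    spin_lock_irqsave(&my_lock, flags);
    while (try_to_del_timer_sync(&my_timer) < 0) {
            /* -1: the handler is running; back off so it can finish */
            spin_unlock_irqrestore(&my_lock, flags);
            cpu_relax();
            spin_lock_irqsave(&my_lock, flags);
    }
    spin_unlock_irqrestore(&my_lock, flags);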

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This patch tries to solve following problems:

    1. del_timer_sync() is racy. The timer can be fired again after
    del_timer_sync() has checked all CPUs and before it rechecks
    timer_pending().

    2. It has scalability problems. All CPUs are scanned to determine
    whether the timer is running on that CPU.

    With this patch del_timer_sync is O(1) and no slower than plain
    del_timer(pending_timer), unless it has to actually wait for
    completion of the currently running timer.

    The only restriction is that the recurring timer should not use
    add_timer_on().

    3. Timers are not serialized with respect to themselves.

    If CPU_0 does mod_timer(jiffies+1) while the timer is currently
    running on CPU_1, it is quite possible that the local timer interrupt
    on CPU_0 will start that timer before it has finished on CPU_1.

    4. The timer locking is suboptimal. __mod_timer() takes 3 locks
    at once and still requires wmb() in del_timer/run_timers.

    The new implementation takes 2 locks sequentially and does not
    need memory barriers.

    Currently ->base != NULL means that the timer is pending. In that case
    ->base.lock is used to lock the timer. __mod_timer also takes timer->lock
    because ->base can be == NULL.

    This patch uses timer->entry.next != NULL as indication that the timer is
    pending. So it does __list_del(), entry->next = NULL instead of list_del()
    when the timer is deleted.

    The ->base field is used for hashed locking only, it is initialized
    in init_timer() which sets ->base = per_cpu(tvec_bases). When the
    tvec_bases.lock is locked, it means that all timers which are tied
    to this base via timer->base are locked, and the base itself is locked
    too.

    So __run_timers/migrate_timers can safely modify all timers which could
    be found on ->tvX lists (pending timers).

    When the timer's base is locked, and the timer removed from ->entry list
    (which means that _run_timers/migrate_timers can't see this timer), it is
    possible to set timer->base = NULL and drop the lock: the timer remains
    locked.

    This patch adds lock_timer_base() helper, which waits for ->base != NULL,
    locks the ->base, and checks it is still the same.
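
    Roughly (a sketch; the exact code in the patch may differ slightly):

    static struct timer_base_s *lock_timer_base(struct timer_list *timer,
                                                unsigned long *flags)
    {
            struct timer_base_s *base;

            for (;;) {
                    base = timer->base;
                    if (base != NULL) {
                            spin_lock_irqsave(&base->lock, *flags);
                            if (base == timer->base)
                                    return base;    /* base didn't change, done */
                            /* the timer migrated under us; retry */
                            spin_unlock_irqrestore(&base->lock, *flags);
                    }
                    cpu_relax();
            }
    }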

    __mod_timer() schedules the timer on the local CPU and changes its base.
    However, it does not lock both old and new bases at once. It locks the
    timer via lock_timer_base(), deletes the timer, sets ->base = NULL, and
    unlocks old base. Then __mod_timer() locks new_base, sets ->base = new_base,
    and adds this timer. This simplifies the code, because AB-BA deadlock is not
    possible. __mod_timer() also ensures that the timer's base is not changed
    while the timer's handler is running on the old base.

    __run_timers() and del_timer() do not change ->base anymore; they only clear
    the pending flag.

    So del_timer_sync() can test timer->base->running_timer == timer to detect
    whether it is running or not.

    We don't need timer_list->lock anymore, this patch kills it.

    We also don't need barriers. del_timer() and __run_timers() used smp_wmb()
    before clearing timer's pending flag. It was needed because __mod_timer()
    did not lock old_base if the timer is not pending, so __mod_timer()->list_add()
    could race with del_timer()->list_del(). With this patch these functions are
    serialized through base->lock.

    One problem: TIMER_INITIALIZER can't use per_cpu(tvec_bases). So this patch
    adds a global

    struct timer_base_s {
            spinlock_t lock;
            struct timer_list *running_timer;
    } __init_timer_base;

    which is used by TIMER_INITIALIZER. The corresponding fields in the
    tvec_t_base_s struct are replaced by a struct timer_base_s t_base member.

    It is indeed ugly. But this can't have scalability problems. The global
    __init_timer_base.lock is used only when __mod_timer() is called for the first
    time AND the timer was compile time initialized. After that the timer migrates
    to the local CPU.

    Signed-off-by: Oleg Nesterov
    Acked-by: Ingo Molnar
    Signed-off-by: Renaud Lienhart
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Replace BLK_TAGS_PER_LONG with BITS_PER_LONG and remove unused BLK_TAGS_MASK.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • blk_queue_tag->real_max_depth was used to optimize out unnecessary
    allocations/frees on tag resize. However, the whole thing was very broken:
    tag_map was never allocated to real_max_depth, resulting in access beyond the
    end of the map, and bits in [max_depth..real_max_depth] were set when
    initializing a map and copied when resizing, resulting in pre-occupied tags.

    As the gain of the optimization is very small, well, almost nil, remove the
    whole thing.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Patch to allocate the control structures for ide devices on the node of
    the device itself (for NUMA systems). The patch depends on the Slab API
    change patch by Manfred and me (in mm) and the pcidev_to_node patch that I
    posted today.

    Does some realignment too.

    Signed-off-by: Justin M. Forbes
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pravin Shelar
    Signed-off-by: Shobhit Dayal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make sparse's initialization accessible at runtime. This allows sparse
    mappings to be created after boot in a hotplug situation.

    This patch is separated from the previous one just to give an indication how
    much of the sparse infrastructure is *just* for hotplug memory.

    The section_mem_map doesn't really store a pointer. It stores something that
    is convenient to do some math against to get a pointer. It isn't valid to
    just do *section_mem_map, so I don't think it should be stored as a pointer.

    There are a couple of things I'd like to store about a section. First of all,
    the fact that it is !NULL does not mean that it is present. There could be
    such a combination where section_mem_map *is* NULL, but the math gets you
    properly to a real mem_map. So, I don't think that check is safe.

    Since we're storing 32-bit-aligned structures, we have a few bits in the
    bottom of the pointer to play with. Use one bit to encode whether there's
    really a mem_map there, and the other one to tell whether there's a valid
    section there. We need to distinguish between the two because sometimes
    there's a gap between when a section is discovered to be present and when we
    can get the mem_map for it.
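
    A sketch of that encoding (the names here are illustrative, not necessarily
    the ones used in the patch):

    #define SECTION_MARKED_PRESENT  (1UL << 0)  /* a section exists here */
    #define SECTION_HAS_MEM_MAP     (1UL << 1)  /* ...and its mem_map is set up */
    #define SECTION_MAP_MASK        (~(SECTION_MARKED_PRESENT | SECTION_HAS_MEM_MAP))

    static inline int present_section(struct mem_section *section)
    {
            return section && (section->section_mem_map & SECTION_MARKED_PRESENT);
    }

    static inline struct page *section_mem_map_addr(struct mem_section *section)
    {
            return (struct page *)(section->section_mem_map & SECTION_MAP_MASK);
    }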

    Signed-off-by: Dave Hansen
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Jack Steiner
    Signed-off-by: Bob Picco
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • The part of the sparsemem patch which modifies memmap_init_zone() has recently
    become a problem. It changes behavior so that there is a call to
    pfn_to_page() for each individual page inside of a node's range:
    node_start_pfn through node_end_pfn. It used to simply do this once, at the
    beginning of the node, but having sparsemem's non-contiguous mem_map[]s inside
    of a node made it necessary to change.

    Mike Kravetz recently wrote a patch which made the NUMA code accept some new
    kinds of layouts. The system's memory was laid out like this, with node 0's
    memory in two pieces: one before and one after node 1's memory:

    Node 0: +++++     +++++
    Node 1:      +++++

    Previous behavior before Mike's patch was to assign nodes like this:

    Node 0: 00000     XXXXX
    Node 1:      11111

    Where the 'X' areas were simply thrown away. The new behavior was to make the
    pg_data_t span node 0 across all of its areas, including areas that are really
    node 1's:

    Node 0: 000000000000000
    Node 1:      11111

    This wastes a little bit of mem_map space, but ends up being OK, and more
    fully utilizes the system's memory. memmap_init_zone() initializes all of the
    "struct page"s for node 0, even for the "hole", but those never get used,
    because there is no pfn_to_page() that resolves to those pages. However,
    since it calls pfn_to_page() only once, memmap_init_zone() always uses the
    pages that were allocated for node0->node_mem_map:

    struct page *start = pfn_to_page(start_pfn);
    // effectively start = &node->node_mem_map[0]
    for (page = start; page < (start + size); page++) {
            init_page_here();
            ...
    }

    Slow, and wasteful, but generally harmless.

    But, modify that to call pfn_to_page() for each loop iteration (like sparsemem
    does):

    for (pfn = start_pfn; pfn < (start_pfn + size); pfn++) {
            page = pfn_to_page(pfn);
    }

    And you end up trying to initialize node 1's pages too early, along with bogus
    data from node 0. This patch checks for those weird layouts and declines to
    touch the pages, making the more frequent pfn_to_page() calls OK to do.

    Signed-off-by: Dave Hansen
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
    mem_map[] is needed by discontiguous memory machines (like in the old
    CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
    replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
    become a complete replacement.

    A significant advantage over DISCONTIGMEM is that it's completely separated
    from CONFIG_NUMA. When producing this patch, it became apparent that NUMA
    and DISCONTIG are often confused.

    Another advantage is that sparse doesn't require each NUMA node's ranges to be
    contiguous. It can handle overlapping ranges between nodes with no problems,
    where DISCONTIGMEM currently throws away that memory.

    Sparsemem uses an array to provide different pfn_to_page() translations for
    each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
    to be chopped up.
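
    Conceptually the lookup is just an index into that array plus an add; a
    simplified sketch (the real code encodes extra state in the low bits of
    section_mem_map, and the exact names differ):

    static inline struct page *sparse_pfn_to_page(unsigned long pfn)
    {
            struct mem_section *ms = &mem_section[pfn >> PFN_SECTION_SHIFT];

            /* the per-section map is stored biased by the section's first
             * pfn, so adding the full pfn yields the right struct page */
            return ms->section_mem_map + pfn;
    }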

    In order to do quick pfn_to_page() operations, the section number of the page
    is encoded in page->flags. Part of the sparsemem infrastructure enables
    sharing of these bits more dynamically (at compile-time) between the
    page_zone() and sparsemem operations. However, on 32-bit architectures, the
    number of bits is quite limited, and may require growing the size of the
    page->flags type in certain conditions. Several things might force this to
    occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
    memory), an increase in the physical address space, or an increase in the
    number of used page->flags.

    One thing to note is that, once sparsemem is present, the NUMA node
    information no longer needs to be stored in the page->flags. It might provide
    speed increases on certain platforms and will be stored there if there is
    room. But, if out of room, an alternate (theoretically slower) mechanism is
    used.

    This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
    there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
    often have to compile out the same areas of code.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Martin Bligh
    Signed-off-by: Adrian Bunk
    Signed-off-by: Yasunori Goto
    Signed-off-by: Bob Picco
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Provide a default implementation for early_pfn_to_nid returning node 0. Allow
    architectures to override this with their own implementation out of
    asm/mmzone.h.
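
    The default is trivial; roughly (a sketch):

    /* generic fallback: an architecture that knows better defines its own
     * early_pfn_to_nid() in asm/mmzone.h before this is seen */
    #ifndef early_pfn_to_nid
    #define early_pfn_to_nid(pfn)   (0)
    #endif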

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Some confusion arose when working on the SPARSEMEM patch about what is
    needed for DISCONTIG vs. NUMA.

    Multiple pg_data_t's are needed for DISCONTIGMEM or NUMA, independently.
    All of the current NUMA implementations require an implementation of
    DISCONTIG. Because of this, quite a lot of code which is really needed for
    NUMA is actually under DISCONTIG #ifdefs. For SPARSEMEM, we changed some
    of these #ifdefs to CONFIG_NUMA, but that broke the DISCONTIG=y and NUMA=n
    case.

    Introducing this new NEED_MULTIPLE_NODES config option allows code that is
    needed for both NUMA or DISCONTIG to be separated out from code that is
    specific to DISCONTIG.
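
    The effect on code can be sketched like this (the node_data[] declaration
    stands in for whatever a particular architecture actually provides):

    #ifdef CONFIG_NEED_MULTIPLE_NODES       /* NUMA or DISCONTIGMEM */
    extern struct pglist_data *node_data[];
    #define NODE_DATA(nid)  (node_data[nid])
    #else                                   /* a single pg_data_t */
    extern struct pglist_data contig_page_data;
    #define NODE_DATA(nid)  (&contig_page_data)
    #endif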

    One great advantage of this approach is that it doesn't require every
    architecture to be converted over. All of the current implementations
    should "just work", only the ones implementing SPARSEMEM will have to be
    fixed up.

    The change to free_area_init() makes it work inside or outside of the new
    config option.

    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Generify the value fields in the page_flags. The aim is to allow the location
    and size of these fields to be varied. Additionally we want to move away from
    fixed allocations per field whilst still enforcing the overall bit utilisation
    limits. We rely on the compiler to spot and optimise the accessor functions.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Introduce a simple allocator for the NUMA remap space. This space is very
    scarce, used for structures which are best allocated node local.

    This mechanism is also used on non-NUMA ia64 systems with a vmem_map to keep
    the pgdat->node_mem_map initialized in a consistent place for all
    architectures.

    Issues:
    o alloc_remap() takes a node_id where we might expect a pgdat; this was
    intended to allow us to allocate the pgdats using this mechanism, which we
    do not yet do. We could have alloc_remap_node() and alloc_remap_nid() for
    this purpose.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • This patch effectively eliminates direct use of pgdat->node_mem_map outside
    of the DISCONTIG code. On a flat memory system, these fields aren't
    currently used, nor are they on a sparsemem system.

    There was also a node_mem_map(nid) macro on many architectures. Its use
    along with the use of ->node_mem_map itself was not consistent. It has
    been removed in favor of two new, more explicit, arch-independent macros:

    pgdat_page_nr(pgdat, pagenr)
    nid_page_nr(nid, pagenr)

    I called them "pgdat" and "nid" because we overload the term "node" to mean
    "NUMA node", "DISCONTIG node" or "pg_data_t" in very confusing ways. I
    believe the newer names are much clearer.

    These macros can be overridden in the sparsemem case with a theoretically
    slower operation using node_start_pfn and pfn_to_page(), instead. We could
    make this the only behavior if people want, but I don't want to change too
    much at once. One thing at a time.
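
    A sketch of the two macros (the exact config guards used in the patch are
    omitted here):

    /* flat/discontig case: index straight into the node's mem_map */
    #define pgdat_page_nr(pgdat, pagenr)  ((pgdat)->node_mem_map + (pagenr))
    /* sparsemem override (theoretically slower):
     *      pfn_to_page((pgdat)->node_start_pfn + (pagenr))             */
    #define nid_page_nr(nid, pagenr)      pgdat_page_nr(NODE_DATA(nid), (pagenr))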

    This patch removes more code than it adds.

    Compile tested on alpha, alpha discontig, arm, arm-discontig, i386, i386
    generic, NUMAQ, Summit, ppc64, ppc64 discontig, and x86_64. Full list
    here: http://sr71.net/patches/2.6.12/2.6.12-rc1-mhp2/configs/

    Boot tested on NUMAQ, x86 SMP and ppc64 power4/5 LPARs.

    Signed-off-by: Dave Hansen
    Signed-off-by: Martin J. Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

23 Jun, 2005

17 commits

  • Linus Torvalds
     
  • This patch is a follow up to patch 1 regarding "Selective Sub Address
    matching with call user data". It allows use of the Fast-Select-Acceptance
    optional user facility for X.25.

    This patch just implements fast select with no restriction on response
    (NRR). What this means (according to ITU-T Recommendation 10/96 section
    6.16) is that if in an incoming call packet, the relevant facility bits are
    set for fast-select-NRR, then the called DTE can issue a direct response to
    the incoming packet using a call-accepted packet that contains
    call-user-data. This patch allows such a response.

    The called DTE can also respond with a clear-request packet that contains
    call-user-data. However, this feature is currently not implemented by the
    patch.

    How is Fast Select Acceptance used?

    By default, the system does not allow fast select acceptance (as before).
    To enable a response to fast select acceptance, a listen socket is first
    created and bound as usual:

    call_soc = socket(AF_X25, SOCK_SEQPACKET, 0);
    bind(call_soc, (struct sockaddr *)&locl_addr, sizeof(locl_addr));

    but before the listen system call is made, the following ioctl should be used:

    ioctl(call_soc, SIOCX25CALLACCPTAPPRV);

    Now the listen system call can be made:

    listen(call_soc, 4);

    After this, an incoming-call packet will be accepted, but no call-accepted
    packet will be sent back until the following system call is made on the
    socket that accepts the call:

    ioctl(vc_soc, SIOCX25SENDCALLACCPT);

    The network (or the cisco xot router used for testing here) will allow the
    application server's call-user-data in the call-accepted packet, provided
    the call-request was made with Fast-select NRR.

    Signed-off-by: Shaun Pereira
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Shaun Pereira
     
  • From: Shaun Pereira

    This is the first (independent of the second) patch of two that I am
    working on with x25 on linux (tested with xot on a cisco router). Details
    are as follows.

    Current state of module:

    A server using the current implementation (2.6.11.7) of the x25 module will
    accept a call request/ incoming call packet at the listening x.25 address,
    from all callers to that address, as long as NO call user data is present
    in the packet header.

    If the server needs to choose to accept a particular call request/ incoming
    call packet arriving at its listening x25 address, then the kernel has to
    allow a match of call user data present in the call request packet with its
    own. This is required when multiple servers listen at the same x25 address
    and device interface. The kernel currently matches ALL call user data, if
    present.

    Current Changes:

    This patch is a follow up to the patch submitted previously by Andrew
    Hendry, and allows the user to selectively control the number of octets of
    call user data in the call request packet that the kernel will match. By
    default no call user data is matched, even if call user data is present.
    To allow call user data matching, a cudmatchlength > 0 has to be passed
    into the kernel after which the passed number of octets will be matched.
    Otherwise the kernel behavior is exactly as the original implementation.

    This patch also ensures that as is normally the case, no call user data
    will be present in the Call accepted / call connected packet sent back to
    the caller.

    Future Changes on next patch:

    There are cases however when call user data may be present in the call
    accepted packet. According to the X.25 recommendation (ITU-T 10/96)
    section 5.2.3.2 call user data may be present in the call accepted packet
    provided the fast select facility is used. My next patch will include this
    fast select utility and the ability to send up to 128 octets call user data
    in the call accepted packet provided the fast select facility is used. I
    am currently testing this, again with xot on linux and cisco.

    Signed-off-by: Shaun Pereira

    (With a fix from Alexey Dobriyan )
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Shaun Pereira
     
  • This patch provides support for registering multiple netpoll clients to the
    same network device. Only one of these clients may register an rx_hook,
    however. In practice, this restriction has not been problematic. It is
    worth mentioning, though, that the current design can be easily extended to
    allow for the registration of multiple rx_hooks.

    The basic idea of the patch is that the rx_np pointer in the netpoll_info
    structure points to the struct netpoll that has rx_hook filled in. Aside
    from this one case, there is no need for a pointer from the struct
    net_device to an individual struct netpoll.

    A lock is introduced to protect the setting and clearing of the rx_np
    pointer. The pointer will only be cleared upon netpoll client module
    removal, and the lock should be uncontested.

    Signed-off-by: Jeff Moyer
    Signed-off-by: David S. Miller

    Jeff Moyer
     
  • This patch introduces a netpoll_info structure, which the struct net_device
    will now point to instead of pointing to a struct netpoll. The reason for
    this is two-fold: 1) fields such as the rx_flags, poll_owner, and poll_lock
    should be maintained per net_device, not per netpoll; and 2) this is a first
    step in providing support for multiple netpoll clients to register against the
    same net_device.

    The struct netpoll is now pointed to by the netpoll_info structure. As
    such, the previous behaviour of the code is preserved.
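
    Pieced together from this and the previous entry, the new structure looks
    roughly like the following (a sketch, not quoted from the patch):

    struct netpoll_info {
            spinlock_t poll_lock;
            int poll_owner;
            int rx_flags;
            spinlock_t rx_lock;     /* protects rx_np (see the previous entry) */
            struct netpoll *rx_np;  /* the one client with an rx_hook, if any */
    };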

    Signed-off-by: Jeff Moyer
    Signed-off-by: David S. Miller

    Jeff Moyer
     
  • This trivial patch moves the assignment of poll_owner to -1 inside of
    the lock. This fixes a potential SMP race in the code.

    Signed-off-by: Jeff Moyer
    Signed-off-by: David S. Miller

    Jeff Moyer
     
  • Linus Torvalds
     
  • Ensure that lock owner structures are not released prematurely.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • If the lock blocks, the server may send us a GRANTED message that
    races with the reply to our LOCK request. Make sure that we catch
    the GRANTED by queueing up our request on the nlm_blocked list
    before we send off the first LOCK rpc call.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Basically copies the VFS's method for tracking writebacks and applies
    it to the struct nfs_page.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Even if the file is open for writes.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Unless we're doing O_APPEND writes, we really don't care about revalidating
    the file length. Just make sure that we catch any page cache invalidations.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Instead of looking at whether or not the file is open for writes before
    we agree to update the length using the server value, we should rather
    be looking at whether or not we are currently caching any writes.

    Failure to do so means in particular that we're not updating the file
    length correctly after obtaining a POSIX or BSD lock.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • NFSv3 currently returns the unsigned 64-bit cookie directly to
    userspace. The following patch causes the kernel to generate
    loff_t offsets for the benefit of userland.
    The current server-generated READDIR cookie is cached in the
    nfs_open_context instead of in filp->f_pos, so we still end up working
    correctly under directory insertions/deletions.

    Signed-off-by: Olivier Galibert
    Signed-off-by: Trond Myklebust

    Olivier Galibert
     
  • Attach acls to inodes in the icache to avoid unnecessary GETACL RPC
    round-trips. As long as the client doesn't retrieve any acls itself, only the
    default acls of existing directories and the default and access acls of new
    directories will end up in the cache, which saves some memory compared to
    always caching the access and default acl of all files.

    Signed-off-by: Andreas Gruenbacher
    Acked-by: Olaf Kirch
    Signed-off-by: Andrew Morton
    Signed-off-by: Trond Myklebust

    Andreas Gruenbacher