24 Jun, 2005

23 commits

  • Add a new `suid_dumpable' sysctl:

    This value can be used to query and set the core dump mode for setuid
    or otherwise protected/tainted binaries. The modes are

    0 - (default) - traditional behaviour. Any process which has changed
    privilege levels or is execute only will not be dumped

    1 - (debug) - all processes dump core when possible. The core dump is
    owned by the current user and no security is applied. This is intended
    for system debugging situations only. Ptrace is unchecked.

    2 - (suidsafe) - any binary which normally would not be dumped is dumped
    readable by root only. This allows the end user to remove such a dump but
    not access it directly. For security reasons core dumps in this mode will
    not overwrite one another or other files. This mode is appropriate when
    administrators are attempting to debug problems in a normal environment.
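
    As an illustration (not part of the patch itself), an administrator could
    switch to the suidsafe mode at runtime by writing to the new sysctl; this
    sketch assumes the usual /proc/sys/fs/suid_dumpable path:

    #include <stdio.h>

    /* minimal sketch: select mode 2 (suidsafe) via the new sysctl */
    int main(void)
    {
            FILE *f = fopen("/proc/sys/fs/suid_dumpable", "w");

            if (!f)
                    return 1;
            fprintf(f, "2\n");
            return fclose(f) ? 1 : 0;
    }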

    (akpm:

    > > +EXPORT_SYMBOL(suid_dumpable);
    >
    > EXPORT_SYMBOL_GPL?

    No problem to me.

    > > if (current->euid == current->uid && current->egid == current->gid)
    > > current->mm->dumpable = 1;
    >
    > Should this be SUID_DUMP_USER?

    Actually the feedback I had from last time was that the SUID_ defines
    should go because it's clearer to follow the numbers. They can go
    everywhere (and there are lots of places where dumpable is tested/used
    as a bool in untouched code)

    > Maybe this should be renamed to `dump_policy' or something. Doing that
    > would help us catch any code which isn't using the #defines, too.

    Fair comment. The patch was designed to be easy to maintain for Red Hat
    rather than for merging. Changing that field would create a gigantic
    diff because it is used all over the place.

    )

    Signed-off-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • In situations where a kprobes handler calls a routine which has a probe on it,
    kprobes_handler() disarms the new probe forever. This patch removes that
    limitation by temporarily disarming the new probe. When another probe hits
    while handling the old probe, kprobes_handler() saves the previous kprobes
    state and handles the new probe without calling the new probe's registered
    handlers. kprobe_post_handler() then restores the previous kprobes state and
    normal execution continues.

    However, on the x86_64 architecture, re-entrancy is provided only through
    pre_handler(). If a routine with a probe on it is referenced through
    post_handler(), then the probes on that routine are disarmed forever, since
    the exception stack gets changed after the processor single-steps the
    instruction of the new probe.

    This patch includes generic changes to support temporary disarming on
    reentrancy of probes.

    Signed-off-by: Prasanna S Panchamukhi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prasanna S Panchamukhi
     
  • This patch moves the lock/unlock of the arch specific kprobe_flush_task()
    to the non-arch specific kprobe_flush_task().

    Signed-off-by: Hien Nguyen
    Acked-by: Prasanna S Panchamukhi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hien Nguyen
     
  • The architecture independent code of the current kprobes implementation is
    arming and disarming kprobes at registration time. The problem is that the
    code is assuming that arming and disarming is just done by a simple write
    of some magic value to an address. This is problematic for ia64, where our
    instructions look more like structures, and we cannot insert breakpoints
    by just doing something like:

    *p->addr = BREAKPOINT_INSTRUCTION;

    The following patch to 2.6.12-rc4-mm2 adds two new architecture dependent
    functions:

    * void arch_arm_kprobe(struct kprobe *p)
    * void arch_disarm_kprobe(struct kprobe *p)

    and then adds the new functions for each of the architectures that already
    implement kprobes (sparc64/ppc64/i386/x86_64).
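
    For the "simple" architectures the new hooks amount to little more than the
    old inline write; a rough sketch of what the i386 version looks like (the
    icache flush details here are illustrative, not quoted from the patch):

    void arch_arm_kprobe(struct kprobe *p)
    {
            *p->addr = BREAKPOINT_INSTRUCTION;
            flush_icache_range((unsigned long) p->addr,
                               (unsigned long) p->addr + sizeof(kprobe_opcode_t));
    }

    void arch_disarm_kprobe(struct kprobe *p)
    {
            *p->addr = p->opcode;   /* restore the saved original instruction */
            flush_icache_range((unsigned long) p->addr,
                               (unsigned long) p->addr + sizeof(kprobe_opcode_t));
    }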

    I thought arch_[dis]arm_kprobe was the most descriptive of what was really
    happening, but each of the architectures already had a disarm_kprobe()
    function that was really a "disarm and do some other clean-up items as
    needed when you stumble across a recursive kprobe." So... I took the
    liberty of changing the code that was calling disarm_kprobe() to call
    arch_disarm_kprobe(), and then do the cleanup in the block of code dealing
    with the recursive kprobe case.

    So far this patch has been tested on i386, x86_64, and ppc64, but it still
    needs to be tested on sparc64.

    Signed-off-by: Rusty Lynch
    Signed-off-by: Anil S Keshavamurthy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Lynch
     
  • This patch adds function-return probes to kprobes for the i386
    architecture. This enables you to establish a handler to be run when a
    function returns.

    1. API

    Two new functions are added to kprobes:

    int register_kretprobe(struct kretprobe *rp);
    void unregister_kretprobe(struct kretprobe *rp);

    2. Registration and unregistration

    2.1 Register

    To register a function-return probe, the user populates the following
    fields in a kretprobe object and calls register_kretprobe() with the
    kretprobe address as an argument:

    kp.addr - the function's address

    handler - this function is run after the ret instruction executes, but
    before control returns to the return address in the caller.

    maxactive - The maximum number of instances of the probed function that
    can be active concurrently. For example, if the function is non-
    recursive and is called with a spinlock or mutex held, maxactive = 1
    should be enough. If the function is non-recursive and can never
    relinquish the CPU (e.g., via a semaphore or preemption), NR_CPUS should
    be enough. maxactive is used to determine how many kretprobe_instance
    objects to allocate for this particular probed function. If maxactive <= 0,
    it is set to a default value.
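
    A minimal sketch of a module using the new interface (the probed address,
    the register read in the handler, and the printk are illustrative only and
    not taken from the patch):

    #include <linux/kernel.h>
    #include <linux/module.h>
    #include <linux/kprobes.h>

    static unsigned long addr;      /* address of the function to probe,
                                     * e.g. looked up in System.map */
    module_param(addr, ulong, 0);

    static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
    {
            /* on i386 the return value is in eax */
            printk(KERN_INFO "probed function returned %ld\n", regs->eax);
            return 0;
    }

    static struct kretprobe my_rp = {
            .handler   = ret_handler,
            .maxactive = NR_CPUS,   /* non-recursive, may relinquish the CPU */
    };

    static int __init rp_init(void)
    {
            my_rp.kp.addr = (kprobe_opcode_t *) addr;
            return register_kretprobe(&my_rp);
    }

    static void __exit rp_exit(void)
    {
            unregister_kretprobe(&my_rp);
    }

    module_init(rp_init);
    module_exit(rp_exit);
    MODULE_LICENSE("GPL");
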
    Signed-off-by: Prasanna S Panchamukhi
    Signed-off-by: Frederik Deweerdt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hien Nguyen
     
  • Move some code duplicated in both callers into vfs_quota_on_mount

    Signed-off-by: Christoph Hellwig
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Patch to add a check to get_chrdev_list and get_blkdev_list to prevent reads
    of /proc/devices from spilling over the provided page if more than 4096
    bytes of string data are generated from all the registered character and
    block devices in a system.

    Signed-off-by: Neil Horman
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • Looks like locking can be optimised quite a lot. Increase lock widths
    slightly so lo_lock is taken fewer times per request. Also it was quite
    trivial to cover lo_pending with that lock, and remove the atomic
    requirement. This also makes memory ordering explicitly correct, which is
    nice (not that I particularly saw any mem ordering bugs).

    Test was reading 4 250MB files in parallel on ext2-on-tmpfs filesystem (1K
    block size, 4K page size). System is a 2-socket Xeon with HT (4 threads).

    intel:/home/npiggin# umount /dev/loop0 ; mount /dev/loop0 /mnt/loop ; /usr/bin/time ./mtloop.sh

    Before:
    0.24user 5.51system 0:02.84elapsed 202%CPU (0avgtext+0avgdata 0maxresident)k
    0.19user 5.52system 0:02.88elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k
    0.19user 5.57system 0:02.89elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k
    0.22user 5.51system 0:02.90elapsed 197%CPU (0avgtext+0avgdata 0maxresident)k
    0.19user 5.44system 0:02.91elapsed 193%CPU (0avgtext+0avgdata 0maxresident)k

    After:
    0.07user 2.34system 0:01.68elapsed 143%CPU (0avgtext+0avgdata 0maxresident)k
    0.06user 2.37system 0:01.68elapsed 144%CPU (0avgtext+0avgdata 0maxresident)k
    0.06user 2.39system 0:01.68elapsed 145%CPU (0avgtext+0avgdata 0maxresident)k
    0.06user 2.36system 0:01.68elapsed 144%CPU (0avgtext+0avgdata 0maxresident)k
    0.06user 2.42system 0:01.68elapsed 147%CPU (0avgtext+0avgdata 0maxresident)k

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • This patch creates a new kstrdup library function and changes the "local"
    implementations in several places to use this function.

    Most of the changes come from the sound and net subsystems. The sound part
    had already been acknowledged by Takashi Iwai and the net part by David S.
    Miller.

    I left UML alone for now because I would need more time to read the code
    carefully before making changes there.
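
    The conversion is mechanical; an illustrative before/after (variable names
    here are made up, not taken from the patch):

    /* before: open-coded duplication */
    new_name = kmalloc(strlen(name) + 1, GFP_KERNEL);
    if (new_name)
            strcpy(new_name, name);

    /* after: the new library helper */
    new_name = kstrdup(name, GFP_KERNEL);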

    Signed-off-by: Paulo Marques
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paulo Marques
     
  • Based on analysis and a patch from Russ Weight

    There is a race condition that can occur if an inode is allocated and then
    released (using iput) during the ->fill_super functions. The race
    condition is between kswapd and mount.

    For most filesystems this can only happen in an error path when kswapd is
    running concurrently. For isofs, however, the error can occur in a more
    common code path (which is how the bug was found).

    The logic here is "we want final iput() to free inode *now* instead of
    letting it sit in cache if fs is going down or had not quite come up". The
    problem is with kswapd seeing such inodes in the middle of being killed and
    happily taking over.

    The clean solution would be to tell kswapd to leave those inodes alone and
    let our final iput deal with them. I.e. add a new flag
    (I_FORCED_FREEING), set it before write_inode_now() there and make
    prune_icache() leave those alone.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Viro
     
  • This patch splits del_timer_sync() into 2 functions. The new one,
    try_to_del_timer_sync(), returns -1 when it hits an executing timer.

    It can be used in interrupt context, or when the caller holds locks which
    can prevent completion of the timer's handler.

    NOTE. Currently it can't be used in interrupt context in UP case, because
    ->running_timer is used only with CONFIG_SMP.

    Should the need arise, it is possible to kill the #ifdef CONFIG_SMP in
    set_running_timer(); it is cheap.
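
    A sketch of the intended calling pattern (the timer and lock names are made
    up for illustration):

    /* delete a timer whose handler may need a lock we are holding */
    spin_lock_irqsave(&my_lock, flags);
    while (try_to_del_timer_sync(&my_timer) < 0) {
            /* -1: the handler is running; back off so it can finish */
            spin_unlock_irqrestore(&my_lock, flags);
            cpu_relax();
            spin_lock_irqsave(&my_lock, flags);
    }
    spin_unlock_irqrestore(&my_lock, flags);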

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This patch tries to solve following problems:

    1. del_timer_sync() is racy. The timer can be fired again after
    del_timer_sync() has checked all CPUs and before it rechecks
    timer_pending().

    2. It has scalability problems. All CPUs are scanned to determine
    whether the timer is running on that CPU.

    With this patch del_timer_sync is O(1) and no slower than plain
    del_timer(pending_timer), unless it has to actually wait for
    completion of the currently running timer.

    The only restriction is that the recurring timer should not use
    add_timer_on().

    3. Timers are not serialized with respect to themselves.

    If CPU_0 does mod_timer(jiffies+1) while the timer is currently
    running on CPU_1, it is quite possible that the local timer interrupt
    on CPU_0 will start that timer before it has finished on CPU_1.

    4. The timer locking is suboptimal. __mod_timer() takes 3 locks
    at once and still requires wmb() in del_timer/run_timers.

    The new implementation takes 2 locks sequentially and does not
    need memory barriers.

    Currently ->base != NULL means that the timer is pending. In that case
    ->base.lock is used to lock the timer. __mod_timer also takes timer->lock
    because ->base can be == NULL.

    This patch uses timer->entry.next != NULL as indication that the timer is
    pending. So it does __list_del(), entry->next = NULL instead of list_del()
    when the timer is deleted.

    The ->base field is used for hashed locking only, it is initialized
    in init_timer() which sets ->base = per_cpu(tvec_bases). When the
    tvec_bases.lock is locked, it means that all timers which are tied
    to this base via timer->base are locked, and the base itself is locked
    too.

    So __run_timers/migrate_timers can safely modify all timers which could
    be found on ->tvX lists (pending timers).

    When the timer's base is locked, and the timer removed from ->entry list
    (which means that _run_timers/migrate_timers can't see this timer), it is
    possible to set timer->base = NULL and drop the lock: the timer remains
    locked.

    This patch adds lock_timer_base() helper, which waits for ->base != NULL,
    locks the ->base, and checks it is still the same.
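
    Roughly (a sketch; the exact code in the patch may differ slightly):

    static struct timer_base_s *lock_timer_base(struct timer_list *timer,
                                                unsigned long *flags)
    {
            struct timer_base_s *base;

            for (;;) {
                    base = timer->base;
                    if (base != NULL) {
                            spin_lock_irqsave(&base->lock, *flags);
                            if (base == timer->base)
                                    return base;    /* base didn't change, done */
                            /* the timer migrated under us; retry */
                            spin_unlock_irqrestore(&base->lock, *flags);
                    }
                    cpu_relax();
            }
    }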

    __mod_timer() schedules the timer on the local CPU and changes its base.
    However, it does not lock both old and new bases at once. It locks the
    timer via lock_timer_base(), deletes the timer, sets ->base = NULL, and
    unlocks old base. Then __mod_timer() locks new_base, sets ->base = new_base,
    and adds this timer. This simplifies the code, because AB-BA deadlock is not
    possible. __mod_timer() also ensures that the timer's base is not changed
    while the timer's handler is running on the old base.

    __run_timers() and del_timer() do not change ->base anymore; they only clear
    the pending flag.

    So del_timer_sync() can test timer->base->running_timer == timer to detect
    whether it is running or not.

    We don't need timer_list->lock anymore, this patch kills it.

    We also don't need barriers. del_timer() and __run_timers() used smp_wmb()
    before clearing timer's pending flag. It was needed because __mod_timer()
    did not lock old_base if the timer is not pending, so __mod_timer()->list_add()
    could race with del_timer()->list_del(). With this patch these functions are
    serialized through base->lock.

    One problem: TIMER_INITIALIZER can't use per_cpu(tvec_bases). So this patch
    adds a global

    struct timer_base_s {
            spinlock_t lock;
            struct timer_list *running_timer;
    } __init_timer_base;

    which is used by TIMER_INITIALIZER. The corresponding fields in the
    tvec_t_base_s struct are replaced by a struct timer_base_s t_base member.

    It is indeed ugly. But this can't have scalability problems. The global
    __init_timer_base.lock is used only when __mod_timer() is called for the first
    time AND the timer was compile time initialized. After that the timer migrates
    to the local CPU.

    Signed-off-by: Oleg Nesterov
    Acked-by: Ingo Molnar
    Signed-off-by: Renaud Lienhart
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Replace BLK_TAGS_PER_LONG with BITS_PER_LONG and remove unused BLK_TAGS_MASK.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • blk_queue_tag->real_max_depth was used to optimize out unnecessary
    allocations/frees on tag resize. However, the whole thing was very broken:
    tag_map was never allocated to real_max_depth, resulting in access beyond the
    end of the map, and bits in [max_depth..real_max_depth] were set when
    initializing a map and copied when resizing, resulting in pre-occupied tags.

    As the gain of the optimization is very small, well, almost nil, remove the
    whole thing.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Patch to allocate the control structures for ide devices on the node of
    the device itself (for NUMA systems). The patch depends on the Slab API
    change patch by Manfred and me (in mm) and the pcidev_to_node patch that I
    posted today.

    Does some realignment too.

    Signed-off-by: Justin M. Forbes
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pravin Shelar
    Signed-off-by: Shobhit Dayal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make sparse's initialization accessible at runtime. This allows sparse
    mappings to be created after boot in a hotplug situation.

    This patch is separated from the previous one just to give an indication how
    much of the sparse infrastructure is *just* for hotplug memory.

    The section_mem_map doesn't really store a pointer. It stores something that
    is convenient to do some math against to get a pointer. It isn't valid to
    just do *section_mem_map, so I don't think it should be stored as a pointer.

    There are a couple of things I'd like to store about a section. First of all,
    the fact that it is !NULL does not mean that it is present. There could be
    such a combination where section_mem_map *is* NULL, but the math gets you
    properly to a real mem_map. So, I don't think that check is safe.

    Since we're storing 32-bit-aligned structures, we have a few bits in the
    bottom of the pointer to play with. Use one bit to encode whether there's
    really a mem_map there, and the other one to tell whether there's a valid
    section there. We need to distinguish between the two because sometimes
    there's a gap between when a section is discovered to be present and when we
    can get the mem_map for it.
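
    A sketch of that encoding (the names here are illustrative, not necessarily
    the ones used in the patch):

    #define SECTION_MARKED_PRESENT  (1UL << 0)  /* a section exists here */
    #define SECTION_HAS_MEM_MAP     (1UL << 1)  /* ...and its mem_map is set up */
    #define SECTION_MAP_MASK        (~(SECTION_MARKED_PRESENT | SECTION_HAS_MEM_MAP))

    static inline int present_section(struct mem_section *section)
    {
            return section && (section->section_mem_map & SECTION_MARKED_PRESENT);
    }

    static inline struct page *section_mem_map_addr(struct mem_section *section)
    {
            return (struct page *)(section->section_mem_map & SECTION_MAP_MASK);
    }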

    Signed-off-by: Dave Hansen
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Jack Steiner
    Signed-off-by: Bob Picco
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • The part of the sparsemem patch which modifies memmap_init_zone() has recently
    become a problem. It changes behavior so that there is a call to
    pfn_to_page() for each individual page inside of a node's range:
    node_start_pfn through node_end_pfn. It used to simply do this once, at the
    beginning of the node, but having sparsemem's non-contiguous mem_map[]s inside
    of a node made it necessary to change.

    Mike Kravetz recently wrote a patch which made the NUMA code accept some new
    kinds of layouts. The system's memory was laid out like this, with node 0's
    memory in two pieces: one before and one after node 1's memory:

    Node 0: +++++     +++++
    Node 1:      +++++

    Previous behavior before Mike's patch was to assign nodes like this:

    Node 0: 00000     XXXXX
    Node 1:      11111

    Where the 'X' areas were simply thrown away. The new behavior was to make the
    pg_data_t span node 0 across all of its areas, including areas that are really
    node 1's:

    Node 0: 000000000000000
    Node 1:      11111

    This wastes a little bit of mem_map space, but ends up being OK, and more
    fully utilizes the system's memory. memmap_init_zone() initializes all of the
    "struct page"s for node 0, even for the "hole", but those never get used,
    because there is no pfn_to_page() that resolves to those pages. However,
    since it calls pfn_to_page() only once, memmap_init_zone() always uses the
    pages that were allocated for node0->node_mem_map:

    struct page *start = pfn_to_page(start_pfn);
    // effectively start = &node->node_mem_map[0]
    for (page = start; page < (start + size); page++) {
            init_page_here();
            ...
    }

    Slow, and wasteful, but generally harmless.

    But, modify that to call pfn_to_page() for each loop iteration (like sparsemem
    does):

    for (pfn = start_pfn; pfn < (start_pfn + size); pfn++) {
            page = pfn_to_page(pfn);
    }

    And you end up trying to initialize node 1's pages too early, along with bogus
    data from node 0. This patch checks for those weird layouts and declines to
    touch the pages, making the more frequent pfn_to_page() calls OK to do.

    Signed-off-by: Dave Hansen
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Sparsemem abstracts the use of discontiguous mem_maps[]. This kind of
    mem_map[] is needed by discontiguous memory machines (like in the old
    CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
    replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
    become a complete replacement.

    A significant advantage over DISCONTIGMEM is that it's completely separated
    from CONFIG_NUMA. When producing this patch, it became apparent that NUMA
    and DISCONTIG are often confused.

    Another advantage is that sparse doesn't require each NUMA node's ranges to be
    contiguous. It can handle overlapping ranges between nodes with no problems,
    where DISCONTIGMEM currently throws away that memory.

    Sparsemem uses an array to provide different pfn_to_page() translations for
    each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
    to be chopped up.
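
    Conceptually the lookup is just an index into that array plus an add; a
    simplified sketch (the real code encodes extra state in the low bits of
    section_mem_map, and the exact names differ):

    static inline struct page *sparse_pfn_to_page(unsigned long pfn)
    {
            struct mem_section *ms = &mem_section[pfn >> PFN_SECTION_SHIFT];

            /* the per-section map is stored biased by the section's first
             * pfn, so adding the full pfn yields the right struct page */
            return ms->section_mem_map + pfn;
    }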

    In order to do quick pfn_to_page() operations, the section number of the page
    is encoded in page->flags. Part of the sparsemem infrastructure enables
    sharing of these bits more dynamically (at compile-time) between the
    page_zone() and sparsemem operations. However, on 32-bit architectures, the
    number of bits is quite limited, and may require growing the size of the
    page->flags type in certain conditions. Several things might force this to
    occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
    memory), an increase in the physical address space, or an increase in the
    number of used page->flags.

    One thing to note is that, once sparsemem is present, the NUMA node
    information no longer needs to be stored in the page->flags. It might provide
    speed increases on certain platforms and will be stored there if there is
    room. But, if out of room, an alternate (theoretically slower) mechanism is
    used.

    This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
    there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
    often have to compile out the same areas of code.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Martin Bligh
    Signed-off-by: Adrian Bunk
    Signed-off-by: Yasunori Goto
    Signed-off-by: Bob Picco
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Provide a default implementation for early_pfn_to_nid returning node 0. Allow
    architectures to override this with their own implementation out of
    asm/mmzone.h.
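
    The default is trivial; roughly (a sketch):

    /* generic fallback: an architecture that knows better defines its own
     * early_pfn_to_nid() in asm/mmzone.h before this is seen */
    #ifndef early_pfn_to_nid
    #define early_pfn_to_nid(pfn)   (0)
    #endif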

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Some confusion arose when working on the SPARSEMEM patch about what is
    needed for DISCONTIG vs. NUMA.

    Multiple pg_data_t's are needed for DISCONTIGMEM or NUMA, independently.
    All of the current NUMA implementations require an implementation of
    DISCONTIG. Because of this, quite a lot of code which is really needed for
    NUMA is actually under DISCONTIG #ifdefs. For SPARSEMEM, we changed some
    of these #ifdefs to CONFIG_NUMA, but that broke the DISCONTIG=y and NUMA=n
    case.

    Introducing this new NEED_MULTIPLE_NODES config option allows code that is
    needed for both NUMA or DISCONTIG to be separated out from code that is
    specific to DISCONTIG.
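
    The effect on code can be sketched like this (the node_data[] declaration
    stands in for whatever a particular architecture actually provides):

    #ifdef CONFIG_NEED_MULTIPLE_NODES       /* NUMA or DISCONTIGMEM */
    extern struct pglist_data *node_data[];
    #define NODE_DATA(nid)  (node_data[nid])
    #else                                   /* a single pg_data_t */
    extern struct pglist_data contig_page_data;
    #define NODE_DATA(nid)  (&contig_page_data)
    #endif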

    One great advantage of this approach is that it doesn't require every
    architecture to be converted over. All of the current implementations
    should "just work", only the ones implementing SPARSEMEM will have to be
    fixed up.

    The change to free_area_init() makes it work inside or outside of the new
    config option.

    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Generify the value fields in the page_flags. The aim is to allow the location
    and size of these fields to be varied. Additionally we want to move away from
    fixed allocations per field whilst still enforcing the overall bit utilisation
    limits. We rely on the compiler to spot and optimise the accessor functions.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Introduce a simple allocator for the NUMA remap space. This space is very
    scarce, used for structures which are best allocated node local.

    This mechanism is also used on non-NUMA ia64 systems with a vmem_map to keep
    the pgdat->node_mem_map initialized in a consistent place for all
    architectures.

    Issues:
    o alloc_remap() takes a node_id where we might expect a pgdat; this was
    intended to allow us to allocate the pgdats using this mechanism, which we
    do not yet do. We could have alloc_remap_node() and alloc_remap_nid() for
    this purpose.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • This patch effectively eliminates direct use of pgdat->node_mem_map outside
    of the DISCONTIG code. On a flat memory system, these fields aren't
    currently used, nor are they on a sparsemem system.

    There was also a node_mem_map(nid) macro on many architectures. Its use
    along with the use of ->node_mem_map itself was not consistent. It has
    been removed in favor of two new, more explicit, arch-independent macros:

    pgdat_page_nr(pgdat, pagenr)
    nid_page_nr(nid, pagenr)

    I called them "pgdat" and "nid" because we overload the term "node" to mean
    "NUMA node", "DISCONTIG node" or "pg_data_t" in very confusing ways. I
    believe the newer names are much clearer.

    These macros can be overridden in the sparsemem case with a theoretically
    slower operation using node_start_pfn and pfn_to_page(), instead. We could
    make this the only behavior if people want, but I don't want to change too
    much at once. One thing at a time.
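
    A sketch of the two macros (the exact config guards used in the patch are
    omitted here):

    /* flat/discontig case: index straight into the node's mem_map */
    #define pgdat_page_nr(pgdat, pagenr)  ((pgdat)->node_mem_map + (pagenr))
    /* sparsemem override (theoretically slower):
     *      pfn_to_page((pgdat)->node_start_pfn + (pagenr))             */
    #define nid_page_nr(nid, pagenr)      pgdat_page_nr(NODE_DATA(nid), (pagenr))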

    This patch removes more code than it adds.

    Compile tested on alpha, alpha discontig, arm, arm-discontig, i386, i386
    generic, NUMAQ, Summit, ppc64, ppc64 discontig, and x86_64. Full list
    here: http://sr71.net/patches/2.6.12/2.6.12-rc1-mhp2/configs/

    Boot tested on NUMAQ, x86 SMP and ppc64 power4/5 LPARs.

    Signed-off-by: Dave Hansen
    Signed-off-by: Martin J. Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

23 Jun, 2005

17 commits

  • Linus Torvalds
     
  • This patch is a follow up to patch 1 regarding "Selective Sub Address
    matching with call user data". It allows use of the Fast-Select-Acceptance
    optional user facility for X.25.

    This patch just implements fast select with no restriction on response
    (NRR). What this means (according to ITU-T Recommendation 10/96 section
    6.16) is that if in an incoming call packet, the relevant facility bits are
    set for fast-select-NRR, then the called DTE can issue a direct response to
    the incoming packet using a call-accepted packet that contains
    call-user-data. This patch allows such a response.

    The called DTE can also respond with a clear-request packet that contains
    call-user-data. However, this feature is currently not implemented by the
    patch.

    How is Fast Select Acceptance used?

    By default, the system does not allow fast select acceptance (as before).
    To enable a response to fast select acceptance, a listen socket is first
    created and bound as usual:

    call_soc = socket(AF_X25, SOCK_SEQPACKET, 0);
    bind(call_soc, (struct sockaddr *)&locl_addr, sizeof(locl_addr));

    but before the listen system call is made, the following ioctl should be used:

    ioctl(call_soc, SIOCX25CALLACCPTAPPRV);

    Now the listen system call can be made:

    listen(call_soc, 4);

    After this, an incoming-call packet will be accepted, but no call-accepted
    packet will be sent back until the following system call is made on the
    socket that accepts the call:

    ioctl(vc_soc, SIOCX25SENDCALLACCPT);

    The network (or the cisco xot router used for testing here) will allow the
    application server's call-user-data in the call-accepted packet, provided
    the call-request was made with Fast-select NRR.

    Signed-off-by: Shaun Pereira
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Shaun Pereira
     
  • From: Shaun Pereira

    This is the first (independent of the second) patch of two that I am
    working on with x25 on linux (tested with xot on a cisco router). Details
    are as follows.

    Current state of module:

    A server using the current implementation (2.6.11.7) of the x25 module will
    accept a call request/ incoming call packet at the listening x.25 address,
    from all callers to that address, as long as NO call user data is present
    in the packet header.

    If the server needs to choose to accept a particular call request/ incoming
    call packet arriving at its listening x25 address, then the kernel has to
    allow a match of call user data present in the call request packet with its
    own. This is required when multiple servers listen at the same x25 address
    and device interface. The kernel currently matches ALL call user data, if
    present.

    Current Changes:

    This patch is a follow up to the patch submitted previously by Andrew
    Hendry, and allows the user to selectively control the number of octets of
    call user data in the call request packet that the kernel will match. By
    default no call user data is matched, even if call user data is present.
    To allow call user data matching, a cudmatchlength > 0 has to be passed
    into the kernel after which the passed number of octets will be matched.
    Otherwise the kernel behavior is exactly as the original implementation.

    This patch also ensures that as is normally the case, no call user data
    will be present in the Call accepted / call connected packet sent back to
    the caller.

    Future Changes on next patch:

    There are cases however when call user data may be present in the call
    accepted packet. According to the X.25 recommendation (ITU-T 10/96)
    section 5.2.3.2 call user data may be present in the call accepted packet
    provided the fast select facility is used. My next patch will include this
    fast select utility and the ability to send up to 128 octets call user data
    in the call accepted packet provided the fast select facility is used. I
    am currently testing this, again with xot on linux and cisco.

    Signed-off-by: Shaun Pereira

    (With a fix from Alexey Dobriyan )
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Shaun Pereira
     
  • This patch provides support for registering multiple netpoll clients to the
    same network device. Only one of these clients may register an rx_hook,
    however. In practice, this restriction has not been problematic. It is
    worth mentioning, though, that the current design can be easily extended to
    allow for the registration of multiple rx_hooks.

    The basic idea of the patch is that the rx_np pointer in the netpoll_info
    structure points to the struct netpoll that has rx_hook filled in. Aside
    from this one case, there is no need for a pointer from the struct
    net_device to an individual struct netpoll.

    A lock is introduced to protect the setting and clearing of the rx_np
    pointer. The pointer will only be cleared upon netpoll client module
    removal, and the lock should be uncontested.

    Signed-off-by: Jeff Moyer
    Signed-off-by: David S. Miller

    Jeff Moyer
     
  • This patch introduces a netpoll_info structure, which the struct net_device
    will now point to instead of pointing to a struct netpoll. The reason for
    this is two-fold: 1) fields such as the rx_flags, poll_owner, and poll_lock
    should be maintained per net_device, not per netpoll; and 2) this is a first
    step in providing support for multiple netpoll clients to register against the
    same net_device.

    The struct netpoll is now pointed to by the netpoll_info structure. As
    such, the previous behaviour of the code is preserved.
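
    Pieced together from this and the previous entry, the new structure looks
    roughly like the following (a sketch, not quoted from the patch):

    struct netpoll_info {
            spinlock_t poll_lock;
            int poll_owner;
            int rx_flags;
            spinlock_t rx_lock;     /* protects rx_np (see the previous entry) */
            struct netpoll *rx_np;  /* the one client with an rx_hook, if any */
    };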

    Signed-off-by: Jeff Moyer
    Signed-off-by: David S. Miller

    Jeff Moyer
     
  • This trivial patch moves the assignment of poll_owner to -1 inside of
    the lock. This fixes a potential SMP race in the code.

    Signed-off-by: Jeff Moyer
    Signed-off-by: David S. Miller

    Jeff Moyer
     
  • Linus Torvalds
     
  • Ensure that lock owner structures are not released prematurely.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • If the lock blocks, the server may send us a GRANTED message that
    races with the reply to our LOCK request. Make sure that we catch
    the GRANTED by queueing up our request on the nlm_blocked list
    before we send off the first LOCK rpc call.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Basically copies the VFS's method for tracking writebacks and applies
    it to the struct nfs_page.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Even if the file is open for writes.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Unless we're doing O_APPEND writes, we really don't care about revalidating
    the file length. Just make sure that we catch any page cache invalidations.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Instead of looking at whether or not the file is open for writes before
    we agree to update the length using the server value, we should rather
    be looking at whether or not we are currently caching any writes.

    Failure to do so means in particular that we're not updating the file
    length correctly after obtaining a POSIX or BSD lock.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • NFSv3 currently returns the unsigned 64-bit cookie directly to
    userspace. The following patch causes the kernel to generate
    loff_t offsets for the benefit of userland.
    The current server-generated READDIR cookie is cached in the
    nfs_open_context instead of in filp->f_pos, so we still end up working
    correctly under directory insertions/deletions.

    Signed-off-by: Olivier Galibert
    Signed-off-by: Trond Myklebust

    Olivier Galibert
     
  • Attach acls to inodes in the icache to avoid unnecessary GETACL RPC
    round-trips. As long as the client doesn't retrieve any acls itself, only the
    default acls of existing directories and the default and access acls of new
    directories will end up in the cache, which saves some memory compared to
    always caching the access and default acl of all files.

    Signed-off-by: Andreas Gruenbacher
    Acked-by: Olaf Kirch
    Signed-off-by: Andrew Morton
    Signed-off-by: Trond Myklebust

    Andreas Gruenbacher