02 Oct, 2006

34 commits

  • Documentation/kprobes.txt updated to reflect:

    o In-kernel symbol resolution
    o CONFIG_KALLSYMS dependency
    o Usage of JPROBE_ENTRY
    o Addition of regs_return_value()

    Also update the references list and usage examples to use correct module
    interfaces.

    Signed-off-by: Ananth N Mavinakayanahalli
    Acked-by: Jim Keniston
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ananth N Mavinakayanahalli
     
  • Add the regs_return_value() macro to extract the return value in an
    architecture agnostic manner, given the pt_regs.

    Other architecture maintainers may want to add similar helpers.

    Signed-off-by: Ananth N Mavinakayanahalli
    Signed-off-by: Anil S Keshavamurthy
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ananth N Mavinakayanahalli
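
    A minimal user-space sketch of the idea behind regs_return_value(): a
    handler reads the return value through one macro instead of naming a
    register. The struct mock_pt_regs layout and the mock_ names are
    assumptions made for illustration; they are not the kernel's definitions.

```c
#include <assert.h>

/* Hypothetical, cut-down register frame; the real struct pt_regs differs. */
struct mock_pt_regs {
    long ebx;
    long ecx;
    long eax;   /* assumed here to hold the function's return value */
};

/* The per-architecture part: one macro that knows where the value lives. */
#define mock_regs_return_value(regs) ((regs)->eax)

/* An architecture-neutral handler only ever uses the macro. */
static long mock_ret_handler(struct mock_pt_regs *regs)
{
    return mock_regs_return_value(regs);
}
```

    Porting the handler to another register layout would then mean changing
    only the macro, not the handler.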
     
  • kallsyms_lookup_name() allows for module:symbol style specification for
    looking up symbol addresses. Handle the case where the user specifies
    module:symbol on powerpc, given that 64-bit powerpc uses function
    descriptors.

    Signed-off-by: Anil S Keshavamurthy
    Signed-off-by: Ananth N Mavinakayanahalli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ananth N Mavinakayanahalli
     
  • In an effort to make kprobe modules more portable, here is a patch that:

    o Introduces the "symbol_name" field to struct kprobe.
      The symbol->address resolution now happens in the kernel in an
      architecture agnostic manner. 64-bit powerpc users no longer have
      to specify the ".symbols".
    o Introduces the "offset" field to struct kprobe to allow a user to
      specify an offset into a symbol.
    o The legacy mechanism of specifying the kprobe.addr is still supported.
      However, if both the kprobe.addr and kprobe.symbol_name are specified,
      probe registration fails with an -EINVAL.
    o The symbol resolution code uses kallsyms_lookup_name(). So
      CONFIG_KPROBES now depends on CONFIG_KALLSYMS.
    o Apparently kprobe modules were the only legitimate out-of-tree user of
      the kallsyms_lookup_name() EXPORT. Now that the symbol resolution
      happens in-kernel, remove the EXPORT as suggested by Christoph Hellwig.
    o Modify tcp_probe.c that uses the kprobe interface so as to make it
      work on multiple platforms (in its earlier form, the code wouldn't
      work, say, on powerpc).

    Signed-off-by: Ananth N Mavinakayanahalli
    Signed-off-by: Prasanna S Panchamukhi
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ananth N Mavinakayanahalli
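
    The registration rules in the entry above can be sketched in user space
    as follows. Everything prefixed mock_ is hypothetical: the symbol table
    stands in for kallsyms_lookup_name(), and rejecting an unresolved symbol
    with -EINVAL is an assumption, since the entry only states the result
    for the case where both fields are set.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct mock_kprobe {
    uintptr_t   addr;         /* legacy: explicit address, 0 if unset */
    const char *symbol_name;  /* new: symbol to be resolved centrally */
    unsigned    offset;       /* new: byte offset into the symbol */
};

/* Stand-in for the in-kernel symbol lookup; returns 0 if unknown. */
static uintptr_t mock_lookup(const char *name)
{
    static const struct { const char *name; uintptr_t addr; } table[] = {
        { "do_fork", 0x1000 },
        { "tcp_sendmsg", 0x2000 },
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].addr;
    return 0;
}

/* Resolve the final probe address, or return a negative errno. */
static int mock_resolve_probe(struct mock_kprobe *p)
{
    if (p->addr && p->symbol_name)
        return -EINVAL;            /* both given: ambiguous, refuse */
    if (p->symbol_name)
        p->addr = mock_lookup(p->symbol_name);
    if (!p->addr)
        return -EINVAL;            /* assumption: nothing to probe */
    p->addr += p->offset;
    return 0;
}
```

    A caller would fill in either symbol_name (optionally with offset) or
    addr, never both.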
     
  • Replaces the pid_t value with a struct pid to avoid pid wrap around
    problems.

    Signed-off-by: Cedric Le Goater
    Cc: Martin Schwidefsky
    Acked-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • The problem with remembering a user space process by its pid is that it is
    possible that the process will exit and pid wrap around will occur.
    Converting to a struct pid avoids that problem, and paves the way for
    implementing a pid namespace.

    Also since usb is the only user of kill_proc_info_as_uid rename
    kill_proc_info_as_uid to kill_pid_info_as_uid and have the new version take
    a struct pid.

    Signed-off-by: Eric W. Biederman
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
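
    Why a reference-counted object is safer than a remembered number can be
    shown with a small user-space model. The mock_ types and functions below
    are hypothetical simplifications, not the kernel's struct pid API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct mock_task;

struct mock_pid {
    int               nr;        /* the number user space sees */
    int               refcount;  /* holders keeping this object alive */
    struct mock_task *task;      /* NULL once the process has exited */
};

struct mock_task {
    const char      *comm;
    struct mock_pid *pid;
};

static struct mock_pid *mock_get_pid(struct mock_pid *pid)
{
    if (pid)
        pid->refcount++;
    return pid;
}

static void mock_put_pid(struct mock_pid *pid)
{
    if (pid && --pid->refcount == 0)
        free(pid);
}

/* Attach a fresh pid object carrying number nr to a task. */
static struct mock_pid *mock_alloc_pid(int nr, struct mock_task *task)
{
    struct mock_pid *pid = malloc(sizeof *pid);
    if (!pid)
        abort();
    pid->nr = nr;
    pid->refcount = 1;           /* the task's own reference */
    pid->task = task;
    task->pid = pid;
    return pid;
}

/* Process exit: detach the task and drop the task's reference. */
static void mock_exit_task(struct mock_task *task)
{
    struct mock_pid *pid = task->pid;
    pid->task = NULL;
    task->pid = NULL;
    mock_put_pid(pid);
}

/* "Signal" a remembered pid: name of the target, or NULL if it is gone. */
static const char *mock_kill_pid(struct mock_pid *pid)
{
    return (pid && pid->task) ? pid->task->comm : NULL;
}
```

    A holder that kept a reference before the process exited sees "gone"
    rather than whichever unrelated process later received the same number.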
     
  • This has been needed for a long time, but now with the advent of a
    reference counted struct pid there are real consequences for getting this
    wrong.

    Someone (I think it was Oleg Nesterov) pointed out that this construct was
    missing locking when I introduced struct pid. After taking time to review
    the locking construct already present I figured out which lock needs to be
    taken. The other paths that access f_owner.pid take either the f_owner
    read or the write lock.

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Message queues can signal a process waiting for a message.

    This patch replaces the pid_t value with a struct pid to avoid pid wrap
    around problems.

    Signed-off-by: Cedric Le Goater
    Acked-by: Eric Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • This updates my "proc: readdir race fix (take 3)" patch
    to account for the changes made by Sukadev Bhattiprolu
    to introduce struct pspace.

    Signed-off-by: Eric W. Biederman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Define a per-container pid space object. And create one instance of this
    object, init_pspace, to define the entire pid space. Subsequent patches
    will provide/use interfaces to create/destroy pid spaces.

    It's a subset/rework of Eric Biederman's patch
    http://lkml.org/lkml/2006/2/6/285 .

    Signed-off-by: Eric Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Move struct pidmap and PIDMAP_ENTRIES to a new file, include/linux/pspace.h
    where it will be used in subsequent patches to define pid spaces.

    It's a subset of Eric Biederman's patch http://lkml.org/lkml/2006/2/6/285

    [akpm@osdl.org: cleanups]
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • I think it is hardly possible to read the current do_each_task_pid(). The
    new version is much simpler and makes the code smaller.

    Only the do_each_task_pid change is tested, the do_each_pid_task isn't.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Use struct pidmap instead of pidmap_t.

    This updates my "proc: readdir race fix (take 3)" patch
    to account for the changes made by Sukadev Bhattiprolu
    to kill pidmap_t.

    Signed-off-by: Eric W. Biederman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Use struct pidmap instead of pidmap_t.

    It's a subset of Eric Biederman's patch http://lkml.org/lkml/2006/2/6/271.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sukadev Bhattiprolu
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • As part of an SMP cleanliness pass over UML, I consted a bunch of
    structures in order to not have to document their locking. One of these
    structures was a struct tty_operations. In order to const it in UML
    without introducing compiler complaints, the declaration of
    tty_set_operations needs to be changed, and then all of its callers need to
    be fixed.

    This patch declares all struct tty_operations in the tree as const. In all
    cases, they are static and used only as input to tty_set_operations. As an
    extra check, I ran an i386 allyesconfig build which produced no extra
    warnings.

    53 drivers are affected. I checked the history of a bunch of them, and in
    most cases, there have been only a handful of maintenance changes in the
    last six months. serial_core.c was the busiest one that I looked at.

    Signed-off-by: Jeff Dike
    Acked-by: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Dike
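
    A sketch of the shape of this change, using hypothetical mock_ types
    rather than the real tty structures: once the setter takes a pointer to
    const, a driver's operations table can be declared static const without
    compiler complaints. Copying the members into the driver is an
    assumption about how the setter behaves.

```c
#include <assert.h>
#include <stddef.h>

struct mock_tty_operations {
    int (*open)(void);
    int (*write)(const char *buf, int count);
};

struct mock_tty_driver {
    int (*open)(void);
    int (*write)(const char *buf, int count);
};

static int mock_open(void)
{
    return 0;
}

static int mock_write(const char *buf, int count)
{
    (void)buf;
    return count;   /* pretend every byte was written */
}

/* Read-only after build time, so no locking rules need documenting. */
static const struct mock_tty_operations mock_console_ops = {
    .open  = mock_open,
    .write = mock_write,
};

/* The const qualifier on op is the point of the sketch. */
static void mock_tty_set_operations(struct mock_tty_driver *driver,
                                    const struct mock_tty_operations *op)
{
    driver->open  = op->open;
    driver->write = op->write;
}
```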
     
  • Only touch inode's i_mtime and i_ctime to make them equal to "now" in case
    they aren't yet (don't just update timestamp unconditionally). Uninline
    the hash function to save 259 Bytes.

    This tiny inode change which may improve cache behaviour also shaves off 8
    Bytes from file_update_time() on i386.

    Included a tiny codestyle cleanup, too.

    Signed-off-by: Andreas Mohr
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Mohr
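
    The conditional timestamp update can be sketched like this. The
    mock_inode structure and its dirtied counter are hypothetical stand-ins
    for the real inode and for marking it dirty.

```c
#include <assert.h>

struct mock_timespec {
    long tv_sec;
    long tv_nsec;
};

struct mock_inode {
    struct mock_timespec i_mtime;
    struct mock_timespec i_ctime;
    int dirtied;   /* counts how often the inode had to be marked dirty */
};

static int mock_timespec_equal(const struct mock_timespec *a,
                               const struct mock_timespec *b)
{
    return a->tv_sec == b->tv_sec && a->tv_nsec == b->tv_nsec;
}

/* Store "now" only into fields that differ, and dirty only on a change. */
static void mock_file_update_time(struct mock_inode *inode,
                                  struct mock_timespec now)
{
    int changed = 0;

    if (!mock_timespec_equal(&inode->i_mtime, &now)) {
        inode->i_mtime = now;
        changed = 1;
    }
    if (!mock_timespec_equal(&inode->i_ctime, &now)) {
        inode->i_ctime = now;
        changed = 1;
    }
    if (changed)
        inode->dirtied++;
}
```

    Skipping the store when nothing changes is what avoids touching the
    cache line needlessly.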
     
  • Everybody passes valid pointer there.

    Signed-off-by: Alexey Dobriyan
    Acked-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • File handles can be requested to send sigio and sigurg to processes. By
    tracking the destination processes using struct pid instead of pid_t we make
    the interface safe from all potential pid wrap around problems.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • I took a good hard look at the locking and it appears the locking on vt_pid
    is the console semaphore. Every modified path is called under the console
    semaphore except reset_vc when it is called from fn_SAK or do_SAK both of
    which appear to be in interrupt context. In addition I need to be careful
    because in the presence of an oops the console_sem may be arbitrarily
    dropped.

    Which leads me to conclude the current locking is inadequate for my needs.

    Given the weird cases we could hit because of oops printing instead of
    introducing an extra spin lock to protect the data and keep the pid to
    signal and the signal to send in sync, I have opted to use xchg on just the
    struct pid * pointer instead.

    Due to console_sem we will stay in sync between vt_pid and vt_mode except
    for a small window during a SAK, or oops handling. SAK handling should
    kill any user space process that cares, and in oops handling we are broken
    anyway. Besides the worst that can happen is that I try to send the wrong
    signal.

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
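
    The pointer swap described in the entry above can be approximated in
    user space with C11 atomics. atomic_exchange() stands in for the
    kernel's xchg here, and mock_vt_pid and mock_swap_vt_pid() are
    hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct mock_pid {
    int nr;
};

/* The single shared slot; zero-initialised, so it starts out NULL. */
static _Atomic(struct mock_pid *) mock_vt_pid;

/*
 * Atomically install new_pid and hand back the previous pointer, so that
 * exactly one caller ends up responsible for dropping the old reference.
 */
static struct mock_pid *mock_swap_vt_pid(struct mock_pid *new_pid)
{
    return atomic_exchange(&mock_vt_pid, new_pid);
}
```

    Because the exchange is a single atomic step, two racing callers can
    never both receive the same old pointer, which is the property that
    replaces the extra spin lock.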
     
  • This is such a rare path it took me a while to figure out how to test
    this after sorting out the locking.

    This patch does several things.
    - The variables used are moved into a structure and declared in vt_kern.h
    - A spinlock is added so we don't have SMP races updating the values.
    - Instead of a raw pid_t value a struct pid is used to guard against
      pid wrap around issues, if the daemon to spawn a new console dies.

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • As we stop storing pid_t's and move to storing struct pid *, we need a way to
    get the pid_t from the struct pid to report to user space what we have stored.

    Having a clean well defined way to do this is especially important as we move
    to multiple pid spaces, as we may need to report a different value to the caller
    depending on which pid space the caller is in.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
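
    A minimal sketch of such a conversion helper, assuming a simplified
    mock_pid structure. Returning 0 for a NULL pointer is an assumption
    chosen so callers can report "nothing stored" without a special case.

```c
#include <assert.h>
#include <stddef.h>

struct mock_pid {
    int nr;   /* the numeric value to report to user space */
};

/* Translate a stored pid object back into a number; NULL maps to 0. */
static int mock_pid_nr(const struct mock_pid *pid)
{
    return pid ? pid->nr : 0;
}
```

    With multiple pid spaces, a helper like this is the single place where a
    per-caller translation could later be added.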
     
  • pids aren't something that drivers should care about. However there are a lot
    of helper layers in the kernel that do care, and are built as modules. Before
    I can convert them to using struct pid instead of pid_t I need to export the
    appropriate symbols so they can continue to be built.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Currently the signal functions all either take a task or a pid_t argument.
    This patch implements variants that take a struct pid *. After all of the
    users have been updated it is my intention to remove the variants that take a
    pid_t, as using pid_t can be more work (an extra hash table lookup) and
    difficult to get right in the presence of multiple pid namespaces.

    There are two kinds of functions introduced in this patch. There are the
    general use functions kill_pgrp and kill_pid which take a priv argument that
    is ultimately used to create the appropriate siginfo information. Then there
    are __kill_pgrp_info, kill_pgrp_info, kill_pid_info, the internal implementation
    helpers that take an explicit siginfo.

    The distinction is made because filling out an explicit siginfo is tricky, and
    will be even more tricky when pid namespaces are introduced.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • To avoid pid rollover confusion the kernel needs to work with struct pid *
    instead of pid_t. Currently there is not an iterator that walks through all
    of the tasks of a given pid type starting with a struct pid. This prevents us
    from replacing some pid_t instances with struct pid. So this patch adds
    do_each_pid_task which walks through the set of tasks for a given pid type
    starting with a struct pid.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
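
    An iterator of this kind can be sketched as a pair of macros over plain
    linked lists. The mock_ structures and macro names are hypothetical; the
    real iterator works on the kernel's own list types and locking rules,
    which this user-space model ignores.

```c
#include <assert.h>
#include <stddef.h>

enum mock_pid_type { MOCK_PIDTYPE_PID, MOCK_PIDTYPE_PGID, MOCK_PIDTYPE_MAX };

struct mock_task {
    int               id;
    struct mock_task *next[MOCK_PIDTYPE_MAX];   /* one chain per pid type */
};

struct mock_pid {
    struct mock_task *tasks[MOCK_PIDTYPE_MAX];  /* chain heads */
};

/* Open the loop: visits nothing when pid is NULL. */
#define mock_do_each_pid_task(pid, type, task)                    \
    do {                                                          \
        if ((pid) != NULL)                                        \
            for ((task) = (pid)->tasks[type]; (task) != NULL;     \
                 (task) = (task)->next[type]) {

/* Close the loop opened by mock_do_each_pid_task(). */
#define mock_while_each_pid_task(pid, type, task)                 \
            }                                                     \
    } while (0)

/* Example user: count the tasks attached to pid for one pid type. */
static int mock_count_tasks(struct mock_pid *pid, enum mock_pid_type type)
{
    struct mock_task *task;
    int n = 0;

    mock_do_each_pid_task(pid, type, task) {
        n++;
    } mock_while_each_pid_task(pid, type, task);

    return n;
}
```

    Starting from the pid object rather than a number means no lookup is
    needed before the walk begins.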
     
  • In the last round of cleaning up the pid hash table a more general struct pid
    was introduced, one that can be reference counted.

    With the more general struct pid most if not all places where we store a pid_t
    we can now store a struct pid * and remove the need for a hash table lookup,
    and avoid any possible problems with pid roll over.

    Looking forward to the pid namespaces, struct pid * gives us an absolute form of
    a pid so we can compare and use them without caring which pid namespace we are
    in.

    This patchset introduces the infrastructure needed to use struct pid instead
    of pid_t, and then it goes on to convert two different kernel users that
    currently store a pid_t value.

    There are a lot more places to go but this is enough to get the basic idea.

    Before we can merge a pid namespace patch all of the kernel pid_t users need
    to be examined. Those that deal with user space processes need to be
    converted to using a struct pid *. Those that deal with kernel processes need
    to be converted to using the kthread api. A rare few that only use their current
    process's pid values get to be left alone.

    This patch:

    task_session returns the struct pid of a task's session.
    task_pgrp returns the struct pid of a task's process group.
    task_tgid returns the struct pid of a task's thread group.
    task_pid returns the struct pid of a task's process id.

    These can be used to avoid unnecessary hash table lookups, and to implement
    safe pid comparisons in the face of a pid namespace.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Helper functions in base.c like proc_pident_readdir and proc_pident_lookup
    assume the directories have an associated task, and cannot currently be used
    on the /proc root directory because it does not have such a task.

    This small change allows for base.c to be simplified, and later when multiple
    pid spaces are introduced it makes getting the needed context information
    trivial.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • Currently proc_pident_lookup gets the names and types from a table and then
    has a huge switch statement to get the inode and file operations it needs.
    That is silly and is becoming increasingly hard to maintain so I just put all
    of the information in the table.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
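
    The move from a switch statement to a fully populated table can be
    sketched as below. The entry names, the mode values and the mock_
    identifiers are hypothetical; the point is only that the lookup returns
    a row that already carries its operations.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*mock_read_fn)(void);

static int mock_read_status(void)  { return 1; }
static int mock_read_cmdline(void) { return 2; }

/* One row holds everything the lookup needs: name, mode and operations. */
struct mock_pid_entry {
    const char  *name;
    unsigned     mode;
    mock_read_fn read;
};

static const struct mock_pid_entry mock_entries[] = {
    { "status",  0444, mock_read_status  },
    { "cmdline", 0444, mock_read_cmdline },
};

/* No switch: adding an attribute means adding one table row. */
static const struct mock_pid_entry *mock_pident_lookup(const char *name)
{
    size_t n = sizeof mock_entries / sizeof mock_entries[0];

    for (size_t i = 0; i < n; i++)
        if (strcmp(mock_entries[i].name, name) == 0)
            return &mock_entries[i];
    return NULL;
}
```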
     
  • There were enough changes in my last round of cleaning up proc I had to break
    up the patch series into smaller chunks, and my last chunk never got resent.

    This patchset gives proc dynamic inode numbers (the static inode numbers were
    a pain to maintain and prevent all kinds of things), and removes the horrible
    switch statements that had to be kept in sync with everything else. Being
    fully table driven takes us 90% of the way to being able to register new
    process specific attributes in proc.

    This patch:

    Group the functions by what they implement instead of by type of operation.
    As it existed base.c was quickly approaching the point where it could not be
    followed.

    No functionality or code changes aside from adding/removing forward
    declarations are implemented in this patch.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • The problem: An opendir, readdir, closedir sequence can fail to report
    process ids that are continually in use throughout the sequence of system
    calls. For this race to trigger the process that proc_pid_readdir stops at
    must exit before readdir is called again.

    This can cause ps to fail to report processes, and it is in violation of
    posix guarantees and normal application expectations with respect to
    readdir.

    Currently there is no way to work around this problem in user space short
    of providing a gargantuan buffer to user space so the directory read all
    happens in one system call.

    This patch implements the normal directory semantics for proc, which
    guarantee that a directory entry that is neither created nor destroyed
    while reading the directory will be returned. For directory entries that
    are either created or destroyed during the readdir you may or may not see
    them. Furthermore you may seek to a directory offset you have previously seen.

    These are the guarantees that ext[23] provides and that posix requires, and
    more importantly that user space expects. Plus it is a simple semantic on
    which to implement reliable service. It is just a matter of calling readdir
    a second time if you are wondering if something new has shown up.

    These better semantics are implemented by scanning through the pids in
    numerical order and by making the file offset a pid plus a fixed offset.

    The pid scan happens on the pid bitmap, which when you look at it is
    remarkably efficient for a brute force algorithm. A typical cache line
    is 64 bytes and thus covers space for 64*8 == 512 pids, so there are
    only 64 cache lines for the entire 32K pid space. A typical system
    will have 100 pids or more, so this is actually fewer cache lines than we
    would have to look at to scan a linked list, and the worst case of having
    to scan the entire pid bitmap is pretty reasonable.

    If we need something more efficient we can go to a more efficient data
    structure for indexing the pids, but for now what we have should be
    sufficient.

    In addition this takes no additional locks and is actually less code than
    what we are doing now.

    Also another very subtle bug in this area has been fixed. It is possible
    to catch a task in the middle of de_thread where a thread is assuming the
    identity of its thread group leader. This patch carefully handles that case
    so if we hit it we don't fail to return the pid that is undergoing the
    de_thread dance.

    Thanks to KAMEZAWA Hiroyuki for providing the first fix, pointing this
    out and working on it.

    [oleg@tv-sign.ru: fix it]
    Signed-off-by: Eric W. Biederman
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Oleg Nesterov
    Cc: Jean Delvare
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
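
    The scan described in the entry above can be modelled in user space:
    pids live in a bitmap, each readdir step resumes from a file offset that
    encodes "next pid to consider", and a 32K pid space needs
    32768 / (64*8) == 64 cache lines of bitmap. MOCK_FIRST_OFFSET and all
    mock_ names are assumptions for illustration.

```c
#include <assert.h>
#include <limits.h>

#define MOCK_PID_MAX       32768
#define MOCK_FIRST_OFFSET  256   /* hypothetical room for non-pid entries */
#define MOCK_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

static unsigned long mock_pidmap[MOCK_PID_MAX / MOCK_BITS_PER_WORD];

static void mock_set_pid(int nr)
{
    mock_pidmap[nr / MOCK_BITS_PER_WORD] |= 1UL << (nr % MOCK_BITS_PER_WORD);
}

static void mock_clear_pid(int nr)
{
    mock_pidmap[nr / MOCK_BITS_PER_WORD] &= ~(1UL << (nr % MOCK_BITS_PER_WORD));
}

/* Smallest in-use pid >= from, or -1 if there is none. */
static int mock_next_pid(int from)
{
    for (int nr = from; nr < MOCK_PID_MAX; nr++)
        if (mock_pidmap[nr / MOCK_BITS_PER_WORD] &
            (1UL << (nr % MOCK_BITS_PER_WORD)))
            return nr;
    return -1;
}

/*
 * One readdir step: report the next live pid at or after the position
 * encoded in *pos, then advance *pos past it. Returns -1 at the end.
 */
static int mock_readdir_step(long *pos)
{
    int from = *pos < MOCK_FIRST_OFFSET ? 0 : (int)(*pos - MOCK_FIRST_OFFSET);
    int nr = mock_next_pid(from);

    if (nr < 0)
        return -1;
    *pos = (long)nr + 1 + MOCK_FIRST_OFFSET;
    return nr;
}
```

    Because the position is a number rather than a pointer to a particular
    process, the exit of the process where the previous call stopped cannot
    cause later live pids to be skipped.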
     
  • When listing loaded modules during an oops or panic, also list each
    module's Tainted flags if non-zero (P: Proprietary or F: Forced load only).

    If a module did not taint the kernel, it is just listed like
        usbcore
    but if it did taint the kernel, it is listed like
        wizmodem(PF)

    Example:
    [ 3260.121718] Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
    [ 3260.121729] [] :dump_test:proc_dump_test+0x99/0xc8
    [ 3260.121742] PGD fe8d067 PUD 264a6067 PMD 0
    [ 3260.121748] Oops: 0002 [1] SMP
    [ 3260.121753] CPU 1
    [ 3260.121756] Modules linked in: dump_test(P) snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ide_cd generic ohci1394 snd_hda_intel snd_hda_codec snd_pcm snd_timer snd ieee1394 snd_page_alloc piix ide_core arcmsr aic79xx scsi_transport_spi usblp
    [ 3260.121785] Pid: 5556, comm: bash Tainted: P 2.6.18-git10 #1

    [Alternatively, I can look into listing tainted flags with 'lsmod',
    but that won't help in oopsen/panics so much.]

    [akpm@osdl.org: cleanup]
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
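
    The listing format from the entry above can be reproduced with a small
    formatter. The flag constants and the function name are hypothetical;
    only the output shape, name followed by (P), (F) or (PF), comes from
    the commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MOCK_TAINT_PROPRIETARY 0x1u
#define MOCK_TAINT_FORCED      0x2u

/* Write "name" or "name(FLAGS)" into buf, truncating if buf is too small. */
static void mock_format_module(char *buf, size_t len,
                               const char *name, unsigned taints)
{
    int n = snprintf(buf, len, "%s", name);

    taints &= MOCK_TAINT_PROPRIETARY | MOCK_TAINT_FORCED;
    if (taints == 0 || n < 0 || (size_t)n >= len)
        return;   /* untainted, or no room left for the suffix */

    snprintf(buf + n, len - (size_t)n, "(%s%s)",
             (taints & MOCK_TAINT_PROPRIETARY) ? "P" : "",
             (taints & MOCK_TAINT_FORCED) ? "F" : "");
}
```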
     
  • The exported kernel interfaces of genpool allocator need to adhere to
    the requirements of kernel-doc.

    Signed-off-by: Dean Nelson
    Cc: Steve Wise
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dean Nelson
     
  • Modules using the genpool allocator need to be able to destroy the data
    structure when unloading.

    Signed-off-by: Steve Wise
    Cc: Randy Dunlap
    Cc: Dean Nelson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steve Wise
     
  • The test for the error from pcmcia_replace_cis() was incorrect, and
    would always trigger (because if an error didn't happen, the "ret" value
    would not be zero, it would be the passed-in count).

    Reported and debugged by Fabrice Bellet

    Rather than just fix the single broken test, make the code in question
    use an understandable code-sequence instead, fixing the whole function
    to be more readable.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
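
    The class of mistake described above can be illustrated as follows. This
    is not the driver's code: the function names, the fail parameter and the
    exact control flow are assumptions, chosen only to show how a result
    variable that is pre-loaded with the byte count defeats a "non-zero
    means error" test.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for the operation that can fail; 0 on success. */
static int mock_replace_cis(int fail)
{
    return fail ? -EINVAL : 0;
}

/* Buggy shape: ret starts as count, so the error test always fires. */
static long mock_store_cis_buggy(size_t count, int fail)
{
    long ret = (long)count;

    if (mock_replace_cis(fail))
        ret = -EIO;
    if (ret)              /* intended as "did it fail?" */
        return -EIO;
    return (long)count;
}

/* Clearer shape: keep the error in its own variable. */
static long mock_store_cis_fixed(size_t count, int fail)
{
    int error = mock_replace_cis(fail);

    if (error)
        return -EIO;
    return (long)count;
}
```

    Keeping the error code and the success value in separate variables makes
    the test read the way it is meant.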
     
  • It's not clear how this thinko got through..

    Cc: Olaf Hering
    Cc: David Brownell
    Cc: Alessandro Zummo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

01 Oct, 2006

6 commits

  • * master.kernel.org:/pub/scm/linux/kernel/git/davej/agpgart:
    [AGPGART] printk fixups.
    [AGPGART] Use pci_get_slot not pci_find_slot

    Linus Torvalds
     
  • * master.kernel.org:/pub/scm/linux/kernel/git/davej/cpufreq:
    [CPUFREQ] Make acpi-cpufreq unsticky again.
    [CPUFREQ] longhaul: remove duplicated code.
    [CPUFREQ] Longhaul - Disable arbiter CLE266
    [CPUFREQ] Fix section mismatch warning
    [CPUFREQ] Fix cut-n-paste bug in suspend printk

    Linus Torvalds
     
  • While tracking down a PAE compile failure, I found that config.h was being
    included in a bunch of places in i386 code. It is no longer necessary, so
    drop it.

    Signed-off-by: Zachary Amsden
    Cc: Rusty Russell
    Cc: Jeremy Fitzhardinge
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • Add a pte_update_hook which notifies about pte changes that have been made
    without using the set_pte / clear_pte interfaces. This allows shadow mode
    hypervisors which do not trap on page table access to maintain synchronized
    shadows.

    It also turns out, there was one pte update in PAE mode that wasn't using any
    accessor interface at all for setting NX protection. Considering it is PAE
    specific, and the accessor is i386 specific, I didn't want to add a generic
    encapsulation of this behavior yet.

    Signed-off-by: Zachary Amsden
    Cc: Rusty Russell
    Cc: Jeremy Fitzhardinge
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • Now that ptep_establish has a definition in PAE i386 3-level paging code, the
    only paging model which is insane enough to have multi-word hardware PTEs
    which are not efficient to set atomically, we can remove the ghost of
    set_pte_atomic from other architectures which falsely duplicated it, and
    remove all knowledge of it from the generic pgtable code.

    set_pte_atomic is now a private pte operator which is specific to i386.

    Signed-off-by: Zachary Amsden
    Cc: Rusty Russell
    Cc: Jeremy Fitzhardinge
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     
  • The ptep_establish macro is only used on user-level PTEs, for P->P mapping
    changes. Since these always happen under protection of the pagetable lock,
    the strong synchronization of a 64-bit cmpxchg is not needed, in fact, not
    even a lock prefix needs to be used. We can simply instead clear the P-bit,
    followed by a normal set. The write ordering is still important to avoid the
    possibility of the TLB snooping a partially written PTE and getting a bad
    mapping installed.

    Signed-off-by: Zachary Amsden
    Cc: Rusty Russell
    Cc: Jeremy Fitzhardinge
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
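
    The write ordering described above (clear the present bit, then store
    the new entry so that it only becomes present with its final store) can
    be sketched in user space. The two-word mock_pae_pte layout is a
    simplification, and the C11 release fences are stand-ins for the
    kernel's write barriers, not the same primitive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MOCK_PAGE_PRESENT 0x1u

/* Two-word entry; the present bit is assumed to live in the low word. */
struct mock_pae_pte {
    volatile uint32_t pte_low;
    volatile uint32_t pte_high;
};

/*
 * Replace a present entry with another one without a locked 64-bit
 * operation. An observer sees the old entry, a non-present entry, or the
 * complete new entry, but never a mix of old and new halves marked present.
 */
static void mock_set_pte_present(struct mock_pae_pte *ptep,
                                 uint32_t low, uint32_t high)
{
    ptep->pte_low = 0;                           /* 1: entry not present */
    atomic_thread_fence(memory_order_release);
    ptep->pte_high = high;                       /* 2: new high half */
    atomic_thread_fence(memory_order_release);
    ptep->pte_low = low;                         /* 3: present again */
}
```

    The sketch relies on the caller holding whatever lock serialises
    writers, as the commit message notes for the pagetable lock.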