10 Mar, 2006

1 commit

  • Fix some bugs in mtd/jffs2 on 64bit platform.

    The MEMGETBADBLOCK/MEMSETBADBLOCK ioctls are not listed in compat_ioctl.h.

    And some variables in jffs2 are declared as uint32_t but used to hold
    size_t values.

    Signed-off-by: Atsushi Nemoto
    Cc: Thomas Gleixner
    Acked-by: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Atsushi Nemoto
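
    The uint32_t/size_t half of the fix above can be illustrated in user
    space. This is a minimal sketch, not the jffs2 code: both helper names are
    hypothetical, and uint64_t stands in for a 64-bit size_t so the effect is
    the same on every platform.

```c
#include <stdint.h>

/* Hypothetical helper: park a 64-bit length in a 32-bit variable, the
 * way a uint32_t holding a size_t value would on a 64-bit platform.
 * The upper 32 bits are silently dropped. */
static uint64_t roundtrip_through_u32(uint64_t len)
{
    uint32_t narrow = (uint32_t)len;   /* truncation happens here */
    return narrow;
}

/* Same round trip through a variable that is wide enough. */
static uint64_t roundtrip_through_u64(uint64_t len)
{
    uint64_t wide = len;
    return wide;
}
```

    With a length of 0x100000005 the first helper returns 5 and the second
    returns the original value.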
     

09 Mar, 2006

4 commits

  • I have benchmarked this on an x86_64 NUMA system and see no significant
    performance difference on kernbench. Tested on both x86_64 and powerpc.

    The way we do file struct accounting is not very suitable for batched
    freeing. For scalability reasons, file accounting was
    constructor/destructor based. This meant that nr_files was decremented
    only when the object was removed from the slab cache. This is susceptible
    to slab fragmentation. With RCU based file structure, consequent batched
    freeing and a test program like Serge's, we just speed this up and end up
    with a very fragmented slab -

    llm22:~ # cat /proc/sys/fs/file-nr
    587730 0 758844

    At the same time, I see only 2000+ objects in the filp cache. The following
    patch fixes this problem.

    This patch changes the file counting by removing the filp_count_lock.
    Instead we use a separate percpu counter, nr_files, for now and all
    accesses to it are through get_nr_files() api. In the sysctl handler for
    nr_files, we populate files_stat.nr_files before returning to user.

    Counting files as and when they are created and destroyed (as opposed to
    inside slab) allows us to correctly count open files with RCU.

    Signed-off-by: Dipankar Sarma
    Cc: "Paul E. McKenney"
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dipankar Sarma
     
  • This patch adds new tunables for RCU queue and finished batches. There are
    two types of controls - number of completed RCU updates invoked in a batch
    (blimit) and monitoring for high rate of incoming RCUs on a cpu (qhimark,
    qlowmark).

    By default, the per-cpu batch limit is set to a small value. If the input
    RCU rate exceeds the high watermark, we do two things - force quiescent
    state on all cpus and set the batch limit of the CPU to INT_MAX. Setting
    batch limit to INT_MAX forces all finished RCUs to be processed in one shot.
    If we have more than INT_MAX RCUs queued up, then we have bigger problems
    anyway. Once the incoming queued RCUs fall below the low watermark, the
    batch limit is set to the default.

    Signed-off-by: Dipankar Sarma
    Cc: "Paul E. McKenney"
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dipankar Sarma
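
    The watermark behaviour described above can be sketched as a small state
    machine. This is an illustration only: the struct, the function names and
    the three numeric values are assumptions, since the commit message names
    the tunables (blimit, qhimark, qlowmark) but not their defaults or the
    code that uses them.

```c
#include <limits.h>

/* Illustrative values; the real defaults are not given in the message. */
enum { BLIMIT_DEFAULT = 10, QHIMARK = 10000, QLOWMARK = 100 };

struct rcu_cpu_sketch {
    long qlen;      /* callbacks currently queued on this CPU */
    int  blimit;    /* max finished callbacks invoked per batch */
    int  force_qs;  /* nonzero when a quiescent state should be forced */
};

/* Queue one callback; above the high watermark, lift the batch limit
 * and ask for quiescent states to be forced. */
static void sketch_enqueue(struct rcu_cpu_sketch *c)
{
    c->qlen++;
    if (c->qlen > QHIMARK) {
        c->blimit = INT_MAX;
        c->force_qs = 1;
    }
}

/* Invoke up to blimit callbacks and return how many were processed.
 * Once the queue falls below the low watermark, restore the default. */
static long sketch_process(struct rcu_cpu_sketch *c)
{
    long n = c->qlen < c->blimit ? c->qlen : c->blimit;

    c->qlen -= n;
    if (c->blimit == INT_MAX && c->qlen < QLOWMARK) {
        c->blimit = BLIMIT_DEFAULT;
        c->force_qs = 0;
    }
    return n;
}
```

    The two marks give hysteresis: the limit is lifted only under a burst and
    is not restored until the backlog has actually drained.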
     
  • Implement percpu_counter_sum(). This is a more accurate but slower version of
    percpu_counter_read_positive().

    We need this for Alex's speedup-ext3_statfs patch and for the nr_file
    accounting fix. Otherwise these things would be too inaccurate on large CPU
    counts.

    Cc: Ravikiran G Thirumalai
    Cc: Alex Tomas
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
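
    The trade-off between the fast read and the accurate sum can be sketched
    in user space as follows. All names, the CPU count and the batch size are
    assumptions made for the illustration, and the locking a real per-cpu
    counter needs is left out.

```c
#define NR_CPUS_SKETCH 4
#define BATCH_SKETCH   32

struct pcc_sketch {
    long count;                  /* shared value, only approximately right */
    long local[NR_CPUS_SKETCH];  /* per-CPU deltas not yet folded in */
};

/* Add to one CPU's delta; fold it into the shared count only once the
 * delta reaches the batch size, so most updates stay CPU-local. */
static void pcc_add(struct pcc_sketch *c, int cpu, long amount)
{
    long v = c->local[cpu] + amount;

    if (v >= BATCH_SKETCH || v <= -BATCH_SKETCH) {
        c->count += v;
        c->local[cpu] = 0;
    } else {
        c->local[cpu] = v;
    }
}

/* Fast read: shared count only, clamped at zero. */
static long pcc_read_positive(const struct pcc_sketch *c)
{
    return c->count > 0 ? c->count : 0;
}

/* Slow read: shared count plus every CPU's pending delta. */
static long pcc_sum(const struct pcc_sketch *c)
{
    long sum = c->count;

    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        sum += c->local[cpu];
    return sum > 0 ? sum : 0;
}
```

    In this sketch the fast read can be off by nearly BATCH_SKETCH per CPU,
    which is why the error grows with the CPU count.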
     
  • They aren't used (nor even really usable) outside of pipe.c anyway

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Mar, 2006

4 commits

  • Systems with extremely large numbers of nodes or cpus need to kmalloc
    structures larger than is currently supported. This patch increases the
    maximum supported size for very large systems.

    This patch should have no effect on current systems.

    (akpm: why not just use alloc_pages() for sysfs_cpus?)

    Signed-off-by: Jack Steiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
     
  • include/linux/memory_hotplug.h:53: warning: 'struct page' declared inside parameter list

    (akpm: I tossed in a couple more possibly-needed-sometime struct decls too)

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Also from Thomas Gleixner

    Function next_timer_interrupt() got broken by a recent patch,
    6ba1b91213e81aa92b5cf7539f7d2a94ff54947c, as sys_nanosleep() was moved to
    hrtimer. This broke things because next_timer_interrupt() did not check the
    hrtimer tree for the next event.

    Function next_timer_interrupt() is needed with dyntick (CONFIG_NO_IDLE_HZ,
    VST) implementations, as the system can be idle when the next hrtimer event
    is supposed to happen. At least ARM and S390 currently use
    next_timer_interrupt().

    Signed-off-by: Thomas Gleixner
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Lindgren
     
  • Add new PCI IDs for HFC-S PCI based ISDN TA 'Primux II S0' and 'Primux II S0'
    from Gerdes AG

    Signed-off-by: Martin Bachem
    Signed-off-by: Karsten Keil
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Karsten Keil
     

03 Mar, 2006

1 commit

  • The bitmaps associated with generation numbers for directory entries
    are declared as an array of ints. On some platforms, this causes alignment
    exceptions.

    The following patch uses the standard bitmap declaration macros to
    declare the bitmaps, fixing the problem.

    Originally from Takashi Iwai.

    Signed-off-by: Takashi Iwai
    Acked-by: Jeff Mahoney
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
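
    The difference between an int array and a macro-declared bitmap can be
    sketched in user space. The macros below are stand-ins written for this
    illustration (hence the _SKETCH suffix), not the kernel's definitions.
    The point they show is that the bitmap is an array of unsigned long, so
    it gets the alignment that long-sized bit operations expect.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG_SKETCH    (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS_SKETCH(n) \
    (((n) + BITS_PER_LONG_SKETCH - 1) / BITS_PER_LONG_SKETCH)

/* Declare a bitmap of `bits` bits as an array of unsigned long. */
#define DECLARE_BITMAP_SKETCH(name, bits) \
    unsigned long name[BITS_TO_LONGS_SKETCH(bits)]

/* Set bit nr; every access is a whole, naturally aligned unsigned long. */
static void sketch_set_bit(unsigned long *map, size_t nr)
{
    map[nr / BITS_PER_LONG_SKETCH] |= 1UL << (nr % BITS_PER_LONG_SKETCH);
}

/* Return 1 if bit nr is set, 0 otherwise. */
static int sketch_test_bit(const unsigned long *map, size_t nr)
{
    return (int)((map[nr / BITS_PER_LONG_SKETCH] >>
                  (nr % BITS_PER_LONG_SKETCH)) & 1UL);
}
```

    An array of int holding the same bits is only guaranteed int alignment,
    which is presumably what triggered the alignment exceptions on the
    affected platforms.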
     

01 Mar, 2006

3 commits


28 Feb, 2006

1 commit

  • The nfnetlink_log infrastructure changes broke compatibility of the LOG
    targets. They currently use whatever log backend was registered first,
    which means that if ipt_ULOG was loaded first, no messages will be printed
    to the ring buffer anymore.

    Restore compatibility by using the old log functions by default and only use
    the nf_log backend if the user explicitly said so.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

25 Feb, 2006

2 commits

  • Linus Torvalds
     
  • I'm currently at the POSIX meeting and one thing covered was the
    incompatibility of Linux's link() with the POSIX definition: Linux does
    not follow symlinks, while POSIX requires that it does.

    Even if somebody thinks this is a good default behavior we cannot change this
    because it would break the ABI. But the fact remains that some applications
    might want this behavior.

    We have one chance to help implement this without breaking the behavior.
    For this we could use the new linkat interface which would need a new
    flags parameter. If the new parameter is AT_SYMLINK_FOLLOW the new
    behavior could be invoked.

    I do not want to introduce such a patch now. But we could add the
    parameter now and simply not use it. The patch below would do this. Can we
    get this late patch applied before the release more or less fixes the
    syscall API?

    Signed-off-by: Ulrich Drepper
    Signed-off-by: Ralf Baechle
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper
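
    From user space, the flags parameter discussed above would be used roughly
    as in the sketch below. The wrapper name is hypothetical. Note that this
    commit only reserves the parameter, so the call with AT_SYMLINK_FOLLOW
    behaves as described only on kernels that actually implement the flag.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>    /* AT_FDCWD, AT_SYMLINK_FOLLOW */
#include <unistd.h>   /* linkat() */

/* Hypothetical wrapper: create the hard link newpath for oldpath.
 * If oldpath is a symlink and follow is nonzero, link to the file the
 * symlink points at (the POSIX link() behaviour). With follow == 0,
 * link the symlink itself (the traditional Linux behaviour).
 * Returns 0 on success, -1 with errno set on failure. */
static int link_maybe_follow(const char *oldpath, const char *newpath,
                             int follow)
{
    return linkat(AT_FDCWD, oldpath, AT_FDCWD, newpath,
                  follow ? AT_SYMLINK_FOLLOW : 0);
}
```

    Passing the choice as a flag to linkat() leaves link() itself untouched,
    which is how the ABI concern raised in the message is avoided.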
     

23 Feb, 2006

3 commits


22 Feb, 2006

2 commits


21 Feb, 2006

5 commits

  • The compat syscalls are added to sys_ni.c since they are not defined if the
    above CONFIG options are off. Also, nfs would not build with CONFIG_SYSCTL
    off.

    Noticed by Arthur Othieno.

    Signed-off-by: Stephen Rothwell
    Cc: "David S. Miller"
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • Signed-off-by: Luke Yang
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luke Yang
     
  • Currently, acpi video options can only be set on the kernel command line.
    That's a little inflexible; I'd like a userland s2ram application that just
    works, and modifying the kernel command line according to a whitelist is
    not fun. It is better to just allow the s2ram application to set video
    options just before suspend (according to the whitelist).

    This implements a sysctl to allow setting suspend video options without reboot.

    (akpm: Documentation updates for this new sysctl are pending..)

    Signed-off-by: Pavel Machek
    Cc: "Brown, Len"
    Cc: "Antonino A. Daplas"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Machek
     
  • Some allocations are restricted to a limited set of nodes (due to memory
    policies or cpuset constraints). If the page allocator is not able to find
    enough memory then that does not mean that overall system memory is low.

    In particular going postal and more or less randomly shooting at processes
    is not likely going to help the situation but may just lead to suicide (the
    whole system coming down).

    It is better to signal to the process that no memory exists given the
    constraints that the process (or the configuration of the process) has
    placed on the allocation behavior. The process may be killed but then the
    sysadmin or developer can investigate the situation. The solution is
    similar to what we do when running out of hugepages.

    This patch adds a check before we kill processes. At that point
    performance considerations do not matter much so we just scan the zonelist
    and reconstruct a list of nodes. If the list of nodes does not contain all
    online nodes then this is a constrained allocation and we should kill the
    current process.

    Signed-off-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
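
    The check described above can be sketched as follows. This is an
    illustration under stated assumptions: nodes are modelled as bits in a
    64-bit mask, the zonelist as a plain array of node ids below 64, and the
    type and function names are invented for the example.

```c
#include <stdint.h>

/* Assumption: at most 64 nodes, bit n set means node n is present. */
typedef uint64_t nodemask_sketch_t;

/* zone_nodes[i] is the node that zone i of the zonelist belongs to.
 * Rebuild the set of nodes the allocation was allowed to use and
 * compare it with the set of online nodes. Returns 1 when some online
 * node is missing, i.e. the allocation was constrained. */
static int allocation_is_constrained(const int *zone_nodes, int nr_zones,
                                     nodemask_sketch_t online)
{
    nodemask_sketch_t seen = 0;

    for (int i = 0; i < nr_zones; i++)
        seen |= (nodemask_sketch_t)1 << zone_nodes[i];

    return (online & ~seen) != 0;
}
```

    When this returns 1, running out of memory says nothing about the rest of
    the system, so only the current process is a sensible target.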
     
  • This patch makes ata_for_each_sg() start with pad_sgent when
    qc->n_elem is zero. Previously, ata_for_each_sg() unconditionally
    started with qc->__sg, handing the first sg to fill_sg() routines
    even when the entry was invalid. And while at it, unwind ?: in
    ata_qc_next_sg() into an if statement.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jeff Garzik

    Tejun Heo
     

18 Feb, 2006

2 commits

  • This provides an interface for arch code to find out how many
    nanoseconds are going to be added on to xtime by the next call to
    do_timer. The value returned is a fixed-point number in 52.12 format
    in nanoseconds. The reason for this format is that it gives the
    full precision that the timekeeping code is using internally.

    The motivation for this is to fix a problem that has arisen on 32-bit
    powerpc in that the value returned by do_gettimeofday drifts apart
    from xtime if NTP is being used. PowerPC is now using a lockless
    do_gettimeofday based on reading the timebase register and performing
    some simple arithmetic. (This method of getting the time is also
    exported to userspace via the VDSO.) However, the factor and offset
    it uses were calculated based on the nominal tick length and weren't
    being adjusted when NTP varied the tick length.

    Note that 64-bit powerpc has had the lockless do_gettimeofday for a
    long time now. It also had an extremely hairy routine that got called
    from the 32-bit compat routine for adjtimex, which adjusted the
    factor and offset according to what it thought the timekeeping code
    was going to do. Not only was this only called if a 32-bit task did
    adjtimex (i.e. not if a 64-bit task did adjtimex), it was also
    duplicating computations from kernel/timer.c and it wasn't clear that
    it was (still) correct.

    The simple solution is to ask the timekeeping code how long the
    current jiffy will be on each timer interrupt, after calling
    do_timer. If this jiffy will be a different length from the last one,
    we then need to compute new values for the factor and offset used in
    the lockless do_gettimeofday. In this way we can keep xtime and
    do_gettimeofday in sync, even when NTP is varying the tick length.

    Note that when adjtimex varies the tick length, it almost always
    introduces the variation from the next tick on. The only case I could
    see where adjtimex would vary the length of the current tick is when
    an old-style adjtime adjustment is being cancelled. (It's not clear
    to me why the adjustment has to be cancelled immediately rather than
    from the next tick on.) Thus I don't see any real need for a hook in
    adjtimex; the rare case of an old-style adjustment being cancelled can
    be fixed up at the next tick.

    Signed-off-by: Paul Mackerras
    Acked-by: john stultz
    Signed-off-by: Linus Torvalds

    Paul Mackerras
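
    The 52.12 format mentioned above keeps 12 fractional bits below the whole
    nanoseconds, so one unit is 1/4096 ns. A minimal sketch of converting to
    and from it, with hypothetical helper names and an unsigned 64-bit value
    assumed:

```c
#include <stdint.h>

#define TICK_FRAC_BITS 12   /* 52.12: the low 12 bits are the fraction */

/* Whole nanoseconds to 52.12 fixed point. */
static uint64_t ns_to_fixed(uint64_t ns)
{
    return ns << TICK_FRAC_BITS;
}

/* Whole-nanosecond part of a 52.12 value. */
static uint64_t fixed_whole_ns(uint64_t v)
{
    return v >> TICK_FRAC_BITS;
}

/* Fractional part, in units of 1/4096 ns. */
static uint32_t fixed_frac(uint64_t v)
{
    return (uint32_t)(v & ((UINT64_C(1) << TICK_FRAC_BITS) - 1));
}
```

    For example, a nominal 1,000,000 ns tick is 4,096,000,000 in this format,
    and adding 2048 to that value lengthens the tick by half a nanosecond.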
     
  • AMD SimNow!'s JIT doesn't like them at all in the guest. For distribution
    installation it's easiest if it's a boot time option.

    Also I moved the variable to a more appropriate place and made
    it independent from sysctl

    And marked __read_mostly which it is.

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

16 Feb, 2006

5 commits


15 Feb, 2006

4 commits

  • To find out if a packet needs to be handled by IPsec after SNAT, packets
    are currently rerouted in POST_ROUTING and a new xfrm lookup is done. This
    breaks SNAT of non-unicast packets to non-local addresses because the
    packet is routed as incoming packet and no neighbour entry is bound to the
    dst_entry. In general, it seems to be a bad idea to replace the dst_entry
    after the packet was already sent to the output routine because its state
    might not match what's expected.

    This patch changes the xfrm lookup in POST_ROUTING to re-use the original
    dst_entry without routing the packet again. This means no policy routing
    can be used for transport mode transforms (which keep the original route)
    when packets are SNATed to match the policy, but it looks like the best
    we can do for now.

    Signed-off-by: Patrick McHardy
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Revert commit d7102e95b7b9c00277562c29aad421d2d521c5f6:

    [PATCH] sched: filter affine wakeups

    Apparently caused more than 10% performance regression for aim7 benchmark.
    The setup in use is 16-cpu HP rx8620, 64GB of memory and 12 MSA1000s with 144
    disks. Each disk is 72GB with a single ext3 filesystem (courtesy of HP, who
    supplied benchmark results).

    The problem is, for aim7, the wake-up pattern is random, but it still needs
    load balancing action in the wake-up path to achieve best performance. With
    the above commit, lack of load balancing hurts that workload.

    However, for workloads like database transaction processing, the requirement
    is exactly opposite. In the wake up path, best performance is achieved with
    absolutely zero load balancing. We simply wake up the process on the CPU
    where it previously ran. Worst performance is obtained when we do load
    balancing at wake up.

    There isn't an easy way to auto detect the workload characteristics. Ingo's
    earlier patch that detects idle CPUs and decides whether to load balance or
    not doesn't perform with aim7 either since all CPUs are busy (it causes an
    even bigger perf. regression).

    Revert commit d7102e95b7b9c00277562c29aad421d2d521c5f6, which causes more
    than 10% performance regression with aim7.

    Signed-off-by: Ken Chen
    Acked-by: Ingo Molnar
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     
  • If 2 threads attached to the same process are blocking on different locks on
    different files (maybe even on different servers) but have the same lock
    arguments (i.e. same offset+length - actually quite common, since most
    processes try to lock the entire file) then the first GRANTED call that wakes
    one up will also wake the other.

    Currently when the NLM_GRANTED callback comes in, lockd walks the list of
    blocked locks in search of a match to the lock that the NLM server has
    granted. Although it checks the lock pid, start and end, it fails to check
    the filehandle and the server address.

    By checking the filehandle and server IP address, we ensure that this only
    happens if the locks truly are referencing the same file.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Trond Myklebust
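
    The stricter match described above can be sketched like this. The struct
    layout, the IPv4-only address and both function names are assumptions for
    the illustration; they are not lockd's data structures.

```c
#include <stdint.h>
#include <string.h>

#define FH_MAX_SKETCH 64

struct blocked_lock_sketch {
    uint32_t pid;
    uint64_t start, end;
    uint32_t server_addr;              /* assumed IPv4 server address */
    unsigned int fh_len;
    unsigned char fh[FH_MAX_SKETCH];   /* opaque file handle */
};

/* Old behaviour: only pid, start and end are compared. */
static int old_match(const struct blocked_lock_sketch *a,
                     const struct blocked_lock_sketch *b)
{
    return a->pid == b->pid && a->start == b->start && a->end == b->end;
}

/* Fixed behaviour: the server and the file handle must agree as well. */
static int new_match(const struct blocked_lock_sketch *a,
                     const struct blocked_lock_sketch *b)
{
    return old_match(a, b) &&
           a->server_addr == b->server_addr &&
           a->fh_len == b->fh_len &&
           memcmp(a->fh, b->fh, a->fh_len) == 0;
}
```

    Two whole-file locks taken by the same pid on different files match under
    old_match() but not under new_match().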
     
  • This patch reverts commit f93ea411b73594f7d144855fd34278bcf34a9afc:
    [PATCH] jbd: split checkpoint lists

    This broke journal_flush() for OCFS2, which is its method of being sure
    that metadata is sent to disk for another node.

    And two related commits 8d3c7fce2d20ecc3264c8d8c91ae3beacdeaed1b and
    43c3e6f5abdf6acac9b90c86bf03f995bf7d3d92 with the subjects:
    [PATCH] jbd: log_do_checkpoint fix
    [PATCH] jbd: remove_transaction fix

    These seem to be incremental bugfixes on the original patch and as such are
    no longer needed.

    Signed-off-by: Mark Fasheh
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Fasheh
     

12 Feb, 2006

3 commits

  • Add support for Geforce4 MX 4000 (0x185)

    Signed-off-by: Antonino Daplas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Antonino A. Daplas
     
  • With David Woodhouse

    select() presently has a habit of increasing the value of the user's
    `timeout' argument on return.

    We were writing back a timeout larger than the original. We _deliberately_
    round up, since we know we must wait at _least_ as long as the caller asks
    us to.

    The patch adds a couple of helper functions for magnitude comparison of
    timespecs and of timevals, and uses them to prevent the various poll and
    select functions from returning a timeout which is larger than the one which
    was passed in.

    The patch also fixes a bug in compat_sys_pselect7(): it was adding the new
    timeout value to the old one and was returning that. It should just return
    the new timeout value.

    (We have various handy timespec/timeval-to-from-nsec conversion functions in
    time.h. But this code open-codes it all).

    Cc: "David S. Miller"
    Cc: Andi Kleen
    Cc: Ulrich Drepper
    Cc: Thomas Gleixner
    Cc: george anzinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
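
    The magnitude comparison and the clamping it enables can be sketched for
    timevals as below. The helper names are hypothetical and both values are
    assumed to be normalized (0 <= tv_usec < 1000000).

```c
#include <sys/time.h>   /* struct timeval */

/* Return <0, 0 or >0 when a is smaller than, equal to or larger than b. */
static int timeval_cmp_sketch(const struct timeval *a,
                              const struct timeval *b)
{
    if (a->tv_sec != b->tv_sec)
        return a->tv_sec < b->tv_sec ? -1 : 1;
    if (a->tv_usec != b->tv_usec)
        return a->tv_usec < b->tv_usec ? -1 : 1;
    return 0;
}

/* Value to write back to the caller: the remaining time, but never
 * more than what was originally requested. */
static struct timeval timeout_to_return(struct timeval requested,
                                        struct timeval remaining)
{
    return timeval_cmp_sketch(&remaining, &requested) > 0 ? requested
                                                          : remaining;
}
```

    If rounding up turned a 1.500000 s request into 1.504000 s remaining, the
    caller gets 1.500000 s back; a smaller remainder is returned unchanged.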
     
  • The *at patches introduced fstatat and, due to insufficient research, I
    used the newfstat functions generally as the guideline. The result is that
    on 32-bit platforms we don't have all the information needed to implement
    fstatat64.

    This patch modifies the code to pass up 64-bit information if
    __ARCH_WANT_STAT64 is defined. I renamed the syscall entry point to make
    this clear. Other archs will continue to use the existing code. On x86-64
    the compat code is implemented using a new sys32_ function. This is what
    is done for the other stat syscalls as well.

    This patch might break some other archs (those which define
    __ARCH_WANT_STAT64 and which already wired up the syscall). Yet others
    might need changes to accommodate the compatibility mode. I really don't
    want to do that work because all this stat handling is a mess (more so in
    glibc, but the kernel is also affected). It should be done by the arch
    maintainers. I'll provide some stand-alone test shortly. Those who are
    eager could compile glibc and run 'make check' (no installation needed).

    The patch below has been tested on x86 and x86-64.

    Signed-off-by: Ulrich Drepper
    Cc: Christoph Hellwig
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper