20 Jul, 2007

40 commits

  • Signed-off-by: Josef 'Jeff' Sipek
    Cc: Al Viro
    Acked-by: Christoph Hellwig
    Cc: Trond Myklebust
    Cc: Neil Brown
    Cc: Michael Halcrow
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef 'Jeff' Sipek
     
  • Signed-off-by: Josef 'Jeff' Sipek
    Cc: Al Viro
    Acked-by: Christoph Hellwig
    Cc: Trond Myklebust
    Cc: Neil Brown
    Cc: Michael Halcrow
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef 'Jeff' Sipek
     
  • use vfs_path_lookup instead of open-coding the necessary functionality.

    Signed-off-by: Josef 'Jeff' Sipek
    Acked-by: NeilBrown
    Cc: Al Viro
    Acked-by: Christoph Hellwig
    Cc: Trond Myklebust
    Cc: Michael Halcrow
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef 'Jeff' Sipek
     
  • use vfs_path_lookup instead of open-coding the necessary functionality.

    Signed-off-by: Josef 'Jeff' Sipek
    Acked-by: Trond Myklebust
    Cc: Al Viro
    Acked-by: Christoph Hellwig
    Cc: Neil Brown
    Cc: Michael Halcrow
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef 'Jeff' Sipek
     
  • Stackable file systems, among others, frequently need to look up paths or
    path components starting from an arbitrary point in the namespace
    (identified by a dentry and a vfsmount). Currently, such file systems use
    lookup_one_len, which is frowned upon [1] as it does not pass the lookup
    intent along; not passing a lookup intent, for example, can trigger BUG_ON's
    when stacking on top of NFSv4.

    The first patch introduces a new lookup function to allow lookup starting
    from an arbitrary point in the namespace. This approach has been suggested
    by Christoph Hellwig [2].

    The second patch changes sunrpc to use vfs_path_lookup.

    The third patch changes nfsctl.c to use vfs_path_lookup.

    The fourth patch marks link_path_walk static.

    The fifth, and last patch, unexports path_walk because it is no longer
    necessary to call it directly, and using the new vfs_path_lookup is
    cleaner.

    For example, the following snippet of code looks up "some/path/component"
    in a directory pointed to by parent_{dentry,vfsmnt}:

        err = vfs_path_lookup(parent_dentry, parent_vfsmnt,
                              "some/path/component", 0, &nd);
        if (!err) {
                /* exists */

                ...

                /* once done, release the references */
                path_release(&nd);
        } else if (err == -ENOENT) {
                /* doesn't exist */
        } else {
                /* other error */
        }

    VFS functions such as lookup_create can be used on the nameidata structure
    to pass the create intent to the file system.

    Signed-off-by: Josef 'Jeff' Sipek
    Cc: Al Viro
    Acked-by: Christoph Hellwig
    Cc: Trond Myklebust
    Cc: Neil Brown
    Cc: Michael Halcrow
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef 'Jeff' Sipek
     
  • Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
    the old mm into the new mm.

    We create the new mm before the binfmt code runs, and place the new stack at
    the very top of the address space. Once the binfmt code runs and figures out
    where the stack should be, we move it downwards.

    It is a bit peculiar in that we have one task with two mm's, one of which is
    inactive.

    [a.p.zijlstra@chello.nl: limit stack size]
    Signed-off-by: Ollie Wild
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Hugh Dickins
    [bunk@stusta.de: unexport bprm_mm_init]
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ollie Wild
     
  • The purpose of audit_bprm() is to log the argv array to a userspace daemon at
    the end of the execve system call. Since user-space hasn't had time to run,
    this array is still in pristine state on the process' stack; so no need to
    copy it, we can just grab it from there.

    In order to minimize the damage to audit_log_*() copy each string into a
    temporary kernel buffer first.

    Currently the audit code requires that the full argument vector fits in a
    single packet. So currently it does clip the argv size to a (sysctl) limit,
    but only when execve auditing is enabled.

    If the audit protocol gets extended to allow for multiple packets this check
    can be removed.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ollie Wild
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • New arch macro STACK_TOP_MAX gives the largest valid stack address for the
    architecture in question.

    It differs from STACK_TOP in that it will not distinguish between
    personalities but will always return the largest possible address.

    This is used to create the initial stack on execve, which we will move down to
    the proper location once the binfmt code has figured out where that is.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ollie Wild
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Currently most of the per cpu data, which is accessed by different cpus,
    has a ____cacheline_aligned_in_smp attribute. Move all this data to the
    new per cpu shared data section: .data.percpu.shared_aligned.

    This will separate the percpu data which is referenced frequently by other
    cpus from the local only percpu data.

    Signed-off-by: Fenghua Yu
    Acked-by: Suresh Siddha
    Cc: Rusty Russell
    Cc: Christoph Lameter
    Cc: "Luck, Tony"
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fenghua Yu
     
  • The per cpu data section contains two types of data. One set which is
    exclusively accessed by the local cpu and the other set which is per cpu,
    but also shared by remote cpus. In the current kernel, these two sets are
    not clearly separated out. This can potentially cause the same data
    cacheline to be shared between the two sets of data, which will result in
    unnecessary bouncing of the cacheline between cpus.

    One way to fix the problem is to cacheline align the remotely accessed per
    cpu data, both at the beginning and at the end. Because of the padding at
    both ends, this will likely cause some memory wastage and also the
    interface to achieve this is not clean.

    This patch:

    Moves the remotely accessed per cpu data (which is currently marked
    as ____cacheline_aligned_in_smp) into a different section, where all the data
    elements are cacheline aligned. And as such, this differentiates the local
    only data and remotely accessed data cleanly.
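
    A minimal userspace sketch of the layout idea, not the kernel's linker
    section machinery: when every element is itself cacheline aligned, elements
    placed back to back in a dedicated area each own whole cachelines, so no
    extra padding is needed around the area. The 64-byte line size, the struct
    names and the array standing in for the section are all assumptions made
    for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_BYTES 64 /* assumed line size for this sketch */

/* Local-only data: no alignment constraint, packs tightly. */
struct local_counter {
    unsigned long hits;
};

/* Remotely accessed data: the aligned member pads the struct to a full line. */
struct shared_counter {
    _Alignas(CACHELINE_BYTES) unsigned long hits;
};

/* Stand-in for the dedicated section: aligned elements placed back to back. */
static struct shared_counter shared_section[4];

/* Index of the cacheline that an address falls into. */
static uintptr_t cacheline_of(const void *p)
{
    return (uintptr_t)p / CACHELINE_BYTES;
}
```

    Because sizeof(struct shared_counter) is a whole number of cachelines,
    neighbouring elements never share a line with each other or with local-only
    data placed elsewhere.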

    Signed-off-by: Fenghua Yu
    Acked-by: Suresh Siddha
    Cc: Rusty Russell
    Cc: Christoph Lameter
    Cc:
    Cc: "Luck, Tony"
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fenghua Yu
     
  • I realise jprobes are a razor-blades-included type of interface, but that
    doesn't mean we can't try and make them safer to use. This guy I know once
    wrote code like this:

    struct jprobe jp = { .kp.symbol_name = "foo", .entry = "jprobe_foo" };

    And then his kernel exploded. Oops.

    This patch adds an arch hook, arch_deref_entry_point() (I don't like it
    either) which takes the void * in a struct jprobe, and gives back the text
    address that it represents.

    We can then use that in register_jprobe() to check that the entry point we're
    passed is actually in the kernel text, rather than just some random value.
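
    The check described above can be sketched in userspace roughly as follows.
    This is an illustration only: struct text_range, deref_entry_point_sketch()
    and check_jprobe_entry_sketch() are hypothetical names, and the text bounds
    are passed in explicitly rather than taken from the kernel's own symbols.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical description of the text section: [start, end). */
struct text_range {
    uintptr_t start;
    uintptr_t end;
};

/*
 * Stand-in for the arch hook. In this sketch the entry pointer already is
 * the text address; an arch using function descriptors would look inside
 * the descriptor here instead.
 */
static uintptr_t deref_entry_point_sketch(const void *entry)
{
    return (uintptr_t)entry;
}

/* Returns 0 if the entry point lies in the text range, -EINVAL otherwise. */
static int check_jprobe_entry_sketch(const struct text_range *text,
                                     const void *entry)
{
    uintptr_t addr = deref_entry_point_sketch(entry);

    if (addr < text->start || addr >= text->end)
        return -EINVAL;
    return 0;
}
```

    With such a check, a pointer to a string literal (as in the example above)
    is rejected at registration time as long as it falls outside the text
    range.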

    Signed-off-by: Michael Ellerman
    Cc: Prasanna S Panchamukhi
    Acked-by: Ananth N Mavinakayanahalli
    Cc: Anil S Keshavamurthy
    Cc: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Ellerman
     
  • AFAICT now that jprobe.entry is a void *, JPROBE_ENTRY doesn't do anything
    useful - so remove it ..

    I've left a do-nothing version so that out-of-tree jprobes code will still
    compile without modifications.

    Signed-off-by: Michael Ellerman
    Cc: Prasanna S Panchamukhi
    Acked-by: Ananth N Mavinakayanahalli
    Cc: Anil S Keshavamurthy
    Cc: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Ellerman
     
  • Currently jprobe.entry is a kprobe_opcode_t *, but that's a lie. On some
    platforms it doesn't point to an opcode at all, it points to a function
    descriptor.

    It's really a pointer to something that the arch code can turn into a function
    entry point. And that's what actually happens, none of the generic code ever
    looks at jprobe.entry, it's only ever dereferenced by arch code.

    So just make it a void *.

    Signed-off-by: Michael Ellerman
    Cc: Prasanna S Panchamukhi
    Acked-by: Ananth N Mavinakayanahalli
    Cc: Anil S Keshavamurthy
    Cc: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Ellerman
     
  • Rename some file_ra_state variables and remove some accessors.

    It results in much simpler code.
    Kudos to Rusty!

    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Split ondemand readahead interface into two functions. I think this makes it
    a little clearer for non-readahead experts (like Rusty).

    Internally they both call ondemand_readahead(), but the page argument is
    changed to an obvious boolean flag.
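
    A rough userspace sketch of the shape of this split. The wrapper names and
    the recorded-call struct are hypothetical; only the idea of two thin entry
    points around one worker taking a boolean comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* What the common worker was asked to do, recorded for inspection. */
struct ra_call {
    bool hit_readahead_marker; /* false: cache miss, true: lookahead page hit */
    unsigned long offset;
    unsigned long req_size;
};

static struct ra_call last_call;

/* Common worker, standing in for ondemand_readahead(). */
static void ondemand_readahead_sketch(bool hit_readahead_marker,
                                      unsigned long offset,
                                      unsigned long req_size)
{
    last_call.hit_readahead_marker = hit_readahead_marker;
    last_call.offset = offset;
    last_call.req_size = req_size;
}

/* Entry point for a page that is missing from the cache. */
static void sync_readahead_sketch(unsigned long offset, unsigned long req_size)
{
    ondemand_readahead_sketch(false, offset, req_size);
}

/* Entry point for a cached page that carries the lookahead mark. */
static void async_readahead_sketch(unsigned long offset, unsigned long req_size)
{
    ondemand_readahead_sketch(true, offset, req_size);
}
```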

    Signed-off-by: Rusty Russell
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rusty Russell
     
  • Share the same page flag bit for PG_readahead and PG_reclaim.

    One is used only on file reads, the other only for emergency writes. One
    is used mostly for fresh/young pages, the other for old pages.

    Combinations of possible interactions are:

    a) clear PG_reclaim => implicit clear of PG_readahead
       it will delay an asynchronous readahead into a synchronous one
       it actually does _good_ for readahead:
       the pages will be reclaimed soon, it's readahead thrashing!
       in this case, synchronous readahead makes more sense.

    b) clear PG_readahead => implicit clear of PG_reclaim
       one (and only one) page will not be reclaimed in time
       it can be avoided by checking PageWriteback(page) in readahead first

    c) set PG_reclaim => implicit set of PG_readahead
       will confuse readahead and make it restart the size rampup process
       it's a trivial problem, and can mostly be avoided by checking
       PageWriteback(page) first in readahead

    d) set PG_readahead => implicit set of PG_reclaim
       PG_readahead will never be set on already cached pages.
       PG_reclaim will always be cleared on dirtying a page.
       so not a problem.

    In summary,
    a)   we get better behavior
    b,d) possible interactions can be avoided
    c)   racy condition exists that might affect readahead, but the chance
         is _really_ low, and the hurt on readahead is trivial.

    Compound pages also use PG_reclaim, but for now they do not interact with
    reclaim/readahead code.
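
    The implicit set/clear interactions listed above follow directly from two
    names referring to one bit. A small userspace sketch, where the bit number
    and the helper names are hypothetical and only the aliasing of
    PG_readahead to PG_reclaim is taken from the description:

```c
#include <assert.h>
#include <stdbool.h>

#define PG_reclaim   17         /* hypothetical bit number */
#define PG_readahead PG_reclaim /* both names refer to the same bit */

struct page_sketch {
    unsigned long flags;
};

static void set_page_flag(struct page_sketch *p, int bit)
{
    p->flags |= 1UL << bit;
}

static void clear_page_flag(struct page_sketch *p, int bit)
{
    p->flags &= ~(1UL << bit);
}

static bool test_page_flag(const struct page_sketch *p, int bit)
{
    return (p->flags >> bit) & 1UL;
}
```

    Setting PG_reclaim makes a PG_readahead test succeed (case c), and clearing
    PG_readahead also clears PG_reclaim (case b).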

    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Pass real splice size to page_cache_readahead_ondemand().

    The splice code works in chunks of 16 pages internally. The readahead code
    should be told of the overall splice size, instead of the internal chunk size.
    Otherwise bad things may happen. Imagine some 17-page random splice reads.
    The code before this patch will result in two readahead calls: readahead(16);
    readahead(1); That leads to one 16-page I/O and one 32-page I/O: one extra I/O
    and 31 readahead miss pages.

    Signed-off-by: Fengguang Wu
    Cc: Jens Axboe
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Move synchronous page_cache_readahead_ondemand() call out of splice loop.

    This avoids one pointless page allocation/insertion in case of non-zero
    ra_pages, or many pointless readahead calls in case of zero ra_pages.

    Note that if a user sets ra_pages to less than PIPE_BUFFERS=16 pages, he will
    not get expected readahead behavior anyway. The splice code works in batches
    of 16 pages, which can be taken as another form of synchronous readahead.

    Signed-off-by: Fengguang Wu
    Cc: Jens Axboe
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Remove the old readahead algorithm.

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Convert ext3/ext4 dir reads to use on-demand readahead.

    Readahead for dirs operates _not_ on file level, but on blockdev level. This
    makes a difference when the data blocks are not contiguous. And the read
    routine is somewhat opaque: there's no handy info about the status of the current
    page. So a simplified call scheme is employed: to call into readahead
    whenever the current page falls out of readahead windows.
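
    A minimal sketch of that simplified call scheme, in userspace C. The
    struct and function names are hypothetical; the window is modelled as a
    start index plus a size in pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the current readahead window: [start, start + size). */
struct ra_window_sketch {
    unsigned long start;
    unsigned long size;
};

/* True if the page index lies inside the current readahead window. */
static bool ra_window_has_index(const struct ra_window_sketch *ra,
                                unsigned long index)
{
    return index >= ra->start && index < ra->start + ra->size;
}

/* The directory read path calls into readahead only when this is true. */
static bool dir_read_wants_readahead(const struct ra_window_sketch *ra,
                                     unsigned long index)
{
    return !ra_window_has_index(ra, index);
}
```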

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Convert splice reads to use on-demand readahead.

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Jens Axboe
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Convert filemap reads to use on-demand readahead.

    The new call scheme is to
    - call readahead on non-cached page
    - call readahead on look-ahead page
    - update prev_index when finished with the read request

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • This is a minimal readahead algorithm that aims to replace the current one.
    It is more flexible and reliable, while maintaining almost the same behavior
    and performance. Also it is fully integrated with adaptive readahead.

    It is designed to be called on demand:
    - on a missing page, to do synchronous readahead
    - on a lookahead page, to do asynchronous readahead

    In this way it eliminates the awkward workarounds for cache hit/miss,
    readahead thrashing, retried read, and unaligned read. It also adopts the
    data structure introduced by adaptive readahead, parameterizes readahead
    pipelining with `lookahead_index', and reduces the current/ahead windows to
    one single window.

    HEURISTICS

    The logic deals with four cases:

    - sequential-next
        found a consistent readahead window, so push it forward

    - random
        standalone small read, so read as is

    - sequential-first
        create a new readahead window for a sequential/oversize request

    - lookahead-clueless
        hit a lookahead page not associated with the readahead window,
        so create a new readahead window and ramp it up

    In each case, three parameters are determined:

    - readahead index: where the next readahead begins
    - readahead size: how much to readahead
    - lookahead size: when to do the next readahead (for pipelining)
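
    One plausible reading of the four-way case split, written as a userspace
    classifier. This is a simplified sketch and not the patch's code: the
    struct, the enum and the exact tests (in particular treating a read as
    sequential when it starts at or right after the previous index, or when it
    is larger than the maximum window) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum ra_case {
    RA_SEQUENTIAL_NEXT,
    RA_RANDOM,
    RA_SEQUENTIAL_FIRST,
    RA_LOOKAHEAD_CLUELESS
};

/* Hypothetical subset of the readahead state needed for classification. */
struct ra_state_sketch {
    unsigned long prev_index;      /* last page of the previous read */
    unsigned long readahead_index; /* where the next readahead begins */
    unsigned long lookahead_index; /* page that triggers the next readahead */
    unsigned long max_pages;       /* maximum readahead window size */
};

static enum ra_case classify_read(const struct ra_state_sketch *ra,
                                  unsigned long offset,
                                  unsigned long req_size,
                                  bool hit_lookahead_page)
{
    /* Unsigned subtraction: a backwards seek wraps and is not sequential. */
    bool sequential = (offset - ra->prev_index <= 1UL) ||
                      (req_size > ra->max_pages);

    /* The read lands where the current window expects it: push it forward. */
    if (offset != 0 && (offset == ra->lookahead_index ||
                        offset == ra->readahead_index))
        return RA_SEQUENTIAL_NEXT;

    /* Standalone small read on a missing page: read as is. */
    if (!hit_lookahead_page && !sequential)
        return RA_RANDOM;

    /* Lookahead page hit that the window state does not explain. */
    if (hit_lookahead_page)
        return RA_LOOKAHEAD_CLUELESS;

    /* Sequential or oversize miss: start a new window. */
    return RA_SEQUENTIAL_FIRST;
}
```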

    BEHAVIORS

    The old behaviors are maximally preserved for trivial sequential/random reads.
    Notable changes are:

    - It no longer imposes strict sequential checks.
      It might help some interleaved cases, and clustered random reads.
      It does introduce risks of a random lookahead hit triggering an
      unexpected readahead. But in general it is more likely to do good
      than to do evil.

    - Interleaved reads are supported in a minimal way.
      Their chances of being detected and properly handled are still low.

    - Readahead thrashings are better handled.
      The current readahead leads to tiny average I/O sizes, because it
      never turns back for the thrashed pages. They have to be faulted in
      by do_generic_mapping_read() one by one. Whereas the on-demand
      readahead will redo readahead for them.

    OVERHEADS

    The new code reduces the overheads of

    - excessively calling the readahead routine on small sized reads
    (the current readahead code insists on seeing all requests)

    - doing a lot of pointless page-cache lookups for small cached files
    (the current readahead only turns itself off after 256 cache hits,
    unfortunately most files are < 1MB, so never see that chance)

    That accounts for speedup of
    - 0.3% on 1-page sequential reads on sparse file
    - 1.2% on 1-page cache hot sequential reads
    - 3.2% on 256-page cache hot sequential reads
    - 1.3% on cache hot `tar /lib`

    However, it does introduce one extra page-cache lookup per cache miss, which
    impacts random reads slightly. That's a 1% overhead for 1-page random reads
    on a sparse file.

    PERFORMANCE

    The basic benchmark setup is
    - 2.6.20 kernel with on-demand readahead
    - 1MB max readahead size
    - 2.9GHz Intel Core 2 CPU
    - 2GB memory
    - 160G/8M Hitachi SATA II 7200 RPM disk

    The benchmarks show that
    - it maintains the same performance for trivial sequential/random reads
    - sysbench/OLTP performance on MySQL gains up to 8%
    - performance on readahead thrashing gains up to 3 times

    iozone throughput (KB/s): roughly the same
    ==========================================
    iozone -c -t1 -s 4096m -r 64k

                              2.6.20   on-demand     gain
    first run
    " Initial write "       61437.27    64521.53    +5.0%
    " Rewrite "             47893.02    48335.20    +0.9%
    " Read "                62111.84    62141.49    +0.0%
    " Re-read "             62242.66    62193.17    -0.1%
    " Reverse Read "        50031.46    49989.79    -0.1%
    " Stride read "          8657.61     8652.81    -0.1%
    " Random read "         13914.28    13898.23    -0.1%
    " Mixed workload "      19069.27    19033.32    -0.2%
    " Random write "        14849.80    14104.38    -5.0%
    " Pwrite "              62955.30    65701.57    +4.4%
    " Pread "               62209.99    62256.26    +0.1%

    second run
    " Initial write "       60810.31    66258.69    +9.0%
    " Rewrite "             49373.89    57833.66   +17.1%
    " Read "                62059.39    62251.28    +0.3%
    " Re-read "             62264.32    62256.82    -0.0%
    " Reverse Read "        49970.96    50565.72    +1.2%
    " Stride read "          8654.81     8638.45    -0.2%
    " Random read "         13901.44    13949.91    +0.3%
    " Mixed workload "      19041.32    19092.04    +0.3%
    " Random write "        14019.99    14161.72    +1.0%
    " Pwrite "              64121.67    68224.17    +6.4%
    " Pread "               62225.08    62274.28    +0.1%

    In summary, writes are unstable, reads are pretty close on average:

    access pattern            2.6.20   on-demand     gain
    Read                    62085.61    62196.38    +0.2%
    Re-read                 62253.49    62224.99    -0.0%
    Reverse Read            50001.21    50277.75    +0.6%
    Stride read              8656.21     8645.63    -0.1%
    Random read             13907.86    13924.07    +0.1%
    Mixed workload          19055.29    19062.68    +0.0%
    Pread                   62217.53    62265.27    +0.1%

    aio-stress: roughly the same
    ============================
    aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
    aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso

                      2.6.20   on-demand    delta
    sequential        92.57s      92.54s    -0.0%
    random           311.87s     312.15s    +0.1%

    sysbench fileio: roughly the same
    =================================
    sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
    --file-total-size=4G --file-block-size=64K \
    --num-threads=001 --max-requests=10000 --max-time=900 run

    threads      2.6.20   on-demand    delta
    first run
          1    59.1974s    59.2262s    +0.0%
          2    58.0575s    58.2269s    +0.3%
          4    48.0545s    47.1164s    -2.0%
          8    41.0684s    41.2229s    +0.4%
         16    35.8817s    36.4448s    +1.6%
         32    32.6614s    32.8240s    +0.5%
         64    23.7601s    24.1481s    +1.6%
        128    24.3719s    23.8225s    -2.3%
        256    23.2366s    22.0488s    -5.1%

    second run
          1    59.6720s    59.5671s    -0.2%
          8    41.5158s    41.9541s    +1.1%
         64    25.0200s    23.9634s    -4.2%
        256    22.5491s    20.9486s    -7.1%

    Note that the numbers are not very stable because of the writes.
    The overall performance is close when we sum all seconds up:

    sum all up 495.046s 491.514s -0.7%

    sysbench oltp (trans/sec): up to 8% gain
    ========================================
    sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
    --mysql-socket=/var/run/mysqld/mysqld.sock \
    --mysql-user=root --mysql-password=readahead \
    --num-threads=064 --max-requests=10000 --max-time=900 run

    10000-transactions run
    threads      2.6.20   on-demand     gain
          1       62.81       64.56    +2.8%
          2       67.97       70.93    +4.4%
          4       81.81       85.87    +5.0%
          8       94.60       97.89    +3.5%
         16       99.07      104.68    +5.7%
         32       95.93      104.28    +8.7%
         64       96.48      103.68    +7.5%
    5000-transactions run
          1       48.21       48.65    +0.9%
          8       68.60       70.19    +2.3%
         64       70.57       74.72    +5.9%
    2000-transactions run
          1       37.57       38.04    +1.3%
          2       38.43       38.99    +1.5%
          4       45.39       46.45    +2.3%
          8       51.64       52.36    +1.4%
         16       54.39       55.18    +1.5%
         32       52.13       54.49    +4.5%
         64       54.13       54.61    +0.9%

    Those are interesting results. Some investigations show that
    - MySQL is accessing the db file non-uniformly: some parts are
      hotter than others
    - It is mostly doing 4-page random reads, and sometimes doing two
      reads in a row; the latter one triggers a 16-page readahead.
    - The on-demand readahead leaves many lookahead pages (flagged
      PG_readahead) there. Many of them will be hit, and trigger
      more readahead pages, which might save more seeks.
    - Naturally, the readahead windows tend to lie in hot areas,
      and the lookahead pages in hot areas are more likely to be hit.
    - The higher the overall read density, the larger the possible gain.

    That also explains the adaptive readahead tricks for clustered random reads.

    readahead thrashing: 3 times better
    ===================================
    We boot the kernel with "mem=128m single", and start one 100KB/s stream
    every second, until reaching 200 streams.

                  max throughput    min avg I/O size
    2.6.20:                5MB/s                16KB
    on-demand:            15MB/s               140KB

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Extend struct file_ra_state to support the on-demand readahead logic. Also
    define some helpers for it.

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Define two convenient macros for read-ahead:
    - MAX_RA_PAGES: rounded down counterpart of VM_MAX_READAHEAD
    - MIN_RA_PAGES: rounded _up_ counterpart of VM_MIN_READAHEAD

    Note that the rounded up MIN_RA_PAGES will work flawlessly with _large_
    page sizes like 64k.
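
    A small worked sketch of the rounding difference. It is written as
    functions of the page size so both 4k and 64k pages can be tried; the
    values 128 and 16 (in kbytes) for VM_MAX_READAHEAD and VM_MIN_READAHEAD are
    assumptions here, as are the function names.

```c
#include <assert.h>

#define VM_MAX_READAHEAD_KB 128UL /* assumed value, in kbytes */
#define VM_MIN_READAHEAD_KB 16UL  /* assumed value, in kbytes */

/* Rounded-down page count for the maximum readahead size. */
static unsigned long max_ra_pages(unsigned long page_size)
{
    return VM_MAX_READAHEAD_KB * 1024 / page_size;
}

/* Rounded-up page count for the minimum readahead size. */
static unsigned long min_ra_pages(unsigned long page_size)
{
    return (VM_MIN_READAHEAD_KB * 1024 + page_size - 1) / page_size;
}
```

    With 64k pages a rounded-down minimum would be 0 pages, while the
    rounded-up version still yields 1 page.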

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Add look-ahead support to __do_page_cache_readahead().

    It works by
    - marking the Nth backwards page with PG_readahead,
      (which instructs the page's first reader to invoke readahead)
    - and only doing the marking for newly allocated pages.
      (to prevent blindly doing readahead on already cached pages)

    Look-ahead is a technique to achieve I/O pipelining:

    While the application is working through a chunk of cached pages, the kernel
    reads ahead the next chunk of pages _before_ time of need. It effectively
    hides low-level I/O latencies from high-level applications.
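
    The marking rule can be sketched in userspace as below. The page array,
    the struct and the function name are hypothetical stand-ins for the page
    cache; "Nth backwards page" is read here as the page lookahead_size pages
    before the end of the chunk being read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one page-cache slot. */
struct page_slot {
    bool cached;         /* page already present in the cache */
    bool readahead_mark; /* stands in for PG_readahead */
};

/*
 * Bring nr_to_read pages into the "cache". The page lookahead_size pages
 * before the end of the chunk gets the mark, but only if it was newly
 * allocated by this call. Returns the number of newly allocated pages.
 */
static size_t readahead_chunk_sketch(struct page_slot *pages,
                                     size_t nr_to_read,
                                     size_t lookahead_size)
{
    size_t allocated = 0;

    for (size_t i = 0; i < nr_to_read; i++) {
        if (pages[i].cached)
            continue; /* already cached: never marked */
        pages[i].cached = true;
        allocated++;
        if (i == nr_to_read - lookahead_size)
            pages[i].readahead_mark = true;
    }
    return allocated;
}
```

    The first reader to touch the marked page would then start the next
    readahead while the remaining lookahead_size pages are still unread.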

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Introduce a new page flag: PG_readahead.

    It acts as a look-ahead mark, which tells the page reader: Hey, it's time to
    invoke the read-ahead logic. For the sake of I/O pipelining, don't wait until
    it runs out of cached pages!

    Signed-off-by: Fengguang Wu
    Cc: Steven Pratt
    Cc: Ram Pai
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Fix type issue reported by latest 'sparse': kiocb.ki_flags should be
    "unsigned long" (not "long"), to match bitop type signature.

    Signed-off-by: David Brownell
    Signed-off-by: Benjamin LaHaise
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Brownell
     
  • There is another bug recently introduced into the ecryptfs_setattr()
    function in 2.6.22. eCryptfs will attempt to treat special files like
    regular eCryptfs files on chmod, chown, and so forth. This leads to a NULL
    pointer dereference. This patch validates that the file is a regular file
    before proceeding with operations related to the inode's crypt_stat.

    Thanks to Ryusuke Konishi for finding this bug and suggesting the fix.

    Signed-off-by: Michael Halcrow
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Halcrow
     
  • MBCS has a collection of things that searches say are not used elsewhere
    and could be static. If this is the case they should be static, if not
    then someone at SGI should rename things like "soft_list" so they don't
    pollute the global namespace with generic names...

    Signed-off-by: Alan Cox
    Acked-by: Bruce Losure
    Cc: Jes Sorensen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Cox
     
  • Optimize show_stat to collect per-irq information just once.

    On x86_64, with newer kernel versions, kstat_irqs is a bit of a problem.
    On every call to kstat_irqs, the process brings in per-cpu data from all
    online cpus. Doing this for NR_IRQS, which is now 256 + 32 * NR_CPUS
    results in (256+32*63) * 63 remote cpu references on a 64 cpu config.
    Considering the fact that we already compute this value per-cpu, we can
    save on the remote references as below.
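
    The idea, sketched in userspace with a plain 2D array standing in for the
    per-cpu counters. The sizes and names are hypothetical; the point is that
    the per-irq sums are accumulated during the one pass over all cpus instead
    of rescanning every cpu for each irq afterwards.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4
#define NR_IRQS_SKETCH 8

/* Stand-in for the per-cpu interrupt counters. */
static unsigned int irq_count[NR_CPUS_SKETCH][NR_IRQS_SKETCH];

/* Old pattern: every per-irq query walks all cpus again. */
static unsigned int irq_sum_by_rescan(int irq)
{
    unsigned int sum = 0;

    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        sum += irq_count[cpu][irq];
    return sum;
}

/* New pattern: one pass fills per_irq_sum[] and returns the grand total. */
static unsigned long collect_once(unsigned int per_irq_sum[NR_IRQS_SKETCH])
{
    unsigned long total = 0;

    for (int irq = 0; irq < NR_IRQS_SKETCH; irq++)
        per_irq_sum[irq] = 0;

    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
        for (int irq = 0; irq < NR_IRQS_SKETCH; irq++) {
            per_irq_sum[irq] += irq_count[cpu][irq];
            total += irq_count[cpu][irq];
        }
    }
    return total;
}
```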

    Signed-off-by: Alok N Kataria
    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • Clarify that drivers using the GPIO operations don't need to issue io
    barrier instructions themselves. Previously this wasn't clear, and at
    least one platform assumed otherwise (and would thus break various
    otherwise-portable drivers which don't issue barriers).

    Signed-off-by: David Brownell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Brownell
     
  • unregister_chrdev() does not return meaningful value. This patch makes it
    return void like most unregister_* functions.

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • unregister_chrdev() always returns 0. There is no need to check the return
    value.

    Signed-off-by: Akinobu Mita
    Cc: "David S. Miller"
    Cc: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This patch converts UDF coding style to kernel coding style using Lindent.

    Signed-off-by: Cyrill Gorcunov
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • I guess it is time to clarify that suspend and hibernation are separate
    things, and add Rafael as a maintainer. Plus, people blame us for suspend
    problems, anyway, I guess it is fair to mark us as suspend maintainers,
    too.

    Signed-off-by: Pavel Machek
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Machek
     
  • Move "debug during resume from s2ram" into the variable we already use
    for real-mode flags to simplify code. It also closes a nasty trap for
    the user in acpi_sleep_setup; order of parameters actually mattered there,
    acpi_sleep=s3_bios,s3_mode doing something different from
    acpi_sleep=s3_mode,s3_bios.
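
    An order-independent way to parse such a comma-separated option string is
    to OR each recognised keyword into one flags word. A userspace sketch; the
    flag values and the function name are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_FLAG_S3_BIOS 1u /* hypothetical bit assignments */
#define SKETCH_FLAG_S3_MODE 2u

/* Parse "opt1,opt2,..." into a flags word; unknown options are ignored. */
static unsigned int parse_acpi_sleep_sketch(const char *str)
{
    unsigned int flags = 0;

    while (str && *str) {
        if (strncmp(str, "s3_bios", 7) == 0)
            flags |= SKETCH_FLAG_S3_BIOS;
        if (strncmp(str, "s3_mode", 7) == 0)
            flags |= SKETCH_FLAG_S3_MODE;

        str = strchr(str, ','); /* NULL after the last option */
        if (str)
            str++;              /* skip the comma */
    }
    return flags;
}
```

    Since every keyword only sets its own bit, "s3_bios,s3_mode" and
    "s3_mode,s3_bios" produce the same result.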

    Signed-off-by: Pavel Machek
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Machek
     
  • Add a feature allowing the user to make the system beep during a resume from
    suspend to RAM, on x86_64 and i386.

    This is useful for the users with broken resume from RAM, so that they can
    verify if the control reaches the kernel after a wake-up event.

    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nigel Cunningham
     
  • Introduce the pm_power_off_prepare() callback that can be registered by the
    interested platforms in analogy with pm_idle() and pm_power_off(), used for
    preparing the system to power off (needed by ACPI).

    This allows us to drop acpi_sysclass and device_acpi that are only defined in
    order to register the ACPI power off preparation callback, which is needed by
    pm_power_off() registered in a much different way.
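
    The callback pattern, sketched in userspace with plain function pointers.
    The hook variables, the example callbacks and the call log are
    hypothetical; they only illustrate an optional "prepare" hook being
    invoked ahead of the power-off hook.

```c
#include <assert.h>
#include <stddef.h>

/* Hooks a platform may register; NULL means "nothing to do". */
static void (*power_off_prepare_hook)(void);
static void (*power_off_hook)(void);

/* Call log so the ordering can be inspected. */
static char call_log[8];
static size_t call_count;

static void record_call(char tag)
{
    if (call_count < sizeof(call_log))
        call_log[call_count++] = tag;
}

/* Example platform callbacks (hypothetical). */
static void example_prepare(void)
{
    record_call('P');
}

static void example_power_off(void)
{
    record_call('O');
}

/* Generic power-off path: run the prepare hook first, then power off. */
static void power_off_sketch(void)
{
    if (power_off_prepare_hook)
        power_off_prepare_hook();
    if (power_off_hook)
        power_off_hook();
}
```

    A platform registers its callbacks by assigning the two pointers, with no
    device or class object needed just to carry the preparation step.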

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Since we are now explicitly calling hibernation_ops->prepare() before
    hibernation_ops->enter() in hibernation_platform_enter() (defined in
    kernel/power/disk.c), ACPI should not call acpi_sleep_prepare(ACPI_STATE_S4)
    from acpi_shutdown().

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Cc: Len Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki