28 Oct, 2011

21 commits

  • In setlease, we use i_writecount to decide whether we can give out a
    read lease.

    In open, we break leases before incrementing i_writecount.

    There is therefore a window between the break lease and the i_writecount
    increment when setlease could add a new read lease.

    This would leave us with a simultaneous write open and read lease, which
    shouldn't happen.

    Signed-off-by: J. Bruce Fields
    Signed-off-by: Christoph Hellwig

    J. Bruce Fields
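
The window described above can be modelled in a few lines of userspace C. This is only a sketch: `toy_inode` and every `toy_*` function are hypothetical names, not kernel code, and the message does not say how the patch closes the window. The second ordering (count first, then break) is just one plausible way to make the lease check see the writer.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the inode state the message talks about. */
struct toy_inode {
    int  writecount;  /* models i_writecount */
    bool read_lease;  /* true while a read lease is outstanding */
};

/* Models setlease: grant a read lease only if nobody has the file open for write. */
static bool toy_set_read_lease(struct toy_inode *ino)
{
    if (ino->writecount > 0)
        return false;
    ino->read_lease = true;
    return true;
}

static void toy_break_lease(struct toy_inode *ino)
{
    ino->read_lease = false;
}

/* Order described in the message: break the lease, then bump the count.
 * A lease request arriving in between is granted, because the count is still 0. */
static void toy_open_break_first(struct toy_inode *ino, bool lease_request_in_window)
{
    toy_break_lease(ino);
    if (lease_request_in_window)
        toy_set_read_lease(ino);
    ino->writecount++;
}

/* Assumed alternative order: bump the count first, so a lease request
 * arriving in the same place is refused, then break existing leases. */
static void toy_open_count_first(struct toy_inode *ino, bool lease_request_in_window)
{
    ino->writecount++;
    if (lease_request_in_window)
        toy_set_read_lease(ino);
    toy_break_lease(ino);
}
```

With the first ordering, the model ends up with `writecount == 1` and `read_lease == true` at the same time, which is the state the message says shouldn't happen.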
     
  • This makes NFS follow the standard generic_file_llseek locking scheme.

    Cc: Trond.Myklebust@netapp.com
    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • This gives ext4 the benefits of unlocked llseek.

    Cc: tytso@mit.edu
    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • Add a generic_file_llseek variant to the VFS that allows passing in
    the maximum file size of the file system, instead of always
    using maxbytes from the superblock.

    This can be used to eliminate some cut'n'paste seek code in ext4.

    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
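
A rough userspace sketch of the idea, assuming a toy file object: the seek helper takes the size limit as a parameter, and the generic variant becomes a thin wrapper that passes a filesystem-wide default. `toy_file`, `toy_llseek_size`, `toy_llseek` and `TOY_SB_MAXBYTES` are hypothetical names. Overflow checking is left out for brevity.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical superblock-wide limit, standing in for maxbytes. */
#define TOY_SB_MAXBYTES ((int64_t)1 << 40)

struct toy_file {
    int64_t pos;   /* current file position */
    int64_t size;  /* current file size */
};

/* Seek with a caller-supplied maximum size. */
static int64_t toy_llseek_size(struct toy_file *f, int64_t offset,
                               int whence, int64_t maxsize)
{
    switch (whence) {
    case SEEK_SET:
        break;
    case SEEK_CUR:
        offset += f->pos;
        break;
    case SEEK_END:
        offset += f->size;
        break;
    default:
        return -EINVAL;
    }
    if (offset < 0 || offset > maxsize)
        return -EINVAL;
    f->pos = offset;
    return offset;
}

/* The generic variant: same logic, limit taken from the default. */
static int64_t toy_llseek(struct toy_file *f, int64_t offset, int whence)
{
    return toy_llseek_size(f, offset, whence, TOY_SB_MAXBYTES);
}
```

A filesystem with a smaller per-file limit would call `toy_llseek_size()` directly with its own bound instead of copying the switch statement.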
     
  • The i_mutex lock use of generic_file_llseek hurts. Independent processes
    accessing the same file synchronize over a single lock, even though
    they have no need for synchronization at all.

    Under high utilization this can cause llseek to scale very poorly on larger
    systems.

    This patch does some rethinking of the llseek locking model:

    First the 64bit f_pos is not necessarily atomic without locks
    on 32bit systems. This can already cause races with read() today.
    This was discussed on linux-kernel in the past and deemed acceptable.
    The patch does not change that.

    Let's look at the different seek variants:

    SEEK_SET: Doesn't really need any locking.
    If there's a race one writer wins, the other loses.

    For 32bit the non atomic update races against read()
    stay the same. Without a lock they can also happen
    against write() now. The read() race was deemed
    acceptable in past discussions, and I think if it's
    ok for read it's ok for write too.

    => Don't need a lock.

    SEEK_END: This behaves like SEEK_SET plus it reads
    the maximum size too. Reading the maximum size would have the
    32bit atomic problem. But luckily we already have a way to read
    the maximum size without locking (i_size_read), so we
    can just use that instead.

    Without i_mutex there is no synchronization with write() anymore,
    however since the write() update is atomic on 64bit it just behaves
    like another racy SEEK_SET. On non atomic 32bit it's the same
    as SEEK_SET.

    => Don't need a lock, but need to use i_size_read()

    SEEK_CUR: This has a read-modify-write race window
    on the same file. One could argue that any application
    doing unsynchronized seeks on the same file is already broken.
    But for the sake of not adding a regression here I'm
    using the file->f_lock to synchronize this. Using this
    lock is much better than the inode mutex because it doesn't
    synchronize between processes.

    => So still need a lock, but can use f_lock.

    This patch implements this new scheme in generic_file_llseek.
    I dropped generic_file_llseek_unlocked and changed all callers.

    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
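
The three cases above can be sketched in userspace C11. This is a model under stated assumptions, not the kernel implementation: `toy_inode`, `toy_file` and the `toy_*` functions are hypothetical, an `atomic_flag` spinlock stands in for `file->f_lock`, and an atomic load stands in for `i_size_read()`. Only SEEK_CUR takes the per-file lock.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */

struct toy_inode {
    _Atomic int64_t i_size;   /* read without a lock, like i_size_read() */
};

struct toy_file {
    atomic_flag       f_lock; /* per-file lock, not shared between opens */
    int64_t           f_pos;
    struct toy_inode *inode;
};

static void toy_file_init(struct toy_file *f, struct toy_inode *ino, int64_t size)
{
    atomic_init(&ino->i_size, size);
    atomic_flag_clear(&f->f_lock);
    f->f_pos = 0;
    f->inode = ino;
}

/* Validate and store the new position; returns it, or -EINVAL. */
static int64_t toy_seek_commit(struct toy_file *f, int64_t offset)
{
    if (offset < 0)
        return -EINVAL;
    f->f_pos = offset;
    return offset;
}

static int64_t toy_llseek(struct toy_file *f, int64_t offset, int whence)
{
    int64_t ret;

    switch (whence) {
    case SEEK_SET:
        /* No lock: concurrent setters simply race, one wins. */
        return toy_seek_commit(f, offset);
    case SEEK_END:
        /* No lock, but the size is read atomically. */
        return toy_seek_commit(f, offset + atomic_load(&f->inode->i_size));
    case SEEK_CUR:
        /* Read-modify-write of f_pos: serialize on the per-file lock. */
        while (atomic_flag_test_and_set_explicit(&f->f_lock, memory_order_acquire))
            ;
        ret = toy_seek_commit(f, f->f_pos + offset);
        atomic_flag_clear_explicit(&f->f_lock, memory_order_release);
        return ret;
    default:
        return -EINVAL;
    }
}
```

Because the lock lives in `toy_file` rather than `toy_inode`, two processes with separate opens of the same file never contend with each other, which is the point the message makes about f_lock versus the inode mutex.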
     
  • This doesn't change anything for the compiler, but hch thought it would
    make the code clearer.

    I moved the reference counting into its own little inline.

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • Add inlines to all the submission path functions. While this increases
    code size it also gives gcc a lot of optimization opportunities
    in this critical hotpath.

    In particular -- together with some other changes -- this
    allows gcc to get rid of the unnecessary clearing of
    sdio at the beginning and optimize the messy parameter passing.
    Any non-inlining of a function which takes an sdio parameter
    would break this optimization, because it cannot be done if the
    address of the structure is taken.

    Note that benefits are only seen with CONFIG_OPTIMIZE_INLINING
    and CONFIG_CC_OPTIMIZE_FOR_SIZE both set to off.

    This gives about 2.2% improvement on a large database benchmark
    with a high IOPS rate.

    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • Only a single b_private field in the map_bh buffer head is needed after
    the submission path. Move map_bh separately to avoid storing
    this information in the long term slab.

    This avoids the weird 104 byte hole in struct dio_submit which also needed
    to be memseted early.

    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • A direct slab call is slightly faster than kmalloc and can be better cached
    per CPU. It also avoids rounding to the next kmalloc slab.

    In addition this enforces cache line alignment for struct dio to avoid
    any false sharing.

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • Fix most problems reported by pahole.

    There is still a weird 104 byte hole after map_bh. I'm not sure what
    causes this.

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • There's nothing on the stack, even before my changes.

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • This large, but largely mechanical, patch moves all fields in struct dio
    that are only used in the submission path into a separate on stack
    data structure. This has the advantage that the memory is very likely
    cache hot, which is not guaranteed for memory fresh out of kmalloc.

    This also gives gcc more optimization potential because it can more easily
    determine that there are no external aliases for these variables.

    The sdio initialization is now an initializer instead of a memset.
    This allows gcc to break sdio into individual fields and optimize
    away unnecessary zeroing (after all the functions are inlined).

    Signed-off-by: Andi Kleen
    Acked-by: Jeff Moyer
    Signed-off-by: Christoph Hellwig

    Andi Kleen
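
The split can be illustrated with a small userspace sketch. The field names and the `toy_*` identifiers below are made up for illustration and do not match the real struct dio. The long-lived part is heap allocated, while the submission-only part is a stack variable set up with an initializer rather than memset.

```c
#include <stddef.h>
#include <stdlib.h>

/* Long-lived state: must survive until the I/O completes. */
struct toy_dio {
    int    refcount;
    size_t result;
};

/* Submission-only state: only needed while the request is being built. */
struct toy_dio_submit {
    size_t   cur_page;
    size_t   total_pages;
    unsigned blkbits;
};

/* Returns the number of bytes "submitted" (0 on allocation failure). */
static size_t toy_submit(size_t total_pages, unsigned blkbits)
{
    struct toy_dio *dio = malloc(sizeof(*dio));
    /* Initializer, not memset: the compiler can track each field
     * separately and drop stores that are never read. */
    struct toy_dio_submit sdio = {
        .total_pages = total_pages,
        .blkbits     = blkbits,
    };
    size_t bytes;

    if (!dio)
        return 0;
    dio->refcount = 1;
    dio->result = 0;

    /* The submission loop touches only the on-stack sdio for bookkeeping. */
    for (sdio.cur_page = 0; sdio.cur_page < sdio.total_pages; sdio.cur_page++)
        dio->result += (size_t)1 << sdio.blkbits;

    bytes = dio->result;
    free(dio);
    return bytes;
}
```

Since the address of `sdio` never escapes `toy_submit()`, a compiler is free to keep its fields in registers, which is the kind of optimization the message refers to.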
     
  • We need to move the inode to the end of the list to actually make the
    spinning prevention explained in the comment above it work. With a
    plain list_move it will simply stay in place as we're always reclaiming
    from the head of the list.

    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
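
The difference between the two list operations is easy to see with a minimal userspace re-implementation of a circular doubly linked list. The helpers below are a sketch modelled on the kernel list API's semantics, written from scratch for illustration. They are not the kernel's own code.

```c
/* Circular doubly linked list; the head is a sentinel node. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
    h->next = h->prev = h;
}

/* Link e between two adjacent nodes. */
static void list_insert(struct list_head *e, struct list_head *prev,
                        struct list_head *next)
{
    next->prev = e;
    e->next = next;
    e->prev = prev;
    prev->next = e;
}

static void list_del_entry(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

static void list_add_tail(struct list_head *e, struct list_head *head)
{
    list_insert(e, head->prev, head);
}

/* Move e to the front: if e was already first, nothing changes. */
static void list_move(struct list_head *e, struct list_head *head)
{
    list_del_entry(e);
    list_insert(e, head, head->next);
}

/* Move e to the back: the next scan from the head sees other entries first. */
static void list_move_tail(struct list_head *e, struct list_head *head)
{
    list_del_entry(e);
    list_insert(e, head->prev, head);
}
```

For a list a, b, c scanned from the head, `list_move(&a, &head)` leaves a at the front, so a reclaim loop would pick it again immediately. `list_move_tail(&a, &head)` yields b, c, a.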
     
  • Acked-by: J. Bruce Fields
    Acked-by: David Howells
    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Christoph Hellwig

    Andreas Gruenbacher
     
  • Acked-by: J. Bruce Fields
    Acked-by: David Howells
    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Christoph Hellwig

    Andreas Gruenbacher
     
  • We are going to add more flags, and having them in hex format
    makes it simpler.

    Acked-by: J. Bruce Fields
    Acked-by: David Howells
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Christoph Hellwig

    Aneesh Kumar K.V
     
  • Acked-by: J. Bruce Fields
    Acked-by: David Howells
    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Christoph Hellwig

    Andreas Gruenbacher
     
  • This was found by inspection while tracking a similar
    bug in compat_statfs64, that has been fixed in mainline
    since December.

    - This fixes a bug where not all of the f_spare fields
    were cleared on mips and s390.
    - Add the f_flags field to struct compat_statfs
    - Copy f_flags to userspace in case someone cares.
    - Use __clear_user to clear the f_spare field in userspace
    to ensure that all of the elements of f_spare are cleared.
    On some architectures f_spare has 5 ints and on some
    architectures f_spare only has 4 ints, which makes
    the previous technique of clearing each int individually
    broken.

    I don't expect anyone actually uses the old statfs system
    call anymore but if they do let them benefit from having
    the compat and the native version working the same.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Christoph Hellwig

    Eric W. Biederman
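
A userspace sketch of why clearing element by element breaks when the array length differs between architectures. The struct layouts and macro names are hypothetical, and `memset` stands in for `__clear_user`, which only exists in the kernel.

```c
#include <string.h>

/* Two hypothetical layouts: f_spare has 4 ints on one, 5 on the other. */
struct toy_statfs_a {
    int f_flags;
    int f_spare[4];
};

struct toy_statfs_b {
    int f_flags;
    int f_spare[5];
};

/* Brittle: hard-codes four elements, so a fifth is left untouched. */
#define TOY_CLEAR_SPARE_EACH(buf) do { \
        (buf)->f_spare[0] = 0;         \
        (buf)->f_spare[1] = 0;         \
        (buf)->f_spare[2] = 0;         \
        (buf)->f_spare[3] = 0;         \
    } while (0)

/* Robust: the size comes from the array itself, whatever its length. */
#define TOY_CLEAR_SPARE_ALL(buf) \
    memset((buf)->f_spare, 0, sizeof((buf)->f_spare))
```

Applied to `struct toy_statfs_b`, the first macro leaves `f_spare[4]` holding whatever was there before, while the second clears all five elements.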
     
  • nfsiostat was failing to find mounted filesystems on kernels after
    2.6.38 because of changes to show_vfsstat() by commit
    c7f404b40a3665d9f4e9a927cc5c1ee0479ed8f9. This patch adds back the
    "device" tag before the nfs server entry so scripts can parse the
    mountstats file correctly.

    Signed-off-by: Bryan Schumaker
    CC: stable@kernel.org [>=2.6.39]
    Signed-off-by: Christoph Hellwig

    Bryan Schumaker
     
  • The patch is against 3.1-rc3.

    Signed-off-by: Wang Sheng-Hui
    Signed-off-by: Christoph Hellwig

    Wang Sheng-Hui
     
  • Currently, when you call iov_iter_advance, then the pointer to the iovec
    array can be incremented, but it does not decrement the nr_segs value in
    the iov_iter struct. The result is an iov_iter struct with a nr_segs
    value that goes beyond the end of the array.

    While I'm not aware of anything that's specifically broken by this, it
    seems odd and a bit dangerous not to decrement that value. If someone
    were to trust the nr_segs value to be correct, then they could end up
    walking off the end of the array.

    Changing this might also provide some micro-optimization when dealing
    with the last iovec in an array. Many of the other routines that deal
    with iov_iter have optimized codepaths when nr_segs == 1.

    Cc: Nick Piggin
    Signed-off-by: Jeff Layton
    Signed-off-by: Christoph Hellwig

    Jeff Layton
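
A simplified userspace sketch of the behaviour being asked for: whenever the iterator steps to the next iovec, nr_segs is decremented in the same place. `toy_iov_iter` and `toy_iov_iter_advance` are hypothetical names, and this toy version deliberately never steps past the final segment, which is a simplification of this sketch rather than a statement about the kernel code.

```c
#include <stddef.h>
#include <sys/uio.h>   /* struct iovec */

struct toy_iov_iter {
    const struct iovec *iov;        /* current segment */
    unsigned long       nr_segs;    /* segments left, including the current one */
    size_t              iov_offset; /* bytes consumed in the current segment */
    size_t              count;      /* total bytes left */
};

static void toy_iov_iter_advance(struct toy_iov_iter *i, size_t bytes)
{
    size_t off;

    if (bytes > i->count)
        bytes = i->count;
    i->count -= bytes;

    /* Offset measured from the start of the current segment. */
    off = i->iov_offset + bytes;
    while (i->nr_segs > 1 && off >= i->iov->iov_len) {
        off -= i->iov->iov_len;
        i->iov++;
        i->nr_segs--;   /* keep the segment count in step with the pointer */
    }
    i->iov_offset = off;
}
```

With the count kept in step, `iov + nr_segs` always points one past the end of the original array, so code that trusts nr_segs cannot walk off the end, and the last segment is visible as `nr_segs == 1`.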
     

24 Oct, 2011

5 commits

  • Linus Torvalds
     
  • * git://git.infradead.org/iommu-2.6:
    intel-iommu: fix superpage support in pfn_to_dma_pte()
    intel-iommu: set iommu_superpage on VM domains to lowest common denominator
    intel-iommu: fix return value of iommu_unmap() API
    MAINTAINERS: Update VT-d entry for drivers/pci -> drivers/iommu move
    intel-iommu: Export a flag indicating that the IOMMU is used for iGFX.
    intel-iommu: Workaround IOTLB hang on Ironlake GPU
    intel-iommu: Fix AB-BA lockdep report

    Linus Torvalds
     
  • * 'for-linus' of http://people.redhat.com/agk/git/linux-dm:
    dm kcopyd: fix job_pool leak

    Linus Torvalds
     
  • Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4
    regression since 2.6.39, namely the machine reboots occasionally at S4
    resume. It doesn't always happen; the overall rate is about 1/20. But,
    like other bugs, once this happens, it continues to happen.

    This patch fixes the problem by essentially reverting the memory
    assignment to the older way.

    Signed-off-by: Takashi Iwai
    Cc:
    Cc: Rafael J. Wysocki
    Cc: Yinghai Lu
    [ We'll hopefully find the real fix, but that's too late for 3.1 now ]
    Signed-off-by: Linus Torvalds

    Takashi Iwai
     
  • Fix memory leak introduced by commit a6e50b409d3f9e0833e69c3c9cca822e8fa4adbb
    (dm snapshot: skip reading origin when overwriting complete chunk).

    When allocating a set of jobs from kc->job_pool, job->master_job must be
    set (to point to itself) so that the mempool item gets freed when the
    master_job completes.

    master_job was introduced by commit c6ea41fbbe08f270a8edef99dc369faf809d1bd6
    (dm kcopyd: preallocate sub jobs to avoid deadlock)

    Reported-by: Michael Leun
    Cc: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     

23 Oct, 2011

2 commits


22 Oct, 2011

1 commit

  • v2:
    - register_syscore_ops(&s3c24xx_irq_syscore_ops) does not need to be
    conditionally compiled out, it is already optimized out on !CONFIG_PM
    - fix also s3c2412 and s3c2416 affected by the same build issue

    v1:
    s3c2440.c fails to build if !CONFIG_PM because in such case
    s3c2410_pm_syscore_ops is not defined. Same error should happen also
    in s3c2410.c and s3c2442.c

    Signed-off-by: Domenico Andreoli
    Signed-off-by: Kukjin Kim

    Domenico Andreoli
     

21 Oct, 2011

6 commits

  • * git://github.com/herbertx/crypto:
    crypto: ghash - Avoid null pointer dereference if no key is set

    Linus Torvalds
     
  • * 'fix/hda' of git://github.com/tiwai/sound:
    ALSA: HDA: conexant support for Lenovo T520/W520
    ALSA: hda - Add position_fix quirk for Dell Inspiron 1010

    Linus Torvalds
     
  • The ghash_update function passes a pointer to gf128mul_4k_lle which will
    be NULL if ghash_setkey is not called or if the most recent call to
    ghash_setkey failed to allocate memory. This causes an oops. Fix this
    up by returning an error code in the null case.

    This is trivially triggered from unprivileged userspace through the
    AF_ALG interface by simply writing to the socket without setting a key.

    The ghash_final function has a similar issue, but triggering it requires
    a memory allocation failure in ghash_setkey _after_ at least one
    successful call to ghash_update.

    BUG: unable to handle kernel NULL pointer dereference at 00000670
    IP: [] gf128mul_4k_lle+0x23/0x60 [gf128mul]
    *pde = 00000000
    Oops: 0000 [#1] PREEMPT SMP
    Modules linked in: ghash_generic gf128mul algif_hash af_alg nfs lockd nfs_acl sunrpc bridge ipv6 stp llc

    Pid: 1502, comm: hashatron Tainted: G W 3.1.0-rc9-00085-ge9308cf #32 Bochs Bochs
    EIP: 0060:[] EFLAGS: 00000202 CPU: 0
    EIP is at gf128mul_4k_lle+0x23/0x60 [gf128mul]
    EAX: d69db1f0 EBX: d6b8ddac ECX: 00000004 EDX: 00000000
    ESI: 00000670 EDI: d6b8ddac EBP: d6b8ddc8 ESP: d6b8dda4
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    Process hashatron (pid: 1502, ti=d6b8c000 task=d6810000 task.ti=d6b8c000)
    Stack:
    00000000 d69db1f0 00000163 00000000 d6b8ddc8 c101a520 d69db1f0 d52aa000
    00000ff0 d6b8dde8 d88d310f d6b8a3f8 d52aa000 00001000 d88d502c d6b8ddfc
    00001000 d6b8ddf4 c11676ed d69db1e8 d6b8de24 c11679ad d52aa000 00000000
    Call Trace:
    [] ? kmap_atomic_prot+0x37/0xa6
    [] ghash_update+0x85/0xbe [ghash_generic]
    [] crypto_shash_update+0x18/0x1b
    [] shash_ahash_update+0x22/0x36
    [] shash_async_update+0xb/0xd
    [] hash_sendpage+0xba/0xf2 [algif_hash]
    [] kernel_sendpage+0x39/0x4e
    [] ? 0xd88cdfff
    [] sock_sendpage+0x37/0x3e
    [] ? kernel_sendpage+0x4e/0x4e
    [] pipe_to_sendpage+0x56/0x61
    [] splice_from_pipe_feed+0x58/0xcd
    [] ? splice_from_pipe_begin+0x10/0x10
    [] __splice_from_pipe+0x36/0x55
    [] ? splice_from_pipe_begin+0x10/0x10
    [] splice_from_pipe+0x51/0x64
    [] ? default_file_splice_write+0x2c/0x2c
    [] generic_splice_sendpage+0x13/0x15
    [] ? splice_from_pipe_begin+0x10/0x10
    [] do_splice_from+0x5d/0x67
    [] sys_splice+0x2bf/0x363
    [] ? sysenter_exit+0xf/0x16
    [] ? trace_hardirqs_on_caller+0x10e/0x13f
    [] sysenter_do_call+0x12/0x32
    Code: 83 c4 0c 5b 5e 5f c9 c3 55 b9 04 00 00 00 89 e5 57 8d 7d e4 56 53 8d 5d e4 83 ec 18 89 45 e0 89 55 dc 0f b6 70 0f c1 e6 04 01 d6 a5 be 0f 00 00 00 4e 89 d8 e8 48 ff ff ff 8b 45 e0 89 da 0f
    EIP: [] gf128mul_4k_lle+0x23/0x60 [gf128mul] SS:ESP 0068:d6b8dda4
    CR2: 0000000000000670
    ---[ end trace 4eaa2a86a8e2da24 ]---
    note: hashatron[1502] exited with preempt_count 1
    BUG: scheduling while atomic: hashatron/1502/0x10000002
    INFO: lockdep is turned off.
    [...]

    Signed-off-by: Nick Bowler
    Cc: stable@kernel.org [2.6.37+]
    Signed-off-by: Herbert Xu

    Nick Bowler
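
The shape of the fix, as a userspace sketch: check for a missing key before touching the key-dependent table and return an error instead of dereferencing NULL. `toy_ghash_ctx` and `toy_ghash_update` are hypothetical, the mixing loop is a placeholder and not GHASH, and since the message does not name the error code, `-EINVAL` here is only a stand-in.

```c
#include <errno.h>
#include <stddef.h>

struct toy_ghash_ctx {
    /* Models the key-derived table; NULL until a setkey call succeeds. */
    const unsigned char *key_table;
};

/* Returns 0 on success, a negative errno-style value if no key is set. */
static int toy_ghash_update(const struct toy_ghash_ctx *ctx,
                            unsigned char state[16],
                            const unsigned char *src, size_t len)
{
    size_t i;

    if (!ctx->key_table)
        return -EINVAL;   /* placeholder error code, see lead-in */

    /* Placeholder mixing step, only here so the table is actually used. */
    for (i = 0; i < len; i++)
        state[i % 16] ^= src[i] ^ ctx->key_table[i % 16];

    return 0;
}
```

A caller that writes data before setting a key now gets an error back rather than triggering the oops shown in the trace.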
     
  • Offsets of the irq controller registers were calculated
    correctly only for first GPIO bank. This patch fixes
    calculation of the register offsets for all GPIO banks.

    Reported-by: Sylwester Nawrocki
    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    Signed-off-by: Kukjin Kim

    Marek Szyprowski
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
    sparc: Add alignment flag to PCI expansion resources
    sparc: Avoid calling sigprocmask()
    sparc: Use set_current_blocked()
    sparc32,leon: SRMMU MMU Table probe fix

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
    fib_rules: fix unresolved_rules counting
    r8169: fix wrong eee setting for rlt8111evl
    r8169: fix driver shutdown WoL regression.
    ehea: Change maintainer to me
    pptp: pptp_rcv_core() misses pskb_may_pull() call
    tproxy: copy transparent flag when creating a time wait
    pptp: fix skb leak in pptp_xmit()
    bonding: use local function pointer of bond->recv_probe in bond_handle_frame
    smsc911x: Add support for SMSC LAN89218
    tg3: negate USE_PHYLIB flag check
    netconsole: enable netconsole can make net_device refcnt incorrent
    bluetooth: Properly clone LSM attributes to newly created child connections
    l2tp: fix a potential skb leak in l2tp_xmit_skb()
    bridge: fix hang on removal of bridge via netlink
    x25: Prevent skb overreads when checking call user data
    x25: Handle undersized/fragmented skbs
    x25: Validate incoming call user data lengths
    udplite: fast-path computation of checksum coverage
    IPVS netns shutdown/startup dead-lock
    netfilter: nf_conntrack: fix event flooding in GRE protocol tracker

    Linus Torvalds
     

20 Oct, 2011

5 commits

  • Since 8-bit temperature values are now handled in 16-bit struct
    members, values have to be cast to s8 for negative temperatures to be
    properly handled. This has been broken since kernel version 2.6.39
    (commit bce26c58df86599c9570cee83eac58bdaae760e4).

    Signed-off-by: Jean Delvare
    Cc: Guenter Roeck
    Cc: stable@kernel.org # 2.6.39+
    Signed-off-by: Guenter Roeck

    Jean Delvare
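
A small sketch of the sign problem, assuming the raw 8-bit register value sits in a signed 16-bit member and the result is reported in millidegrees. Both the member type and the unit are assumptions for illustration, as are the function names.

```c
#include <stdint.h>

/* Without the cast, register value 0xF6 is read as +246 degrees. */
static int toy_temp_millideg_broken(int16_t reg)
{
    return reg * 1000;
}

/* Casting to a signed 8-bit type first recovers the sign bit.
 * Converting an out-of-range value to int8_t is implementation-defined
 * in C; gcc and clang wrap modulo 256, giving -10 for 0xF6. */
static int toy_temp_millideg_fixed(int16_t reg)
{
    return (int8_t)reg * 1000;
}
```

Positive readings are unaffected by the cast. Only values with bit 7 set change, which is why the bug shows up for negative temperatures only.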
     
  • I don't usually pay much attention to the stale "? " addresses in
    stack backtraces, but this lucky report from Pawel Sikora hints that
    mremap's move_ptes() has inadequate locking against page migration.

    3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
    kernel BUG at include/linux/swapops.h:105!
    RIP: 0010:[] []
    migration_entry_wait+0x156/0x160
    [] handle_pte_fault+0xae1/0xaf0
    [] ? __pte_alloc+0x42/0x120
    [] ? do_huge_pmd_anonymous_page+0xab/0x310
    [] handle_mm_fault+0x181/0x310
    [] ? vma_adjust+0x537/0x570
    [] do_page_fault+0x11d/0x4e0
    [] ? do_mremap+0x2d5/0x570
    [] page_fault+0x1f/0x30

    mremap's down_write of mmap_sem, together with i_mmap_mutex or lock,
    and pagetable locks, were good enough before page migration (with its
    requirement that every migration entry be found) came in, and enough
    while migration always held mmap_sem; but not enough nowadays, when
    there's memory hotremove and compaction.

    The danger is that move_ptes() lets a migration entry dodge around
    behind remove_migration_pte()'s back, so it's in the old location when
    looking at the new, then in the new location when looking at the old.

    Either mremap's move_ptes() must additionally take anon_vma lock(), or
    migration's remove_migration_pte() must stop peeking for is_swap_entry()
    before it takes pagetable lock.

    Consensus chooses the latter: we prefer to add overhead to migration
    than to mremapping, which gets used by JVMs and by exec stack setup.

    Reported-and-tested-by: Paweł Sikora
    Signed-off-by: Hugh Dickins
    Acked-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Currently no type of alignment is specified for PCI expansion roms while
    parsing the openfirmware tree. This causes calls to pci_map_rom() to fail.
    IORESOURCE_SIZEALIGN is the default alignment used for rom resources in
    pci/probe.c, and has been verified to work with various cards on an Ultra 10.

    Signed-off-by: Kjetil Oftedal
    Signed-off-by: David S. Miller

    Kjetil Oftedal
     
  • We should decrement ops->unresolved_rules when deleting an unresolved rule.

    Signed-off-by: Zheng Yan
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yan, Zheng
     
  • Correct the wrong parameter for setting EEE for RTL8111E-VL.

    Signed-off-by: Hayes Wang
    Signed-off-by: David S. Miller

    hayeswang