18 Oct, 2011

2 commits


13 Oct, 2011

1 commit

  • ip_vs_mutext is used by both netns shutdown code and startup
    and both implicit uses sk_lock-AF_INET mutex.

    cleanup CPU-1 startup CPU-2
    ip_vs_dst_event() ip_vs_genl_set_cmd()
    sk_lock-AF_INET __ip_vs_mutex
    sk_lock-AF_INET
    __ip_vs_mutex
    * DEAD LOCK *

    A new mutex placed in ip_vs netns struct called sync_mutex is added.

    Comments from Julian and Simon added.
    This patch has been running for more than 3 month now and it seems to work.

    Ver. 3
    IP_VS_SO_GET_DAEMON in do_ip_vs_get_ctl protected by sync_mutex
    instead of __ip_vs_mutex as sugested by Julian.

    Signed-off-by: Hans Schillstrom
    Acked-by: Julian Anastasov
    Signed-off-by: Simon Horman
    Signed-off-by: Pablo Neira Ayuso

    Hans Schillstrom
     

06 Oct, 2011

1 commit


05 Oct, 2011

2 commits

  • * git://github.com/davem330/net:
    pch_gbe: Fixed the issue on which a network freezes
    pch_gbe: Fixed the issue on which PC was frozen when link was downed.
    make PACKET_STATISTICS getsockopt report consistently between ring and non-ring
    net: xen-netback: correctly restart Tx after a VM restore/migrate
    bonding: properly stop queuing work when requested
    can bcm: fix incomplete tx_setup fix
    RDSRDMA: Fix cleanup of rds_iw_mr_pool
    net: Documentation: Fix type of variables
    ibmveth: Fix oops on request_irq failure
    ipv6: nullify ipv6_ac_list and ipv6_fl_list when creating new socket
    cxgb4: Fix EEH on IBM P7IOC
    can bcm: fix tx_setup off-by-one errors
    MAINTAINERS: tehuti: Alexander Indenbaum's address bounces
    dp83640: reduce driver noise
    ptp: fix L2 event message recognition

    Linus Torvalds
     
  • Add the ability to disable PCI-E MPS turning and using the BIOS
    configured MPS defaults. Due to the number of issues recently
    discovered on some x86 chipsets, make this the default behavior.

    Also, add the option for peer to peer DMA MPS configuration. Peer to
    peer DMA is outside the scope of this patch, but MPS configuration could
    prevent it from working by having the MPS on one root port different
    than the MPS on another. To work around this, simply make the system
    wide MPS the smallest possible value (128B).

    Signed-off-by: Jon Mason
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Jon Mason
     

01 Oct, 2011

1 commit

  • …for-linus' of git://tesla.tglx.de/git/linux-2.6-tip

    * 'irq-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
    irq: Fix check for already initialized irq_domain in irq_domain_add
    irq: Add declaration of irq_domain_simple_ops to irqdomain.h

    * 'x86-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
    x86/rtc: Don't recursively acquire rtc_lock

    * 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
    posix-cpu-timers: Cure SMP wobbles
    sched: Fix up wchan borkage
    sched/rt: Migrate equal priority tasks to available CPUs

    Linus Torvalds
     

30 Sep, 2011

1 commit

  • David reported:

    Attached below is a watered-down version of rt/tst-cpuclock2.c from
    GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or
    similar.

    Run it several times, and you will see cases where the main thread
    will measure a process clock difference before and after the nanosleep
    which is smaller than the cpu-burner thread's individual thread clock
    difference. This doesn't make any sense since the cpu-burner thread
    is part of the top-level process's thread group.

    I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
    64-bit binaries).

    For example:

    [davem@boricha build-x86_64-linux]$ ./test
    process: before(0.001221967) after(0.498624371) diff(497402404)
    thread: before(0.000081692) after(0.498316431) diff(498234739)
    self: before(0.001223521) after(0.001240219) diff(16698)
    [davem@boricha build-x86_64-linux]$

    The diff of 'process' should always be >= the diff of 'thread'.

    I make sure to wrap the 'thread' clock measurements the most tightly
    around the nanosleep() call, and that the 'process' clock measurements
    are the outer-most ones.

    ---
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    static pthread_barrier_t barrier;

    static void *chew_cpu(void *arg)
    {
    pthread_barrier_wait(&barrier);
    while (1)
    __asm__ __volatile__("" : : : "memory");
    return NULL;
    }

    int main(void)
    {
    clockid_t process_clock, my_thread_clock, th_clock;
    struct timespec process_before, process_after;
    struct timespec me_before, me_after;
    struct timespec th_before, th_after;
    struct timespec sleeptime;
    unsigned long diff;
    pthread_t th;
    int err;

    err = clock_getcpuclockid(0, &process_clock);
    if (err)
    return 1;

    err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
    if (err)
    return 1;

    pthread_barrier_init(&barrier, NULL, 2);
    err = pthread_create(&th, NULL, chew_cpu, NULL);
    if (err)
    return 1;

    err = pthread_getcpuclockid(th, &th_clock);
    if (err)
    return 1;

    pthread_barrier_wait(&barrier);

    err = clock_gettime(process_clock, &process_before);
    if (err)
    return 1;

    err = clock_gettime(my_thread_clock, &me_before);
    if (err)
    return 1;

    err = clock_gettime(th_clock, &th_before);
    if (err)
    return 1;

    sleeptime.tv_sec = 0;
    sleeptime.tv_nsec = 500000000;
    nanosleep(&sleeptime, NULL);

    err = clock_gettime(th_clock, &th_after);
    if (err)
    return 1;

    err = clock_gettime(my_thread_clock, &me_after);
    if (err)
    return 1;

    err = clock_gettime(process_clock, &process_after);
    if (err)
    return 1;

    diff = process_after.tv_nsec - process_before.tv_nsec;
    printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
    process_before.tv_sec, process_before.tv_nsec,
    process_after.tv_sec, process_after.tv_nsec, diff);
    diff = th_after.tv_nsec - th_before.tv_nsec;
    printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
    th_before.tv_sec, th_before.tv_nsec,
    th_after.tv_sec, th_after.tv_nsec, diff);
    diff = me_after.tv_nsec - me_before.tv_nsec;
    printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
    me_before.tv_sec, me_before.tv_nsec,
    me_after.tv_sec, me_after.tv_nsec, diff);

    return 0;
    }

    This is due to us using p->se.sum_exec_runtime in
    thread_group_cputime() where we iterate the thread group and sum all
    data. This does not take time since the last schedule operation (tick
    or otherwise) into account. We can cure this by using
    task_sched_runtime() at the cost of having to take locks.

    This also means we can (and must) do away with
    thread_group_sched_runtime() since the modified thread_group_cputime()
    is now more accurate and would deadlock when called from
    thread_group_sched_runtime().

    Aside of that it makes the function safe on 32 bit systems. The old
    code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a
    64bit value and could be changed on another cpu at the same time.

    Reported-by: David Miller
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
    Tested-by: David Miller
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

29 Sep, 2011

1 commit

  • The IEEE 1588 standard defines two kinds of messages, event and general
    messages. Event messages require time stamping, and general do not. When
    using UDP transport, two separate ports are used for the two message
    types.

    The BPF designed to recognize event messages incorrectly classifies L2
    general messages as event messages. This commit fixes the issue by
    extending the filter to check the message type field for L2 PTP packets.
    Event messages are be distinguished from general messages by testing
    the "general" bit.

    Signed-off-by: Richard Cochran
    Cc:
    Signed-off-by: David S. Miller

    Richard Cochran
     

28 Sep, 2011

1 commit


27 Sep, 2011

2 commits

  • That flag no longer makes sense, since we don't look up automount points
    as eagerly any more. Additionally, it turns out that the NO_AUTOMOUNT
    handling was buggy to begin with: it would avoid automounting even for
    cases where we really *needed* to do the automount handling, and could
    return ENOENT for autofs entries that hadn't been instantiated yet.

    With our new non-eager automount semantics, one discussion has been
    about adding a AT_AUTOMOUNT flag to vfs_fstatat (and thus the
    newfstatat() and fstatat64() system calls), but it's probably not worth
    it: you can always force at least directory automounting by simply
    adding the final '/' to the filename, which works for *all* of the stat
    family system calls, old and new.

    So AT_NO_AUTOMOUNT (and thus LOOKUP_NO_AUTOMOUNT) really were just a
    result of our bad default behavior.

    Acked-by: Ian Kent
    Acked-by: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Since we've now turned around and made LOOKUP_FOLLOW *not* force an
    automount, we want to add the ability to force an automount event on
    lookup even if we don't happen to have one of the other flags that force
    it implicitly (LOOKUP_OPEN, LOOKUP_DIRECTORY, LOOKUP_PARENT..)

    Most cases will never want to use this, since you'd normally want to
    delay automounting as long as possible, which usually implies
    LOOKUP_OPEN (when we open a file or directory, we really cannot avoid
    the automount any more).

    But Trond argued sufficiently forcefully that at a minimum bind mounting
    a file and quotactl will want to force the automount lookup. Some other
    cases (like nfs_follow_remote_path()) could use it too, although
    LOOKUP_DIRECTORY would work there as well.

    This commit just adds the flag and logic, no users yet, though. It also
    doesn't actually touch the LOOKUP_NO_AUTOMOUNT flag that is related, and
    was made irrelevant by the same change that made us not follow on
    LOOKUP_FOLLOW.

    Cc: Trond Myklebust
    Cc: Ian Kent
    Cc: Jeff Layton
    Cc: Miklos Szeredi
    Cc: David Howells
    Cc: Al Viro
    Cc: Greg KH
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Sep, 2011

1 commit

  • If optional discard support in dm-crypt is enabled, discards requests
    bypass the crypt queue and blocks of the underlying device are discarded.
    For the read path, discarded blocks are handled the same as normal
    ciphertext blocks, thus decrypted.

    So if the underlying device announces discarded regions return zeroes,
    dm-crypt must disable this flag because after decryption there is just
    random noise instead of zeroes.

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     

23 Sep, 2011

1 commit


22 Sep, 2011

1 commit

  • * 'for-linus' of git://git.kernel.dk/linux-block:
    floppy: use del_timer_sync() in init cleanup
    blk-cgroup: be able to remove the record of unplugged device
    block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request
    mm: Add comment explaining task state setting in bdi_forker_thread()
    mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread()
    block: simplify force plug flush code a little bit
    block: change force plug flush call order
    block: Fix queue_flag update when rq_affinity goes from 2 to 1
    block: separate priority boosting from REQ_META
    block: remove READ_META and WRITE_META
    xen-blkback: fixed indentation and comments
    xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.

    Linus Torvalds
     

20 Sep, 2011

2 commits

  • 598841ca9919d008b520114d8a4378c4ce4e40a1 ([S390] use gmap address
    spaces for kvm guest images) changed kvm on s390 to use a separate
    address space for kvm guests. We can now put KVM guests anywhere
    in the user address mode with a size up to 8PB - as long as the
    memory is 1MB-aligned. This change was done without KVM extension
    capability bit.
    The change was added after 3.0, but we still have a chance to add
    a feature bit before 3.1 (keeping the releases in a sane state).
    We use number 71 to avoid collisions with other pending kvm patches
    as requested by Alexander Graf.

    Signed-off-by: Christian Borntraeger
    Acked-by: Avi Kivity
    Cc: Alexander Graf
    Signed-off-by: Heiko Carstens

    Christian Borntraeger
     
  • irq_domain_simple_ops is exported, but is not declared in irqdomain.h,
    so add it.

    Signed-off-by: Rob Herring
    Cc: Grant Likely
    Cc: marc.zyngier@arm.com
    Cc: thomas.abraham@linaro.org
    Cc: jamie@jamieiles.com
    Cc: b-cousson@ti.com
    Cc: shawn.guo@linaro.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: devicetree-discuss@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/1316017900-19918-2-git-send-email-robherring2@gmail.com
    Signed-off-by: Thomas Gleixner

    Rob Herring
     

19 Sep, 2011

4 commits

  • * git://github.com/davem330/net:
    tcp: fix validation of D-SACK
    tcp: fix build error if !CONFIG_SYN_COOKIES

    Linus Torvalds
     
  • commit 946cedccbd7387 (tcp: Change possible SYN flooding messages)
    added a build error if CONFIG_SYN_COOKIES=n

    Reported-by: Markus Trippelsdorf
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • * 'for-linus' of git://git.infradead.org/users/sameo/mfd-2.6:
    mfd: Fix omap-usb-host build failure
    mfd: Make omap-usb-host TLL mode work again
    mfd: Set MAX8997 irq pointer
    mfd: Fix initialisation of tps65910 interrupts
    mfd: Check for twl4030-madc NULL pointer
    mfd: Copy the device pointer to the twl4030-madc structure
    mfd: Rename wm8350 static gpio_set_debounce()
    mfd: Fix value of WM8994_CONFIGURE_GPIO

    Linus Torvalds
     
  • * git://github.com/davem330/net: (62 commits)
    ipv6: don't use inetpeer to store metrics for routes.
    can: ti_hecc: include linux/io.h
    IRDA: Fix global type conflicts in net/irda/irsysctl.c v2
    net: Handle different key sizes between address families in flow cache
    net: Align AF-specific flowi structs to long
    ipv4: Fix fib_info->fib_metrics leak
    caif: fix a potential NULL dereference
    sctp: deal with multiple COOKIE_ECHO chunks
    ibmveth: Fix checksum offload failure handling
    ibmveth: Checksum offload is always disabled
    ibmveth: Fix issue with DMA mapping failure
    ibmveth: Fix DMA unmap error
    pch_gbe: support ML7831 IOH
    pch_gbe: added the process of FIFO over run error
    pch_gbe: fixed the issue which receives an unnecessary packet.
    sfc: Use 64-bit writes for TX push where possible
    Revert "sfc: Use write-combining to reduce TX latency" and follow-ups
    bnx2x: Fix ethtool advertisement
    bnx2x: Fix 578xx link LED
    bnx2x: Fix XMAC loopback test
    ...

    Linus Torvalds
     

17 Sep, 2011

3 commits

  • With the conversion of struct flowi to a union of AF-specific structs, some
    operations on the flow cache need to account for the exact size of the key.

    Signed-off-by: David Ward
    Signed-off-by: David S. Miller

    dpward
     
  • AF-specific flowi structs are now passed to flow_key_compare, which must
    also be aligned to a long.

    Signed-off-by: David Ward
    Signed-off-by: David S. Miller

    David Ward
     
  • Attempt to reduce the number of IP packets emitted in response to single
    SCTP packet (2e3216cd) introduced a complication - if a packet contains
    two COOKIE_ECHO chunks and nothing else then SCTP state machine corks the
    socket while processing first COOKIE_ECHO and then loses the association
    and forgets to uncork the socket. To deal with the issue add new SCTP
    command which can be used to set association explictly. Use this new
    command when processing second COOKIE_ECHO chunk to restore the context
    for SCTP state machine.

    Signed-off-by: Max Matveev
    Signed-off-by: David S. Miller

    Max Matveev
     

16 Sep, 2011

3 commits

  • David S. Miller
     
  • dev_forward_skb loops an skb back into host networking
    stack which might hang on the memory indefinitely.
    In particular, this can happen in macvtap in bridged mode.
    Copy the userspace fragments to avoid blocking the
    sender in that case.

    As this patch makes skb_copy_ubufs extern now,
    I also added some documentation and made it clear
    the SKBTX_DEV_ZEROCOPY flag automatically instead
    of doing it in all callers. This can be made into a separate
    patch if people feel it's worth it.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
     
  • "Possible SYN flooding on port xxxx " messages can fill logs on servers.

    Change logic to log the message only once per listener, and add two new
    SNMP counters to track :

    TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client

    TCPReqQFullDrop : number of times a SYN request was dropped because
    syncookies were not enabled.

    Based on a prior patch from Tom Herbert, and suggestions from David.

    Signed-off-by: Eric Dumazet
    CC: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Sep, 2011

2 commits

  • Building a kernel with hotplug disabled results in a link failure:

    `bgpio_remove' referenced in section `___ksymtab_gpl+bgpio_remove' of drivers/built-in.o: defined in discarded section `.devexit.text' of drivers/built-in.o

    This is because of bgpio_remove() is exported. It is illegal to export
    symbols which are discarded either at link time or as part of an
    init/exit section.

    Fix this by dropping the __devexit attributation from bgpio_remove().
    Also drop the __devinit attributation from bgpio_init().

    Signed-off-by: Russell King
    Cc: Grant Likely
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Russell King
     
  • Revert the post-3.0 commit 82f9d486e59f5 ("memcg: add
    memory.vmscan_stat").

    The implementation of per-memcg reclaim statistics violates how memcg
    hierarchies usually behave: hierarchically.

    The reclaim statistics are accounted to child memcgs and the parent
    hitting the limit, but not to hierarchy levels in between. Usually,
    hierarchical statistics are perfectly recursive, with each level
    representing the sum of itself and all its children.

    Since this exports statistics to userspace, this may lead to confusion
    and problems with changing things after the release, so revert it now,
    we can try again later.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

09 Sep, 2011

2 commits


08 Sep, 2011

1 commit


06 Sep, 2011

3 commits


31 Aug, 2011

2 commits


30 Aug, 2011

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (42 commits)
    netpoll: fix incorrect access to skb data in __netpoll_rx
    cassini: init before use in cas_interruptN.
    can: ti_hecc: Fix uninitialized spinlock in probe
    can: ti_hecc: Fix unintialized variable
    net: sh_eth: fix the compile error
    net/phy: fix DP83865 phy interrupt handler
    sendmmsg/sendmsg: fix unsafe user pointer access
    ibmveth: Fix leak when recycling skb and hypervisor returns error
    arp: fix rcu lockdep splat in arp_process()
    bridge: fix a possible use after free
    bridge: Pseudo-header required for the checksum of ICMPv6
    mcast: Fix source address selection for multicast listener report
    MAINTAINERS: Update GIT trees for network development
    ath9k: Fix PS wrappers in ath9k_set_coverage_class
    carl9170: Fix mismatch in carl9170_op_set_key mutex lock-unlock
    wl12xx: add max_sched_scan_ssids value to the hw description
    wl12xx: Fix validation of pm_runtime_get_sync return value
    wl12xx: Remove obsolete testmode NVS push command
    bcma: add uevent to the bus, to autoload drivers
    ath9k_hw: Fix STA (AR9485) bringup issue due to incorrect MAC address
    ...

    Linus Torvalds
     

29 Aug, 2011

1 commit

  • The current cgroup context switch code was incorrect leading
    to bogus counts. Furthermore, as soon as there was an active
    cgroup event on a CPU, the context switch cost on that CPU
    would increase by a significant amount as demonstrated by a
    simple ping/pong example:

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10684.51 ctxsw/s

    Now start a cgroup perf stat:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    6674.61 ctxsw/s

    That's a 37% penalty.

    Note that pong is not even in the monitored cgroup.

    The results shown by perf stat are bogus:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    Performance counter stats for 'sleep 100':

    CPU1 cycles test
    CPU1 16,984,189,138 cycles # 0.000 GHz

    The second 'cycles' event should report a count @ CPU clock
    (here 2.4GHz) as it is counting across all cgroups.

    The patch below fixes the bogus accounting and bypasses any
    cgroup switches in case the outgoing and incoming tasks are
    in the same cgroup.

    With this patch the same test now yields:
    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10775.30 ctxsw/s

    Start perf stat with cgroup:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Run pong outside the cgroup:
    $ /pong
    Both processes pinned to CPU1, running for 10s
    10687.80 ctxsw/s

    The penalty is now less than 2%.

    And the results for perf stat are correct:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    Now perf stat reports the correct counts for
    for the non cgroup event.

    If we run pong inside the cgroup, then we also get the
    correct counts:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 22,297,726,205 cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    10.001457237 seconds time elapsed

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

27 Aug, 2011

1 commit