01 Mar, 2012

35 commits

  • commit c6c1e4491dc8d1ed2509fa6aacffa7f34614fc38 upstream.

    Signed-off-by: Bruno Thomsen
    Signed-off-by: Greg Kroah-Hartman

    Bruno Thomsen
     
  • [ Upstream commit 0af2a0d0576205dda778d25c6c344fc6508fc81d ]

    This commit ensures that lost_cnt_hint is correctly updated in
    tcp_shifted_skb() for FACK TCP senders. The lost_cnt_hint adjustment
    in tcp_sacktag_one() only applies to non-FACK senders, so FACK senders
    need their own adjustment.

    This applies the spirit of 1e5289e121372a3494402b1b131b41bfe1cf9b7f -
    except now that the sequence range passed into tcp_sacktag_one() is
    correct we need only have a special case adjustment for FACK.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Neal Cardwell
     
  • [ Upstream commit daef52bab1fd26e24e8e9578f8fb33ba1d0cb412 ]

    Fix the newly-SACKed range to be the range of newly-shifted bytes.

    Previously - since 832d11c5cd076abc0aa1eaf7be96c81d1a59ce41 -
    tcp_shifted_skb() incorrectly called tcp_sacktag_one() with the start
    and end sequence numbers of the skb it passes in set to the range just
    beyond the range that is newly-SACKed.

    This commit also removes a special-case adjustment to lost_cnt_hint in
    tcp_shifted_skb(), since the pre-existing adjustment of lost_cnt_hint
    in tcp_sacktag_one() now handles this case properly, given that the
    correct start sequence number is passed in.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Neal Cardwell
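
    The corrected range computation can be sketched as follows; the names
    and types are illustrative stand-ins, not the kernel's (the real code
    works on struct sk_buff and TCP sequence arithmetic):

```c
#include <stdint.h>

struct seq_range { uint32_t start; uint32_t end; };

/* When tcp_shifted_skb() moves 'shifted' bytes from the head of an skb
 * covering [skb_seq, skb_end_seq) into the previous skb, the newly-SACKed
 * range is exactly the shifted bytes. */
static struct seq_range newly_sacked_range(uint32_t skb_seq, uint32_t shifted)
{
    struct seq_range r = { skb_seq, skb_seq + shifted };
    return r;
}

/* The pre-fix code instead passed in the range just beyond the shift: */
static struct seq_range buggy_range(uint32_t skb_seq, uint32_t skb_end_seq,
                                    uint32_t shifted)
{
    struct seq_range r = { skb_seq + shifted, skb_end_seq };
    return r;
}
```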
     
  • [ Upstream commit cc9a672ee522d4805495b98680f4a3db5d0a0af9 ]

    This commit allows callers of tcp_sacktag_one() to pass in sequence
    ranges that do not align with skb boundaries, as tcp_shifted_skb()
    needs to do in an upcoming fix in this patch series.

    In fact, now tcp_sacktag_one() does not need to depend on an input skb
    at all, which makes its semantics and dependencies more clear.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Neal Cardwell
     
  • [ Upstream commit 5ca3b72c5da47d95b83857b768def6172fbc080a ]

    Shlomo Pongratz reported that the GRO L2 header check was suited for
    Ethernet only, and failed on IB/ipoib traffic.

    He provided a patch faking a zeroed header to let GRO aggregate frames.

    Roland Dreier, Herbert Xu, and others suggested we change the GRO L2
    header check to be more generic, i.e. not assuming the L2 header is 14
    bytes, but taking hard_header_len into account.

    __napi_gro_receive() has special handling for the common case (Ethernet)
    to avoid a memcmp() call and use an inline optimized function instead.

    Signed-off-by: Eric Dumazet
    Reported-by: Shlomo Pongratz
    Cc: Roland Dreier
    Cc: Or Gerlitz
    Cc: Herbert Xu
    Tested-by: Sean Hefty
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
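
    The shape of the check can be sketched like this; it is a simplified
    model, not the kernel's __napi_gro_receive, and the 44-byte length used
    in the usage note is just "some non-Ethernet header length":

```c
#include <string.h>
#include <stdint.h>

#define ETH_HLEN 14  /* Ethernet header length */

/* Compare the L2 headers of two packets using the device's
 * hard_header_len rather than assuming 14-byte Ethernet, so that
 * e.g. IB/ipoib frames can be aggregated too. The kernel keeps an
 * optimized inline compare for the common Ethernet case. */
static int l2_headers_match(const uint8_t *h1, const uint8_t *h2,
                            unsigned int hard_header_len)
{
    if (hard_header_len == ETH_HLEN)
        return !memcmp(h1, h2, ETH_HLEN);    /* fast common case */
    return !memcmp(h1, h2, hard_header_len); /* generic devices  */
}
```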
     
  • [ Upstream commit 936d7de3d736e0737542641269436f4b5968e9ef ]

    Commit a0417fa3a18a ("net: Make qdisc_skb_cb upper size bound
    explicit.") made it possible for a netdev driver to use skb->cb
    between its header_ops.create method and its .ndo_start_xmit
    method. Use this in ipoib_hard_header() to stash away the LL address
    (GID + QPN), instead of the "ipoib_pseudoheader" hack. This allows
    IPoIB to stop lying about its hard_header_len, which will let us fix
    the L2 check for GRO.

    Signed-off-by: Roland Dreier
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Roland Dreier
     
  • [ Upstream commit 16bda13d90c8d5da243e2cfa1677e62ecce26860 ]

    Just like skb->cb[], so that qdisc_skb_cb can be encapsulated inside
    of other data structures.

    This is intended to be used by IPoIB so that it can remember
    addressing information stored at hard_header_ops->create() time that
    it can fetch when the packet gets to the transmit routine.

    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
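
    The pattern this enables can be sketched as below; the struct layouts
    are illustrative stand-ins for the kernel's, with skb->cb taken as 48
    bytes on these kernels:

```c
#include <stddef.h>
#include <stdint.h>

#define SKB_CB_SIZE 48  /* sizeof(((struct sk_buff *)0)->cb) */

/* qdisc_skb_cb now carries an explicitly bounded data area ... */
struct qdisc_skb_cb {
    unsigned int pkt_len;
    unsigned char data[24];
};

/* ... so a user such as IPoIB can wrap it and stash its link-layer
 * address (GID + QPN) at header_ops->create() time, then fetch it in
 * ndo_start_xmit. This layout is a hypothetical sketch. */
struct ipoib_cb_sketch {
    struct qdisc_skb_cb qdisc_cb; /* must stay first */
    uint8_t hwaddr[20];           /* 16-byte GID + 4-byte QPN */
};
```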
     
  • [ Upstream commit 5dc7883f2a7c25f8df40d7479687153558cd531b ]

    This patch fixes a bug introduced by commit ac8a4810 (ipv4: Save
    nexthop address of LSRR/SSRR option to IPCB.). In that patch, we saved
    the nexthop of SRR in ip_option->nexthop and did not update iph->daddr
    until ip_forward_options(), but we need to update it before
    ip_rt_get_source(), otherwise we may get a wrong src.

    Signed-off-by: Li Wei
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Li Wei
     
  • [ Upstream commit e2446eaab5585555a38ea0df4e01ff313dbb4ac9 ]

    Bind the outgoing interface of the RST packet to the incoming
    interface for TCP v4 when there is no socket associated with it.
    When sk is not NULL, use sk->sk_bound_dev_if instead
    (suggested by Eric Dumazet).

    This has a few benefits:
    1. tcp_v6_send_reset already does this.
    2. It helps TCP connections with SO_BINDTODEVICE set. When the
    connection is lost, we are still able to send out the RST using
    the same interface.
    3. Since we are sending a reply, it is most likely to succeed
    if the iif is used.

    Signed-off-by: Shawn Lu
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Shawn Lu
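
    The interface selection boils down to a small rule; this is a toy
    model with a stand-in socket type, not tcp_v4_send_reset itself:

```c
#include <stddef.h>

struct sock_stub { int sk_bound_dev_if; }; /* stand-in for struct sock */

/* Pick the output interface for an outgoing TCP v4 RST: honour the
 * socket's binding when a socket exists, otherwise reply out the
 * interface the offending segment arrived on (iif). */
static int rst_oif(const struct sock_stub *sk, int iif)
{
    return sk ? sk->sk_bound_dev_if : iif;
}
```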
     
  • [ Upstream commit e6b45241c57a83197e5de9166b3b0d32ac562609 ]

    Eric Dumazet found that commit 813b3b5db83
    (ipv4: Use caller's on-stack flowi as-is in output
    route lookups.), which went into 3.0, added a regression.
    The problem appears to be that the resulting flowi4_oif is
    used incorrectly as an input parameter to some routing lookups.
    The result is that when connecting to a local port without a
    listener, if the IP address that is used is not on a loopback
    interface, we incorrectly assign RTN_UNICAST to the output
    route because no route is matched by oif=lo. The RST packet
    cannot be sent immediately by tcp_v4_send_reset because
    it expects RTN_LOCAL.

    So, change ip_route_connect and ip_route_newports to
    update the flowi4 fields that are input parameters, because
    we do not want unnecessary binding to oif.

    To make it clear which input parameters can be modified
    during the lookup, and to show which fields of flowi4 are
    reused, add a new function to update the flowi4 structure:
    flowi4_update_output.

    Thanks to Yurij M. Plotnikov for providing a bug report including a
    program to reproduce the problem.

    Thanks to Eric Dumazet for tracking the problem down to
    tcp_v4_send_reset and providing initial fix.

    Reported-by: Yurij M. Plotnikov
    Signed-off-by: Julian Anastasov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Julian Anastasov
     
  • [ Upstream commit b530b1930bbd9d005345133f0ff0c556d2a52b19 ]

    Initially diagnosed on Ubuntu 11.04 with kernel 2.6.38.

    velocity_close is not called during a suspend / resume cycle in this
    driver and it has no business playing directly with power states.

    Signed-off-by: David Lv
    Acked-by: Francois Romieu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Lv
     
  • [ Upstream commit 237114384ab22c174ec4641e809f8e6cbcfce774 ]

    VETH_INFO_PEER carries struct ifinfomsg plus optional IFLA
    attributes. A minimal size of sizeof(struct ifinfomsg) must be
    enforced or we may risk accessing that struct beyond the limits
    of the netlink message.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Hagen Paul Pfeifer
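
    The minimal-size rule can be sketched as follows; ifinfomsg_stub
    mirrors struct ifinfomsg and the netlink plumbing is omitted:

```c
#include <stddef.h>

/* Stand-in for struct ifinfomsg from <linux/rtnetlink.h>. */
struct ifinfomsg_stub {
    unsigned char  ifi_family;
    unsigned char  __ifi_pad;
    unsigned short ifi_type;
    int            ifi_index;
    unsigned int   ifi_flags;
    unsigned int   ifi_change;
};

/* A VETH_INFO_PEER payload must be at least large enough to contain
 * the ifinfomsg header before it is dereferenced as one. */
static int peer_payload_ok(size_t payload_len)
{
    return payload_len >= sizeof(struct ifinfomsg_stub);
}
```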
     
  • [ Upstream commit eb10192447370f19a215a8c2749332afa1199d46 ]

    Not now, but it looks like you are correct. q->qdisc is NULL until an
    additional qdisc is attached (beside tfifo). See 50612537e9ab2969312.
    The following patch should work.

    From: Hagen Paul Pfeifer

    netem: catch NULL pointer by updating the real qdisc statistic

    Reported-by: Vijay Subramanian
    Signed-off-by: Hagen Paul Pfeifer
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Hagen Paul Pfeifer
     
  • [ Upstream commit 58e05f357a039a94aa36475f8c110256f693a239 ]

    commit 5a698af53f (bond: service netpoll arp queue on master device)
    tested IFF_SLAVE flag against dev->priv_flags instead of dev->flags

    Signed-off-by: Eric Dumazet
    Cc: WANG Cong
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
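
    The bug class is easy to see in a toy model; the IFF_SLAVE value and
    the two flag fields mirror the kernel's, the rest is illustrative:

```c
/* IFF_SLAVE as in <linux/if.h>; it belongs to dev->flags, while
 * dev->priv_flags holds an unrelated namespace of IFF_* bits. */
#define IFF_SLAVE 0x800

struct net_device_stub { unsigned int flags, priv_flags; };

static int is_bond_slave(const struct net_device_stub *dev)
{
    return !!(dev->flags & IFF_SLAVE);       /* correct field */
}

static int is_bond_slave_buggy(const struct net_device_stub *dev)
{
    return !!(dev->priv_flags & IFF_SLAVE);  /* the bug: wrong field */
}
```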
     
  • [ Upstream commit 70620c46ac2b45c24b0f22002fdf5ddd1f7daf81 ]

    Commit 653241 (net: RFC3069, private VLAN proxy arp support) changed
    the behavior of proxy arp to send arp replies back out on the interface
    the request came in on, even if the private VLAN feature is disabled.

    Previously we checked rt->dst.dev != skb->dev in the scenarios where
    proxy arp is enabled for the netdevice and also when individual proxy
    neighbour entries have been added.

    This patch adds the check back for the pneigh_lookup() scenario.

    Signed-off-by: Thomas Graf
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Thomas Graf
     
  • [ Upstream commit 3013dc0cceb9baaf25d5624034eeaa259bf99004 ]

    Jean Delvare reported bonding on top of 3c59x adapters was not detecting
    network cable removal fast enough.

    3c59x indeed uses a 60 seconds timer to check link status if carrier is
    on, and 5 seconds if carrier is off.

    This patch reduces timer period to 5 seconds if device is a bonding
    slave.

    Reported-by: Jean Delvare
    Acked-by: Jean Delvare
    Acked-by: Steffen Klassert
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
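
    The interval selection can be sketched like this; the function name is
    illustrative, not the driver's, and the periods are in seconds as
    stated above:

```c
/* Link-check interval: 60s with carrier on, 5s with carrier off, and
 * now also 5s when the device is enslaved to a bond, so that cable
 * removal is noticed quickly by the bonding master. */
static int media_check_interval(int carrier_ok, int is_bond_slave)
{
    if (!carrier_ok || is_bond_slave)
        return 5;
    return 60;
}
```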
     
  • commit 8e43a905dd574f54c5715d978318290ceafbe275 upstream.

    Bootup with lockdep enabled has been broken on v7 since b46c0f74657d
    ("ARM: 7321/1: cache-v7: Disable preemption when reading CCSIDR").

    This is because v7_setup (which is called very early during boot) calls
    v7_flush_dcache_all, and the save_and_disable_irqs added by that patch
    ends up attempting to call into lockdep C code (trace_hardirqs_off())
    when we are in no position to execute it (no stack, MMU off).

    Fix this by using a notrace variant of save_and_disable_irqs. The code
    already uses the notrace variant of restore_irqs.

    Reviewed-by: Nicolas Pitre
    Acked-by: Stephen Boyd
    Cc: Catalin Marinas
    Signed-off-by: Rabin Vincent
    Signed-off-by: Russell King
    Signed-off-by: Greg Kroah-Hartman

    Rabin Vincent
     
  • commit b46c0f74657d1fe1c1b0c1452631cc38a9e6987f upstream.

    armv7's flush_cache_all() flushes caches via set/way. To
    determine the cache attributes (line size, number of sets,
    etc.) the assembly first writes the CSSELR register to select a
    cache level and then reads the CCSIDR register. The CSSELR register
    is banked per-cpu and is used to determine which cache level CCSIDR
    reads. If the task is migrated between when the CSSELR is written and
    the CCSIDR is read the CCSIDR value may be for an unexpected cache
    level (for example L1 instead of L2) and incorrect cache flushing
    could occur.

    Disable interrupts across the write and read so that the correct
    cache attributes are read and used for the cache flushing
    routine. We disable interrupts instead of disabling preemption
    because the critical section is only 3 instructions and we want
    to call v7_dcache_flush_all from __v7_setup which doesn't have a
    full kernel stack with a struct thread_info.

    This fixes a problem we see in scm_call() when flush_cache_all()
    is called from preemptible context and sometimes the L2 cache is
    not properly flushed out.

    Signed-off-by: Stephen Boyd
    Acked-by: Catalin Marinas
    Reviewed-by: Nicolas Pitre
    Signed-off-by: Russell King
    Signed-off-by: Greg Kroah-Hartman

    Stephen Boyd
     
  • commit abe9a6d57b4544ac208401f9c0a4262814db2be4 upstream.

    server_scope would never be freed if nfs4_check_cl_exchange_flags() returned
    non-zero

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Weston Andros Adamson
     
  • commit b9f9a03150969e4bd9967c20bce67c4de769058f upstream.

    To ensure that we don't just reuse the bad delegation when we attempt to
    recover the nfs4_state that received the bad stateid error.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 331818f1c468a24e581aedcbe52af799366a9dfe upstream.

    Commit bf118a342f10dafe44b14451a1392c3254629a1f (NFSv4: include bitmap
    in nfsv4 get acl data) introduces the 'acl_scratch' page for the case
    where we may need to decode multi-page data. However it fails to take
    into account the fact that the variable may be NULL (for the case where
    we're not doing multi-page decode), and it also attaches it to the
    encoding xdr_stream rather than the decoding one.

    The immediate result is an Oops in nfs4_xdr_enc_getacl due to the
    call to page_address() with a NULL page pointer.

    Signed-off-by: Trond Myklebust
    Cc: Andy Adamson
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 4d6144de8ba263eb3691a737c547e5b2fdc45287 upstream.

    If the read or write buffer size associated with the command sent
    through the mmc_blk_ioctl is zero, do not prepare data buffer.

    This enables an ioctl(2) call to, for instance, send an MMC_SWITCH to
    set a byte in the ext_csd.

    Signed-off-by: Johan Rudholm
    Signed-off-by: Chris Ball
    Signed-off-by: Greg Kroah-Hartman

    Johan Rudholm
     
  • [Note that since the patch isn't applicable (and unnecessary) to
    3.3-rc, there is no corresponding upstream fix.]

    The cx5051 parser calls snd_hda_input_jack_add() in the init callback
    to create and initialize the jack detection instances. Since the init
    callback is called at each time when the device gets woken up after
    suspend or power-saving mode, the duplicated instances are accumulated
    at each call. This ends up with the kernel warnings with the too
    large array size.

    The fix is simply to move the calls of snd_hda_input_jack_add() into
    the parser section instead of the init callback.

    The fix is needed only up to 3.2 kernel, since the HD-audio jack layer
    was redesigned in the 3.3 kernel.

    Reported-by: Russell King
    Tested-by: Russell King
    Signed-off-by: Takashi Iwai
    Signed-off-by: Greg Kroah-Hartman

    Takashi Iwai
     
  • commit 46e33c606af8e0caeeca374103189663d877c0d6 upstream.

    This fixes the thrd->req_running field being accessed before thrd
    is checked for null. The error was introduced in

    abb959f: ARM: 7237/1: PL330: Fix driver freeze

    Reference:

    Signed-off-by: Mans Rullgard
    Acked-by: Javi Merino
    Signed-off-by: Russell King
    Signed-off-by: Greg Kroah-Hartman

    Javi Merino
     
  • commit e188dc02d3a9c911be56eca5aa114fe7e9822d53 upstream.

    d_inode_lookup() leaks a dentry reference on IS_DEADDIR().

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
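
    The leak and its fix can be illustrated with a toy reference count;
    this is not VFS code, only the shape of the error path:

```c
#include <stddef.h>

/* Toy dentry with a reference count. */
struct dentry_stub { int refcount; int dead_dir; };

/* d_inode_lookup-style helper: the lookup takes a reference on the
 * child, so the IS_DEADDIR early-exit path must drop it again --
 * the missing dput() was the leak. */
static struct dentry_stub *lookup_child(struct dentry_stub *child)
{
    child->refcount++;                /* reference taken by lookup */
    if (child->dead_dir) {
        child->refcount--;            /* the dput() the fix adds   */
        return NULL;                  /* -ENOENT in the real code  */
    }
    return child;
}
```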
     
  • commit cf1eb40f8f5ea12c9e569e7282161fc7f194fd62 upstream.

    The conversion of the ktime to a value suitable for the clock comparator
    does not take changes to wall_to_monotonic into account. In fact the
    conversion just needs the boot clock (sched_clock_base_cc) and the
    total_sleep_time.

    This is applicable to 3.2+ kernels.

    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Greg Kroah-Hartman

    Martin Schwidefsky
     
  • commit 545d680938be1e86a6c5250701ce9abaf360c495 upstream.

    After passing through a ->setxattr() call, eCryptfs needs to copy the
    inode attributes from the lower inode to the eCryptfs inode, as they
    may have changed in the lower filesystem's ->setxattr() path.

    One example is if an extended attribute containing a POSIX Access
    Control List is being set. The new ACL may cause the lower filesystem to
    modify the mode of the lower inode and the eCryptfs inode would need to
    be updated to reflect the new mode.

    https://launchpad.net/bugs/926292

    Signed-off-by: Tyler Hicks
    Reported-by: Sebastien Bacher
    Cc: John Johansen
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
     
  • commit 61cddc57dc14a5dffa0921d9a24fd68edbb374ac upstream.

    Currently, registers with a value of 0 are ignored when initializing the
    register defaults from raw defaults. This worked in the past, because
    registers without an explicit default were assumed to have a default value
    of 0. This was changed in commit b03622a8 ("regmap: Ensure rbtree syncs
    registers set to zero properly"). As a result, registers which have a raw
    default value of 0 are now assumed to have no default. This again can
    result in unnecessary writes when syncing the cache. It will also result
    in unnecessary reads for e.g. the first update operation. In the case
    where readback is not possible, this will even make the update operation
    fail if the register has not been written to before.

    So this patch removes the check. Instead it adds a check to ignore raw defaults
    for registers which are volatile, since those registers are not cached.

    Signed-off-by: Lars-Peter Clausen
    Signed-off-by: Mark Brown
    Signed-off-by: Greg Kroah-Hartman

    Lars-Peter Clausen
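
    The changed condition reduces to a small predicate; the name is
    illustrative, not the regmap API:

```c
/* Decide whether a raw default read from the hardware should be
 * entered into the register-defaults table. Per the patch: the
 * value-is-zero check is gone, and volatile registers are skipped
 * because they are never cached. */
static int should_cache_raw_default(unsigned int raw_value,
                                    int reg_is_volatile)
{
    (void)raw_value;          /* 0 is a perfectly good default now */
    return !reg_is_volatile;
}
```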
     
  • commit 72ba009b8a159e995e40d3b4e5d7d265acead983 upstream.

    BugLink: http://bugs.launchpad.net/bugs/900802

    Signed-off-by: Till Kamppeter
    Signed-off-by: Tim Gardner
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Tim Gardner
     
  • commit b57e6b560fc2a2742910ac5ca0eb2c46e45aeac2 upstream.

    read_lock(&tpt_trig->trig.leddev_list_lock) is accessed via the path
    ieee80211_open (->) ieee80211_do_open (->) ieee80211_mod_tpt_led_trig
    (->) ieee80211_start_tpt_led_trig (->) tpt_trig_timer before it is
    initialized. The initialization of this read/write lock happens via
    the path ieee80211_led_init (->) led_trigger_register, but we are
    doing 'ieee80211_led_init' after 'ieee80211_if_add', where we
    register netdev_ops. So we access leddev_list_lock before
    initializing it, which causes the following bug on Chrome laptops
    with AR928X cards with the following script:

    while true
    do
    sudo modprobe -v ath9k
    sleep 3
    sudo modprobe -r ath9k
    sleep 3
    done

    BUG: rwlock bad magic on CPU#1, wpa_supplicant/358, f5b9eccc
    Pid: 358, comm: wpa_supplicant Not tainted 3.0.13 #1
    Call Trace:

    [] rwlock_bug+0x3d/0x47
    [] do_raw_read_lock+0x19/0x29
    [] _raw_read_lock+0xd/0xf
    [] tpt_trig_timer+0xc3/0x145 [mac80211]
    [] ieee80211_mod_tpt_led_trig+0x152/0x174 [mac80211]
    [] ieee80211_do_open+0x11e/0x42e [mac80211]
    [] ? ieee80211_check_concurrent_iface+0x26/0x13c [mac80211]
    [] ieee80211_open+0x48/0x4c [mac80211]
    [] __dev_open+0x82/0xab
    [] __dev_change_flags+0x9c/0x113
    [] dev_change_flags+0x18/0x44
    [] devinet_ioctl+0x243/0x51a
    [] inet_ioctl+0x93/0xac
    [] sock_ioctl+0x1c6/0x1ea
    [] ? might_fault+0x20/0x20
    [] do_vfs_ioctl+0x46e/0x4a2
    [] ? fget_light+0x2f/0x70
    [] ? sys_recvmsg+0x3e/0x48
    [] sys_ioctl+0x46/0x69
    [] sysenter_do_call+0x12/0x2

    Cc: Gary Morain
    Cc: Paul Stewart
    Cc: Abhijit Pradhan
    Cc: Vasanthakumar Thiagarajan
    Cc: Rajkumar Manoharan
    Acked-by: Johannes Berg
    Tested-by: Mohammed Shafi Shajakhan
    Signed-off-by: Mohammed Shafi Shajakhan
    Signed-off-by: John W. Linville
    Signed-off-by: Greg Kroah-Hartman

    Mohammed Shafi Shajakhan
     
  • commit 71f6bd4a23130cd2f4b036010c5790b1295290b9 upstream.

    Fixes PCI device detection on the IBM xSeries 3850 M2 / x3950 M2
    when using ACPI resources (_CRS).
    This is the default; a manual workaround (without this patch)
    would be the pci=nocrs boot param.

    V2: Add dev_warn if the workaround is hit. This should reveal
    how common such setups are (via google) and point to possible
    problems if things are still not working as expected.
    -> Suggested by Jan Beulich.

    Tested-by: garyhade@us.ibm.com
    Signed-off-by: Yinghai Lu
    Signed-off-by: Jesse Barnes
    Signed-off-by: Greg Kroah-Hartman

    Yinghai Lu
     
  • commit b7f5b7dec3d539a84734f2bcb7e53fbb1532a40b upstream.

    MSI_REARM_EN register is a write-only trigger register.
    There is no need to RMW when re-arming.

    May fix:
    https://bugs.freedesktop.org/show_bug.cgi?id=41668

    Signed-off-by: Alex Deucher
    Signed-off-by: Dave Airlie
    Signed-off-by: Greg Kroah-Hartman

    Alex Deucher
     
  • commit e8c9dc93e27d891636defbc269f182a83e6abba8 upstream.

    Registration of at91_udc as a module will enable SoC-related code.

    Fix following an idea from Karel Znamenacek.

    Signed-off-by: Nicolas Ferre
    Acked-by: Karel Znamenacek
    Acked-by: Jean-Christophe PLAGNIOL-VILLARD
    Signed-off-by: Greg Kroah-Hartman

    Nicolas Ferre
     
  • commit 9a45a9407c69d068500923480884661e2b9cc421 upstream.

    perf on POWER stopped working after commit e050e3f0a71b (perf: Fix
    broken interrupt rate throttling). That patch exposed a bug in
    the POWER perf_events code.

    Since the PMCs count upwards and take an exception when the top bit
    is set, we want to write 0x80000000 - left in power_pmu_start. We were
    instead programming in left which effectively disables the counter
    until we eventually hit 0x80000000. This could take seconds or longer.

    With the patch applied I get the expected number of samples:

    SAMPLE events: 9948

    Signed-off-by: Anton Blanchard
    Acked-by: Paul Mackerras
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Anton Blanchard
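
    The arithmetic is worth spelling out; this is a sketch of the counter
    programming described above, not the powerpc perf_events code itself:

```c
#include <stdint.h>

/* POWER PMCs are 32-bit, count upward, and take an exception when the
 * top bit becomes set. To get a sample after 'left' more events, write
 * 0x80000000 - left; writing 'left' itself leaves roughly 2^31 - left
 * events before the next interrupt. */
static uint32_t pmc_start_value(uint32_t left)
{
    return 0x80000000u - left;
}

/* Events counted from 'start' until bit 31 is first set: */
static uint32_t events_until_interrupt(uint32_t start)
{
    return (start & 0x80000000u) ? 0 : 0x80000000u - start;
}
```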
     
  • commit 735e93c70434614bffac4a914ca1da72e37d43c0 upstream.

    This adds the .gitignore file for the autogenerated TOMOYO files to keep
    git from complaining after building things.

    Cc: Kentaro Takeda
    Cc: Tetsuo Handa
    Cc: James Morris
    Acked-by: Tetsuo Handa
    Signed-off-by: James Morris
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

28 Feb, 2012

5 commits

  • Greg Kroah-Hartman
     
  • commit 34ddc81a230b15c0e345b6b253049db731499f7e upstream.

    After all the FPU state cleanups and finally finding the problem that
    caused all our FPU save/restore problems, this re-introduces the
    preloading of FPU state that was removed in commit b3b0870ef3ff ("i387:
    do not preload FPU state at task switch time").

    However, instead of simply reverting the removal, this reimplements
    preloading with several fixes, most notably

    - properly abstracted as a true FPU state switch, rather than as
    open-coded save and restore with various hacks.

    In particular, implementing it as a proper FPU state switch allows us
    to optimize the CR0.TS flag accesses: there is no reason to set the
    TS bit only to then almost immediately clear it again. CR0 accesses
    are quite slow and expensive, don't flip the bit back and forth for
    no good reason.

    - Make sure that the same model works for both x86-32 and x86-64, so
    that there are no gratuitous differences between the two due to the
    way they save and restore segment state differently due to
    architectural differences that really don't matter to the FPU state.

    - Avoid exposing the "preload" state to the context switch routines,
    and in particular allow the concept of lazy state restore: if nothing
    else has used the FPU in the meantime, and the process is still on
    the same CPU, we can avoid restoring state from memory entirely, just
    re-expose the state that is still in the FPU unit.

    That optimized lazy restore isn't actually implemented here, but the
    infrastructure is set up for it. Of course, older CPU's that use
    'fnsave' to save the state cannot take advantage of this, since the
    state saving also trashes the state.

    In other words, there is now an actual _design_ to the FPU state saving,
    rather than just random historical baggage. Hopefully it's easier to
    follow as a result.

    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit f94edacf998516ac9d849f7bc6949a703977a7f3 upstream.

    This moves the bit that indicates whether a thread has ownership of the
    FPU from the TS_USEDFPU bit in thread_info->status to a word of its own
    (called 'has_fpu') in task_struct->thread.has_fpu.

    This fixes two independent bugs at the same time:

    - changing 'thread_info->status' from the scheduler causes nasty
    problems for the other users of that variable, since it is defined to
    be thread-synchronous (that's what the "TS_" part of the naming was
    supposed to indicate).

    So perfectly valid code could (and did) do

    ti->status |= TS_RESTORE_SIGMASK;

    and the compiler was free to do that as separate load, or and store
    instructions. Which can cause problems with preemption, since a task
    switch could happen in between, and change the TS_USEDFPU bit. The
    change to TS_USEDFPU would be overwritten by the final store.

    In practice, this seldom happened, though, because the 'status' field
    was seldom used more than once, so gcc would generally tend to
    generate code that used a read-modify-write instruction and thus
    happened to avoid this problem - RMW instructions are naturally low
    fat and preemption-safe.

    - On x86-32, the current_thread_info() pointer would, during interrupts
    and softirqs, point to a *copy* of the real thread_info, because
    x86-32 uses %esp to calculate the thread_info address, and thus the
    separate irq (and softirq) stacks would cause these kinds of odd
    thread_info copy aliases.

    This is normally not a problem, since interrupts aren't supposed to
    look at thread information anyway (what thread is running at
    interrupt time really isn't very well-defined), but it confused the
    heck out of irq_fpu_usable() and the code that tried to squirrel
    away the FPU state.

    (It also caused untold confusion for us poor kernel developers).

    It also turns out that using 'task_struct' is actually much more natural
    for most of the call sites that care about the FPU state, since they
    tend to work with the task struct for other reasons anyway (ie
    scheduling). And the FPU data that we are going to save/restore is
    found there too.

    Thanks to Arjan Van De Ven for pointing us to
    the %esp issue.

    Cc: Arjan van de Ven
    Reported-and-tested-by: Raphael Prevost
    Acked-and-tested-by: Suresh Siddha
    Tested-by: Peter Anvin
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
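
    The lost-update hazard described above can be made concrete with a
    deterministic toy; the flag values and the "preemption point" are
    illustrative:

```c
#include <stdint.h>

#define TS_USEDFPU         0x0001u  /* values illustrative */
#define TS_RESTORE_SIGMASK 0x0008u

static uint32_t status = TS_USEDFPU;

/* Stands in for a task switch clearing TS_USEDFPU concurrently. */
static void simulated_task_switch(void)
{
    status &= ~TS_USEDFPU;
}

/* 'ti->status |= TS_RESTORE_SIGMASK' compiled as separate load / or /
 * store: if preemption lands between the load and the store, the task
 * switch's clearing of TS_USEDFPU is silently overwritten. */
static void set_restore_sigmask_split(void)
{
    uint32_t tmp = status;          /* load                       */
    tmp |= TS_RESTORE_SIGMASK;      /* or, in a register          */
    simulated_task_switch();        /* preemption hits right here */
    status = tmp;                   /* store: lost update         */
}
```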
     
  • commit 4903062b5485f0e2c286a23b44c9b59d9b017d53 upstream.

    The AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception is
    pending. In order to not leak FIP state from one process to another, we
    need to do a floating point load after the fxsave of the old process,
    and before the fxrstor of the new FPU state. That resets the state to
    the (uninteresting) kernel load, rather than some potentially sensitive
    user information.

    We used to do this directly after the FPU state save, but that is
    actually very inconvenient, since it

    (a) corrupts what is potentially perfectly good FPU state that we might
    want to lazy avoid restoring later and

    (b) on x86-64 it resulted in a very annoying ordering constraint, where
    "__unlazy_fpu()" in the task switch needs to be delayed until after
    the DS segment has been reloaded just to get the new DS value.

    Coupling it to the fxrstor instead of the fxsave automatically avoids
    both of these issues, and also ensures that we only do it when actually
    necessary (the FP state after a save may never actually get used). It's
    simply a much more natural place for the leaked state cleanup.

    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit b3b0870ef3ffed72b92415423da864f440f57ad6 upstream.

    Yes, taking the trap to re-load the FPU/MMX state is expensive, but so
    is spending several days looking for a bug in the state save/restore
    code. And the preload code has some rather subtle interactions with
    both paravirtualization support and segment state restore, so it's not
    nearly as simple as it should be.

    Also, now that we no longer necessarily depend on a single bit (ie
    TS_USEDFPU) for keeping track of the state of the FPU, we might be able
    to do better. If we are really switching between two processes that
    keep touching the FP state, save/restore is inevitable, but in the case
    of having one process that does most of the FPU usage, we may actually
    be able to do much better than the preloading.

    In particular, we may be able to keep track of which CPU the process ran
    on last, and also per CPU keep track of which process' FP state that CPU
    has. For modern CPU's that don't destroy the FPU contents on save time,
    that would allow us to do a lazy restore by just re-enabling the
    existing FPU state - with no restore cost at all!

    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
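
    The bookkeeping the text describes can be sketched as below; the names
    are illustrative, and this is the idea being set up, not the eventual
    implementation:

```c
#include <stddef.h>

#define NR_CPUS 4

struct task_stub { int last_cpu; };  /* stand-in for task_struct */

/* Per-CPU record of whose FPU state the hardware registers hold. */
static struct task_stub *fpu_owner[NR_CPUS];

/* Lazy-restore test: restoring from memory is only needed when this
 * CPU's registers no longer hold this task's state, i.e. the task
 * migrated or another task used the FPU in the meantime. */
static int need_fpu_restore(struct task_stub *task, int cpu)
{
    return !(fpu_owner[cpu] == task && task->last_cpu == cpu);
}
```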