21 Oct, 2011

12 commits

  • Here is an update of Bob's original rbtree patch which, in addition, also
    resolves the rather strange ref counting that was being done relating to
    the bitmap blocks.

    Originally we had a dual system for journaling resource groups. The metadata
    blocks were journaled and also the rgrp itself was added to a list. The reason
    for adding the rgrp to the list in the journal was so that the "repolish
    clones" code could be run to update the free space, and potentially send any
    discard requests when the log was flushed. This was done by comparing the
    "cloned" bitmap with what had been written back on disk during the transaction
    commit.

    Due to this, there was a requirement to hang on to the rgrps' bitmap buffers
    until the journal had been flushed. For that reason, there was a rather
    complicated set up in the ->go_lock ->go_unlock functions for rgrps involving
    both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference
    count on the buffers.

    However, the journal maintains a reference count on the buffers anyway, since
    they are being journaled as metadata buffers. So by moving the code which deals
    with the post-journal accounting for bitmap blocks to the metadata journaling
    code, we can entirely dispense with the rather strange buffer ref counting
    scheme and also the requirement to journal the rgrps.

    The net result of all this is that the ->sd_rindex_spin is left to do exactly
    one job, and that is to look after the rbtree of rgrps.

    This patch is designed to be a stepping stone towards using RCU for the rbtree
    of resource groups; however, the reduction in the number of uses of the
    ->sd_rindex_spin is likely to benefit multi-threaded workloads anyway.

    The patch retains ->go_lock and ->go_unlock for rgrps, however these may also
    be removed in future in favour of calling the functions directly where required
    in the code. That will allow locking of resource groups without needing to
    actually read them in - something that could be useful in speeding up statfs.

    In the mean time though it is valid to dereference ->bi_bh only when the rgrp
    is locked. This is basically the same rule as before, modulo the references not
    being valid until the following journal flush.

    Signed-off-by: Steven Whitehouse
    Signed-off-by: Bob Peterson
    Cc: Benjamin Marzinski

    Bob Peterson
     
  • We need to take the inode's glock whenever the inode's size
    is referenced, otherwise it might not be uptodate. Even
    though generic_file_llseek_unlocked() doesn't implement
    SEEK_DATA, SEEK_HOLE directly, it does reference the inode's
    size in those cases, so we need to add them to the list
    of origins which need the glock.

    Signed-off-by: Steven Whitehouse
    Cc: Andi Kleen

    Steven Whitehouse
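
The rule described in the llseek change above can be sketched as a small predicate. This is a minimal user-space model, not the GFS2 code: the enum and the function name origin_needs_glock() are hypothetical and exist only for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's SEEK_* origins. */
enum seek_origin {
    ORIGIN_SET,
    ORIGIN_CUR,
    ORIGIN_END,
    ORIGIN_DATA,
    ORIGIN_HOLE
};

/*
 * Returns true when handling the origin reads the inode size, in which
 * case the inode's glock must be held so that the size is up to date.
 */
bool origin_needs_glock(enum seek_origin origin)
{
    switch (origin) {
    case ORIGIN_END:  /* offset is relative to the file size */
    case ORIGIN_DATA: /* the generic helper consults the file size */
    case ORIGIN_HOLE:
        return true;
    case ORIGIN_SET:
    case ORIGIN_CUR:
        return false;
    }
    return false;
}
```

In this sketch a caller would take the glock only when the predicate returns true and fall through to the unlocked seek otherwise.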
     
  • If we pass through knowledge of whether the creation is intended to be
    exclusive or not, then we can deal with that in gfs2_create_inode
    and remove one set of locking. Also this removes the loop in
    gfs2_create and simplifies the code a bit.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The aim of this patch is to use the newly enhanced ->dirty_inode()
    super block operation to deal with atime updates, rather than
    piggy backing that code into ->write_inode() as is currently
    done.

    The net result is a simplification of the code in various places
    and a reduction of the number of gfs2_dinode_out() calls since
    this is now implied by ->dirty_inode().

    Some of the mark_inode_dirty() calls have been moved under glocks
    in order to take advantage of then being able to avoid locking in
    ->dirty_inode() when we already have suitable locks.

    One consequence is that generic_write_end() now correctly deals
    with file size updates, so that we do not need a separate check
    for that afterwards. This also, indirectly, means that fdatasync
    should work correctly on GFS2 - the current code always syncs the
    metadata whether it needs to or not.

    Has survived testing with postmark (with and without atime) and
    also fsx.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Journaled data requires that a complete flush of all dirty data for
    the file is done, in order that the ail flush which comes after
    will succeed.

    Also the recently enhanced bug trap can trigger falsely in case
    an ail flush from fsync races with a page read. This updates the
    bug trap such that it will ignore buffers which are locked and
    only trigger on dirty and/or pinned buffers when the ail flush
    is run from fsync. The original bug trap is retained when ail
    flush is run from ->go_sync()

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • If we have got far enough through the inode allocation code
    path that an inode has already been allocated, then we must
    call iput to dispose of it, if an error occurs during a
    later part of the process. This will always be the final iput
    since there will be no other references to the inode.

    Unlike when the inode has been unlinked, its block state will
    be GFS2_BLKST_INODE rather than GFS2_BLKST_UNLINKED so we need
    to skip the test in ->evict_inode() for this one case in order
    to ensure that it will be deallocated correctly. This patch adds
    a new flag in order to ensure that this will happen correctly.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • We do not need to start a transaction unless the atime
    check has proved positive. Also if we are going to flush
    the complete ail list anyway, we might as well skip the
    writeback for this specific inode's metadata, since that
    will be done as part of the ail writeback process in an
    order offering potentially more efficient I/O.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • The assert was being tested under the wrong lock, a
    legacy of the original code. Also, if it does trigger,
    the resulting information was not always a lot of help.

    This moves the assert under the correct lock and also
    prints out more useful information for tracking down the
    source of the problem.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Now that the data writing is part of fsync proper, we can split
    the waiting part out and do it later on. This reduces the
    number of waits that we do during fsync on average.

    There is also no need to take the i_mutex unless we are flushing
    metadata to disk, so we can move that to within the metadata
    flushing code.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • Since there is now only a single caller to gfs2_dir_read_data()
    and it has a number of constant arguments, we can factor
    those out. Also some tests relating to the inode size were
    being done twice.

    Signed-off-by: Steven Whitehouse

    Steven Whitehouse
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
    sparc: Add alignment flag to PCI expansion resources
    sparc: Avoid calling sigprocmask()
    sparc: Use set_current_blocked()
    sparc32,leon: SRMMU MMU Table probe fix

    Linus Torvalds
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
    fib_rules: fix unresolved_rules counting
    r8169: fix wrong eee setting for rlt8111evl
    r8169: fix driver shutdown WoL regression.
    ehea: Change maintainer to me
    pptp: pptp_rcv_core() misses pskb_may_pull() call
    tproxy: copy transparent flag when creating a time wait
    pptp: fix skb leak in pptp_xmit()
    bonding: use local function pointer of bond->recv_probe in bond_handle_frame
    smsc911x: Add support for SMSC LAN89218
    tg3: negate USE_PHYLIB flag check
    netconsole: enable netconsole can make net_device refcnt incorrent
    bluetooth: Properly clone LSM attributes to newly created child connections
    l2tp: fix a potential skb leak in l2tp_xmit_skb()
    bridge: fix hang on removal of bridge via netlink
    x25: Prevent skb overreads when checking call user data
    x25: Handle undersized/fragmented skbs
    x25: Validate incoming call user data lengths
    udplite: fast-path computation of checksum coverage
    IPVS netns shutdown/startup dead-lock
    netfilter: nf_conntrack: fix event flooding in GRE protocol tracker

    Linus Torvalds
     

20 Oct, 2011

6 commits

  • I don't usually pay much attention to the stale "? " addresses in
    stack backtraces, but this lucky report from Pawel Sikora hints that
    mremap's move_ptes() has inadequate locking against page migration.

    3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
    kernel BUG at include/linux/swapops.h:105!
    RIP: 0010:[] []
    migration_entry_wait+0x156/0x160
    [] handle_pte_fault+0xae1/0xaf0
    [] ? __pte_alloc+0x42/0x120
    [] ? do_huge_pmd_anonymous_page+0xab/0x310
    [] handle_mm_fault+0x181/0x310
    [] ? vma_adjust+0x537/0x570
    [] do_page_fault+0x11d/0x4e0
    [] ? do_mremap+0x2d5/0x570
    [] page_fault+0x1f/0x30

    mremap's down_write of mmap_sem, together with i_mmap_mutex or lock,
    and pagetable locks, were good enough before page migration (with its
    requirement that every migration entry be found) came in, and enough
    while migration always held mmap_sem; but not enough nowadays, when
    there's memory hotremove and compaction.

    The danger is that move_ptes() lets a migration entry dodge around
    behind remove_migration_pte()'s back, so it's in the old location when
    looking at the new, then in the new location when looking at the old.

    Either mremap's move_ptes() must additionally take anon_vma lock(), or
    migration's remove_migration_pte() must stop peeking for is_swap_entry()
    before it takes pagetable lock.

    Consensus chooses the latter: we prefer to add overhead to migration
    than to mremapping, which gets used by JVMs and by exec stack setup.

    Reported-and-tested-by: Paweł Sikora
    Signed-off-by: Hugh Dickins
    Acked-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Currently no type of alignment is specified for PCI expansion roms while
    parsing the openfirmware tree. This causes calls to pci_map_rom() to fail.
    IORESOURCE_SIZEALIGN is the default alignment used for rom resources in
    pci/probe.c, and has been verified to work with various cards on an Ultra 10.

    Signed-off-by: Kjetil Oftedal
    Signed-off-by: David S. Miller

    Kjetil Oftedal
     
  • We should decrement ops->unresolved_rules when deleting an unresolved rule.

    Signed-off-by: Zheng Yan
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yan, Zheng
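
The counting fix above amounts to keeping the add and delete paths symmetric. The following is a minimal sketch under stated assumptions: the toy_rule and toy_rules_ops types and both functions are hypothetical models, and "unresolved" is taken to mean a goto rule whose target does not exist yet.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a rule: a goto rule may lack a target. */
struct toy_rule {
    bool is_goto;
    const struct toy_rule *target; /* NULL means unresolved */
};

/* Hypothetical model of the per-family counters. */
struct toy_rules_ops {
    unsigned int nr_goto_rules;
    unsigned int unresolved_rules;
};

/* Adding an unresolved goto rule bumps both counters. */
void toy_add_rule(struct toy_rules_ops *ops, const struct toy_rule *rule)
{
    if (rule->is_goto) {
        ops->nr_goto_rules++;
        if (rule->target == NULL)
            ops->unresolved_rules++;
    }
}

/* Deleting must undo exactly what adding did, including the
 * unresolved count, otherwise the counter drifts upwards. */
void toy_del_rule(struct toy_rules_ops *ops, const struct toy_rule *rule)
{
    if (rule->is_goto) {
        ops->nr_goto_rules--;
        if (rule->target == NULL)
            ops->unresolved_rules--;
    }
}
```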
     
  • Correct the wrong parameter for setting EEE for RTL8111E-VL.

    Signed-off-by: Hayes Wang
    Signed-off-by: David S. Miller

    hayeswang
     
  • Due to commit 92fc43b4159b518f5baae57301f26d770b0834c9 ("r8169: modify the
    flow of the hw reset."), rtl8169_hw_reset stomps during driver shutdown on
    RxConfig bits which are needed for WOL on some versions of the hardware.

    As these bits were formerly set from the r81{0x, 68}_pll_power_down methods,
    factor them out for use in the driver shutdown (rtl_shutdown) handler.

    I favored __rtl8169_get_wol() -hardware state indication- over
    RTL_FEATURE_WOL as the latter has become a good candidate for removal.

    Signed-off-by: Francois Romieu
    Cc: Hayes
    Tested-by: Marc Ballarin
    Signed-off-by: David S. Miller

    françois romieu
     
  • Breno Leitao has passed the maintainership to me.

    Signed-off-by: Thadeu Lima de Souza Cascardo
    Cc: Breno Leitao
    Acked-by: Breno Leitão
    Signed-off-by: David S. Miller

    Thadeu Lima de Souza Cascardo
     

19 Oct, 2011

14 commits

  • * 'v4l_for_linus' of git://linuxtv.org/mchehab/for_linus:
    [media] videodev: fix a NULL pointer dereference in v4l2_device_release()

    Linus Torvalds
     
  • * 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
    drm/radeon/kms/atom: fix handling of FB scratch indices
    drm/radeon/kms/DCE4.1: fix Select_CrtcSource EncodeMode setting for DP bridges (v2)
    drm/radeon/kms/DCE4.1: ss is not supported on the internal pplls
    drm/radeon/kms/DCE4.1: fix dig encoder to transmitter mapping
    ttm: Fix error-path using an uninitialized value

    Linus Torvalds
     
  • The change in 8280b66 does not cover the case when v4l2_dev is already
    NULL, fix that.

    With a Kinect sensor, seen as a USB camera using GSPCA in this context,
    a NULL pointer dereference BUG can be triggered by just unplugging the
    device after the camera driver has been loaded.

    Signed-off-by: Antonio Ospite
    Signed-off-by: Mauro Carvalho Chehab

    Antonio Ospite
     
  • FB scratch indices are dword indices, but we were treating
    them as byte indices. As such, we were getting the wrong
    FB scratch data for non-0 indices. Fix the indices and
    guard the indexing against indices larger than the scratch
    allocation.

    Fixes memory corruption on some boards if data was written
    past the end of the FB scratch array.

    Signed-off-by: Alex Deucher
    Reported-by: Dave Airlie
    Tested-by: Dave Airlie
    Cc: stable@kernel.org
    Signed-off-by: Dave Airlie

    Alex Deucher
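
The difference between dword and byte indexing, together with the bounds guard, can be illustrated with a short sketch. It is a user-space model only: toy_fb_scratch_read() and its parameters are hypothetical, and returning 0 for an out-of-range index is an assumption of this sketch rather than a statement about the driver.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Reads one 32-bit word from a scratch area.
 * dword_idx counts 32-bit words, not bytes, so the limit is the
 * allocation size in bytes divided by the word size.
 */
uint32_t toy_fb_scratch_read(const uint32_t *scratch, size_t size_bytes,
                             size_t dword_idx)
{
    if (dword_idx >= size_bytes / sizeof(uint32_t))
        return 0; /* guard: index lies past the scratch allocation */
    return scratch[dword_idx];
}
```

Treating the same index as a byte offset would divide it by four before the lookup, which is why only index 0 gave the right data before the fix.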
     
  • e1000e uses paged frags, so any layer incorrectly pulling bytes from skb
    can trigger a BUG in skb_pull()

    [951.142737] [] skb_pull+0x15/0x17
    [951.142737] [] pptp_rcv_core+0x126/0x19a [pptp]
    [951.152725] [] sk_receive_skb+0x69/0x105
    [951.163558] [] pptp_rcv+0xc8/0xdc [pptp]
    [951.165092] [] gre_rcv+0x62/0x75 [gre]
    [951.165092] [] ip_local_deliver_finish+0x150/0x1c1
    [951.177599] [] ? ip_local_deliver_finish+0x0/0x1c1
    [951.177599] [] NF_HOOK.clone.7+0x51/0x58
    [951.177599] [] ip_local_deliver+0x51/0x55
    [951.177599] [] ip_rcv_finish+0x31a/0x33e
    [951.177599] [] ? ip_rcv_finish+0x0/0x33e
    [951.204898] [] NF_HOOK.clone.7+0x51/0x58
    [951.214651] [] ip_rcv+0x21b/0x246

    pptp_rcv_core() is a nice example of a function assuming everything it
    needs is available in skb head.

    Reported-by: Bradley Peterson
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
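
Why a may-pull check is needed before pulling header bytes can be shown with a toy model. Nothing here is the kernel API: toy_skb, toy_may_pull() and toy_rcv_core() are hypothetical, and the packet is reduced to two byte counts, one for the linear head and one for the paged fragments.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical packet: bytes in the linear head and in paged frags. */
struct toy_skb {
    size_t head_len;
    size_t frag_len;
};

/*
 * Makes sure at least len bytes are in the linear head, moving bytes
 * over from the fragments if needed. Fails if the packet is too short.
 */
bool toy_may_pull(struct toy_skb *skb, size_t len)
{
    if (len <= skb->head_len)
        return true;
    if (len > skb->head_len + skb->frag_len)
        return false;
    skb->frag_len -= len - skb->head_len;
    skb->head_len = len;
    return true;
}

/* Receive path: check first, and only then strip the header. */
int toy_rcv_core(struct toy_skb *skb, size_t hdr_len)
{
    if (!toy_may_pull(skb, hdr_len))
        return -1; /* drop: header is not fully present */
    skb->head_len -= hdr_len;
    return 0;
}
```

Without the check, stripping an 8-byte header from a packet with only 2 bytes in its head would underflow the head length, which is the situation the BUG in the trace above reports.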
     
  • The transparent socket option setting was not copied to the time wait
    socket when an inet socket was being replaced by a time wait socket. This
    broke the --transparent option of the socket match and may have caused
    that FIN packets belonging to sockets in FIN_WAIT2 or TIME_WAIT state
    were being dropped by the packet filter.

    Signed-off-by: KOVACS Krisztian
    Signed-off-by: David S. Miller

    KOVACS Krisztian
     
  • If we can't transmit the skb, we must free it.

    Signed-off-by: Eric Dumazet
    CC: Dmitry Kozlov
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The bond->recv_probe is called in bond_handle_frame() when
    a packet is received, but bond_close() sets it to NULL. So,
    a panic occurs when both functions work in parallel.

    Why this happens:
    After null pointer check of bond->recv_probe, an sk_buff is
    duplicated and bond->recv_probe is called in bond_handle_frame.
    So, a panic occurs when bond_close() is called between the
    check and call of bond->recv_probe.

    Patch:
    This patch uses a local function pointer of bond->recv_probe
    in bond_handle_frame(). So, it can avoid the null pointer
    dereference.

    Signed-off-by: Mitsuo Hayasaka
    Cc: Jay Vosburgh
    Cc: Andy Gospodarek
    Cc: Eric Dumazet
    Cc: WANG Cong
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Mitsuo Hayasaka
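
The local function pointer pattern described above can be sketched as follows. This is a single-threaded user-space illustration, not the bonding driver: toy_bond, toy_handle_frame() and double_frame() are hypothetical, and the volatile cast is only a stand-in for a read-once primitive.

```c
#include <stddef.h>

typedef int (*recv_probe_fn)(int frame);

/* Hypothetical model of the device state holding the callback. */
struct toy_bond {
    recv_probe_fn recv_probe; /* may be set to NULL by a close path */
};

int toy_handle_frame(struct toy_bond *bond, int frame)
{
    /* Read the shared pointer exactly once into a local, so the NULL
     * check and the call below operate on the same value. */
    recv_probe_fn probe = *(recv_probe_fn volatile *)&bond->recv_probe;

    if (probe == NULL)
        return -1; /* no probe installed */
    return probe(frame);
}

/* Example probe used to exercise the handler. */
int double_frame(int frame)
{
    return frame * 2;
}
```

The point of the design is that the field itself is never dereferenced after the check, so a concurrent writer clearing it cannot turn a checked pointer into a NULL call.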
     
  • LAN89218 is register compatible with LAN911x.

    Signed-off-by: Phil Edworthy
    Signed-off-by: David S. Miller

    Phil Edworthy
     
  • USE_PHYLIB flag in tg3_remove_one() is being checked incorrectly. As a
    result, tg3_phy_fini->phy_disconnect is never called when the tg3 module
    is removed.

    In my case this resulted in panics in phy_state_machine calling function
    phydev->adjust_link.

    So correct this check.

    Signed-off-by: Jiri Pirko
    Acked-by: Matt Carlson
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • There is no check whether netconsole is currently enabled,
    so every time "echo 1 > enabled" is executed,
    the net_device reference count is incremented again.

    Signed-off-by: Gao feng
    Acked-by: Flavio Leitner
    Signed-off-by: David S. Miller

    Gao feng
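
A state check of the kind described above might look like the sketch below. It is a hypothetical model: toy_target and toy_store_enabled() are invented names, and rejecting a no-op write with -EINVAL is an assumption of this sketch.

```c
#include <errno.h>

/* Hypothetical model of a netconsole target. */
struct toy_target {
    int enabled;    /* 0 or 1 */
    int dev_refcnt; /* references held on the network device */
};

/*
 * Handles a write to the "enabled" attribute. A write that would not
 * change the state is rejected, so the device reference is taken at
 * most once per enable and dropped at most once per disable.
 */
int toy_store_enabled(struct toy_target *nt, int enabled)
{
    if (enabled < 0 || enabled > 1)
        return -EINVAL;
    if (enabled == nt->enabled)
        return -EINVAL; /* already in the requested state */

    if (enabled)
        nt->dev_refcnt++;
    else
        nt->dev_refcnt--;
    nt->enabled = enabled;
    return 0;
}
```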
     
  • The Bluetooth stack has internal connection handlers for all of the various
    Bluetooth protocols, and unfortunately, they are currently lacking the LSM
    hooks found in the core network stack's connection handlers. I say
    unfortunately, because this can cause problems for users who have an
    LSM enabled and are using certain Bluetooth devices. See one problem
    report below:

    * http://bugzilla.redhat.com/show_bug.cgi?id=741703

    In order to keep things simple at this point in time, this patch fixes the
    problem by cloning the parent socket's LSM attributes to the newly created
    child socket. If we decide we need a more elaborate LSM marking mechanism
    for Bluetooth (I somewhat doubt this) we can always revisit this decision
    in the future.

    Reported-by: James M. Cape
    Signed-off-by: Paul Moore
    Acked-by: James Morris
    Signed-off-by: David S. Miller

    Paul Moore
     
  • l2tp_xmit_skb() can leak one skb if skb_cow_head() returns an error.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • We need to clean up bridge device timers and ports when a bridge
    device is being removed via netlink.

    This fixes the problem observed when doing:
    ip link add br0 type bridge
    ip link set dev eth1 master br0
    ip link set br0 up
    ip link del br0

    which would cause br0 to hang in unregister_netdev because
    of leftover reference count.

    Reported-by: Sridhar Samudrala
    Signed-off-by: Stephen Hemminger
    Acked-by: Sridhar Samudrala
    Signed-off-by: David S. Miller

    stephen hemminger
     

18 Oct, 2011

8 commits