17 Nov, 2012

1 commit

  • When a node is removed that held a PW/EX lock, the
    existing master node should invalidate the lvb on the
    resource due to the purged lock.

    Previously, the existing master node was invalidating
    the lvb if it found only NL/CR locks on the resource
    during recovery for the removed node. This could lead
    to cases where it invalidated the lvb and shouldn't
    have, or cases where it should have invalidated and
    didn't.

    When recovery selects a *new* master node for a
    resource, and that new master finds only NL/CR locks
    on the resource after lock recovery, it should
    invalidate the lvb. This case was handled correctly
    (but was incorrectly applied to the existing master
    case also.)

    When a process exits while holding a PW/EX lock,
    the lvb on the resource should be invalidated.
    This was not happening.

    The lvb contents and VALNOTVALID flag should be
    recovered before granting locks in recovery so that
    the recovered lvb state is provided in the callback.
    The lvb was being recovered after the lock was granted.

    Signed-off-by: David Teigland

    David Teigland
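
The two invalidation rules described in this commit can be sketched in userspace C. This is only an illustration of the decision logic, not the kernel code: the function names are hypothetical, and treating an empty lock set as "only NL/CR" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Lock modes in increasing strength; values mirror the DLM_LOCK_* constants. */
enum lock_mode { NL = 0, CR = 1, CW = 2, PR = 3, PW = 4, EX = 5 };

/*
 * Existing master (hypothetical helper): invalidate the lvb if any lock
 * purged for the removed node was held in PW or EX mode.
 */
bool existing_master_invalidates(const enum lock_mode *purged, size_t n)
{
    assert(purged != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (purged[i] >= PW)
            return true;
    return false;
}

/*
 * New master (hypothetical helper): invalidate the lvb if only NL/CR
 * locks remain on the resource after lock recovery.
 * Assumption: an empty set counts as "only NL/CR".
 */
bool new_master_invalidates(const enum lock_mode *remaining, size_t n)
{
    assert(remaining != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (remaining[i] > CR)
            return false;
    return true;
}
```

Keeping the two cases in separate functions reflects the point of the commit: the NL/CR test belongs to the new-master case only.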
     

02 Nov, 2012

2 commits


03 Oct, 2012

1 commit

  • Pull networking changes from David Miller:

    1) GRE now works over ipv6, from Dmitry Kozlov.

    2) Make SCTP more network namespace aware, from Eric Biederman.

    3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.

    4) Make openvswitch network namespace aware, from Pravin B Shelar.

    5) IPV6 NAT implementation, from Patrick McHardy.

    6) Server side support for TCP Fast Open, from Jerry Chu and others.

    7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
    Borkmann.

    8) Increase the loopback default MTU to 64K, from Eric Dumazet.

    9) Use a per-task rather than per-socket page fragment allocator for
    outgoing networking traffic. This benefits processes that have very
    many mostly idle sockets, which is quite common.

    From Eric Dumazet.

    10) Use up to 32K for page fragment allocations, with fallbacks to
    smaller sizes when higher order page allocations fail. Benefits are
    a) fewer segments for driver to process b) fewer calls to page
    allocator c) less waste of space.

    From Eric Dumazet.

    11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.

    12) VXLAN device driver, one way to handle VLAN issues such as the
    limitation of 4096 VLAN IDs yet still have some level of isolation.
    From Stephen Hemminger.

    13) As usual there is a large boatload of driver changes, with the scale
    perhaps tilted towards the wireless side this time around.

    Fix up various fairly trivial conflicts, mostly caused by the user
    namespace changes.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
    hyperv: Add buffer for extended info after the RNDIS response message.
    hyperv: Report actual status in receive completion packet
    hyperv: Remove extra allocated space for recv_pkt_list elements
    hyperv: Fix page buffer handling in rndis_filter_send_request()
    hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
    hyperv: Fix the max_xfer_size in RNDIS initialization
    vxlan: put UDP socket in correct namespace
    vxlan: Depend on CONFIG_INET
    sfc: Fix the reported priorities of different filter types
    sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
    sfc: Fix loopback self-test with separate_tx_channels=1
    sfc: Fix MCDI structure field lookup
    sfc: Add parentheses around use of bitfield macro arguments
    sfc: Fix null function pointer in efx_sriov_channel_type
    vxlan: virtual extensible lan
    igmp: export symbol ip_mc_leave_group
    netlink: add attributes to fdb interface
    tg3: unconditionally select HWMON support when tg3 is enabled.
    Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
    gre: fix sparse warning
    ...

    Linus Torvalds
     

11 Sep, 2012

1 commit

  • It is a frequent mistake to confuse the netlink port identifier with a
    process identifier. Try to reduce this confusion by renaming fields
    that hold port identifiers from pid to portid.

    I have carefully avoided changing the structures exported to
    userspace to avoid changing the userspace API.

    I have successfully built an allyesconfig kernel with this change.

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

10 Sep, 2012

1 commit

  • device_write only checks whether the request size is big enough, but it doesn't
    check if the size is too big.

    At that point, it also tries to allocate as much memory as the user has requested
    even if it's too much. This can lead to the OOM killer kicking in, or memory corruption
    if (count + 1) overflows.

    Signed-off-by: Sasha Levin
    Signed-off-by: David Teigland

    Sasha Levin
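
A minimal userspace sketch of the missing check, assuming hypothetical limits (REQ_MIN_SIZE and REQ_MAX_SIZE are illustrative; the actual bounds are not stated above). Rejecting oversized requests before allocating also guarantees that count + 1 cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical limits for illustration only. */
#define REQ_MIN_SIZE 16u
#define REQ_MAX_SIZE 4096u

/*
 * Return a zeroed buffer of count + 1 bytes (room for a trailing NUL),
 * or NULL if count is outside [REQ_MIN_SIZE, REQ_MAX_SIZE] or the
 * allocation fails. The caller frees the buffer.
 */
void *alloc_request(size_t count)
{
    if (count < REQ_MIN_SIZE || count > REQ_MAX_SIZE)
        return NULL;
    /* count <= REQ_MAX_SIZE here, so count + 1 cannot wrap around. */
    return calloc(1, count + 1);
}
```

The upper bound is checked before any arithmetic on count, so a request of SIZE_MAX is rejected rather than wrapping to a zero-byte allocation.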
     

13 Aug, 2012

1 commit


10 Aug, 2012

2 commits


09 Aug, 2012

3 commits

  • The in_recovery rw_semaphore has always been acquired and
    released by different threads by design. To work around
    the "BUG: bad unlock balance detected!" messages, adjust
    things so the dlm_recoverd thread always does both down_write
    and up_write.

    Signed-off-by: David Teigland

    David Teigland
     
  • Use DEFINE_SPINLOCK for global dlm_cb_seq_spin.

    Signed-off-by: David Teigland

    David Teigland
     
  • A deadlock sometimes occurs between dlm_controld closing
    a lowcomms connection through configfs and dlm_send looking
    up the address for a new connection in configfs.

    dlm_controld does a configfs rmdir which calls
    dlm_lowcomms_close which waits for dlm_send to
    cancel work on the workqueues.

    The dlm_send workqueue thread has called
    tcp_connect_to_sock which calls dlm_nodeid_to_addr
    which does a configfs lookup and blocks on a lock
    held by dlm_controld in the rmdir path.

    The solution here is to save the node addresses within
    the lowcomms code so that the lowcomms workqueue does
    not need to step through configfs to get a node address.

    dlm_controld:
    wait_for_completion+0x1d/0x20
    __cancel_work_timer+0x1b3/0x1e0
    cancel_work_sync+0x10/0x20
    dlm_lowcomms_close+0x4c/0xb0 [dlm]
    drop_comm+0x22/0x60 [dlm]
    client_drop_item+0x26/0x50 [configfs]
    configfs_rmdir+0x180/0x230 [configfs]
    vfs_rmdir+0xbd/0xf0
    do_rmdir+0x103/0x120
    sys_rmdir+0x16/0x20

    dlm_send:
    mutex_lock+0x2b/0x50
    get_comm+0x34/0x140 [dlm]
    dlm_nodeid_to_addr+0x18/0xd0 [dlm]
    tcp_connect_to_sock+0xf4/0x2d0 [dlm]
    process_send_sockets+0x1d2/0x260 [dlm]
    worker_thread+0x170/0x2a0

    Signed-off-by: David Teigland

    David Teigland
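
The idea of the fix, keeping node addresses in a table private to the communications code, can be sketched as follows. This is a userspace illustration under stated assumptions: the names, the fixed capacity, and the use of a pthread mutex are all hypothetical. The point is that the lookup path takes only its own lock and never the configuration lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define MAX_NODES 8   /* hypothetical capacity */

struct node_addr { int nodeid; uint32_t addr; int used; };

static struct node_addr addr_table[MAX_NODES];
static pthread_mutex_t addr_lock = PTHREAD_MUTEX_INITIALIZER;

/* Configuration path: remember a node's address locally. Returns 0 or -1. */
int addr_cache_add(int nodeid, uint32_t addr)
{
    int rv = -1;

    pthread_mutex_lock(&addr_lock);
    for (int i = 0; i < MAX_NODES; i++) {
        if (!addr_table[i].used) {
            addr_table[i] = (struct node_addr){ nodeid, addr, 1 };
            rv = 0;
            break;
        }
    }
    pthread_mutex_unlock(&addr_lock);
    return rv;
}

/* Send path: takes only addr_lock, never the configuration lock. */
int addr_cache_lookup(int nodeid, uint32_t *addr)
{
    int rv = -1;

    pthread_mutex_lock(&addr_lock);
    for (int i = 0; i < MAX_NODES; i++) {
        if (addr_table[i].used && addr_table[i].nodeid == nodeid) {
            *addr = addr_table[i].addr;
            rv = 0;
            break;
        }
    }
    pthread_mutex_unlock(&addr_lock);
    return rv;
}
```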
     

17 Jul, 2012

6 commits

  • I don't know exactly how, but in some cases, a dir
    record is not removed, or a new one is created when
    it shouldn't be. The result is that the dir node
    lookup returns a master node where the rsb does not
    exist. In this case, the master node will repeatedly
    return -EBADR for requests, and the lock requests will
    be stuck.

    Until all possible ways for this to happen can be
    eliminated, a simple and effective way to recover from
    this situation is for the supposed master node to send
    a standard remove message to the dir node when it
    receives a request for a resource it has no rsb for.

    Signed-off-by: David Teigland

    David Teigland
     
  • The process of rebuilding locks on a new master during
    recovery could re-order the locks on the convert queue,
    creating an "in place" conversion deadlock that would
    not be resolved. Fix this by not considering queue
    order when granting conversions after recovery.

    Signed-off-by: David Teigland

    David Teigland
     
  • Use wait_event_timeout to avoid using a timer
    directly.

    Signed-off-by: David Teigland

    David Teigland
     
  • It was possible for a remove message on an old
    rsb to be sent after a lookup message on a new
    rsb, where the rsbs were for the same resource
    name. This could lead to a missing directory
    entry for the new rsb.

    It is fixed by keeping a copy of the resource
    name being removed until after the remove has
    been sent. A lookup checks if this in-progress
    remove matches the name it is looking up.

    Signed-off-by: David Teigland

    David Teigland
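
A single-threaded sketch of the in-progress remove record described above. The function names are hypothetical, and what a lookup does on a match (for example, wait and retry) is an assumption, since the message only says that the lookup checks for a match.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NAME_MAX_LEN 64   /* illustrative limit on resource name length */

static char remove_name[NAME_MAX_LEN];
static size_t remove_len;   /* 0 means no remove is in progress */

/* Record a copy of the name before the remove message is sent. */
void remove_begin(const char *name, size_t len)
{
    assert(len > 0 && len <= NAME_MAX_LEN);
    memcpy(remove_name, name, len);
    remove_len = len;
}

/* Clear the record once the remove message has been sent. */
void remove_end(void)
{
    remove_len = 0;
}

/* A lookup calls this to see whether its name is currently being removed. */
bool remove_in_progress(const char *name, size_t len)
{
    return len > 0 && remove_len == len &&
           memcmp(remove_name, name, len) == 0;
}
```

A real implementation would need a lock around the record; it is omitted here to keep the ordering rule visible.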
     
  • When a large number of resources are being recovered,
    a linear search of the recover_list takes a long time.
    Use an idr in place of a list.

    Signed-off-by: David Teigland

    David Teigland
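
To illustrate why an id-indexed structure helps here, the sketch below replaces a list walk with a direct lookup by integer id. It is not the kernel idr API: the fixed-size array and the function names are hypothetical stand-ins, and the id allocation here is itself a linear scan, which a real idr avoids.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 1024   /* hypothetical fixed capacity */

static void *table[TABLE_SIZE];   /* slot 0 is never used, so ids start at 1 */

/* Store ptr and return its id (>= 1), or -1 if the table is full. */
int table_alloc(void *ptr)
{
    assert(ptr != NULL);
    for (int id = 1; id < TABLE_SIZE; id++) {
        if (!table[id]) {
            table[id] = ptr;
            return id;
        }
    }
    return -1;
}

/* Constant-time lookup by id, in place of searching a list. */
void *table_find(int id)
{
    if (id < 1 || id >= TABLE_SIZE)
        return NULL;
    return table[id];
}

/* Release an id so it can be reused. */
void table_remove(int id)
{
    if (id >= 1 && id < TABLE_SIZE)
        table[id] = NULL;
}
```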
     
  • Remove the dir hash table (dirtbl), and use
    the rsb hash table (rsbtbl) as the resource
    directory. The dirtbl has always been an unnecessary
    duplication of information.

    This improves efficiency by using a single rsbtbl
    lookup in many cases where both rsbtbl and dirtbl
    lookups were needed previously.

    This eliminates the need to handle cases of rsbtbl
    and dirtbl being out of sync.

    In many cases there will be memory savings because
    the dir hash table no longer exists.

    Signed-off-by: David Teigland

    David Teigland
     

15 May, 2012

1 commit


03 May, 2012

1 commit

  • The "nodir" mode (statically assign master nodes instead
    of using the resource directory) has always been highly
    experimental, and never seriously used. This commit
    fixes a number of problems, making nodir much more usable.

    - Major change to recovery: recover all locks and restart
    all in-progress operations after recovery. In some
    cases it's not possible to know which in-progress locks
    to recover, so recover all. (Most require recovery
    in nodir mode anyway since rehashing changes most
    master nodes.)

    - Change the way nodir mode is enabled, from a command
    line mount arg passed through gfs2, into a sysfs
    file managed by dlm_controld, consistent with the
    other config settings.

    - Allow recovering MSTCPY locks on an rsb that has not
    yet been turned into a master copy.

    - Ignore RCOM_LOCK and RCOM_LOCK_REPLY recovery messages
    from a previous, aborted recovery cycle. Base this
    on the local recovery status not being in the state
    where any nodes should be sending LOCK messages for the
    current recovery cycle.

    - Hold rsb lock around dlm_purge_mstcpy_locks() because it
    may run concurrently with dlm_recover_master_copy().

    - Maintain highbast on process-copy lkb's (in addition to
    the master as is usual), because the lkb can switch
    back and forth between being a master and being a
    process copy as the master node changes in recovery.

    - When recovering MSTCPY locks, flag rsb's that have
    non-empty convert or waiting queues for granting
    at the end of recovery. (Rename flag from LOCKS_PURGED
    to RECOVER_GRANT and similar for the recovery function,
    because it's not only resources with purged locks
    that need a grant attempt.)

    - Replace a couple of unnecessary assertion panics with
    error messages.

    Signed-off-by: David Teigland

    David Teigland
     

27 Apr, 2012

5 commits

  • Change some existing error/debug messages to
    collect more useful information, and add
    some new error/debug messages to address
    recently found problems.

    Signed-off-by: David Teigland

    David Teigland
     
  • If the rsb is found in the "keep" tree, but is
    not the right type (i.e. not MASTER), we can
    return immediately with the result. There's
    no point in going on to search the "toss" list
    as if we hadn't found it.

    Signed-off-by: David Teigland

    David Teigland
     
  • Unify the checking for both types of ignored
    rcom messages, and replace the two log_debug
    statements with a single, rate limited debug
    message.

    Signed-off-by: David Teigland

    David Teigland
     
  • An outstanding remote operation (an lkb on the "waiter"
    list) could sometimes miss being resent during recovery.
    The decision was based on the lkb_nodeid field, which
    could have changed during an earlier aborted recovery,
    so it no longer represents the actual remote destination.
    The lkb_wait_nodeid is always the actual remote node,
    so it is the best value to use.

    Signed-off-by: David Teigland

    David Teigland
     
  • During lowcomms shutdown, a new connection could possibly
    be created, and attempt to use a workqueue that's been
    destroyed. Similarly, during startup, a new connection
    could attempt to use a workqueue that's not been set up
    yet. Add a global variable to indicate when new connections
    are allowed.

    Based on patch by: Christine Caulfield

    Reported-by: dann frazier
    Reviewed-by: dann frazier
    Signed-off-by: David Teigland

    David Teigland
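
The guard described above can be sketched as a flag that is set only while the workqueues exist. This is a single-threaded illustration with hypothetical names; real code would need synchronization between the flag and connection creation.

```c
#include <assert.h>
#include <stdbool.h>

static bool allow_new_conn;    /* false until startup has completed */
static bool workqueue_ready;   /* stand-in for the real workqueues */

/* Startup: set up the workqueue first, then allow connections. */
void comms_start(void)
{
    workqueue_ready = true;
    allow_new_conn = true;
}

/* Shutdown: forbid new connections first, then tear down the workqueue. */
void comms_stop(void)
{
    allow_new_conn = false;
    workqueue_ready = false;
}

/* Returns 0 on success, -1 if new connections are not allowed right now. */
int new_connection(void)
{
    if (!allow_new_conn)
        return -1;
    /* With the ordering above, the workqueue must exist at this point. */
    assert(workqueue_ready);
    return 0;
}
```

The ordering inside comms_start and comms_stop is what matters: the flag is the last thing set on startup and the first thing cleared on shutdown.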
     

24 Apr, 2012

2 commits


06 Apr, 2012

1 commit

  • Many users of debugfs copy the implementation of default_open() when
    they want to support a custom read/write function op. This leads to a
    proliferation of the default_open() implementation across the entire
    tree.

    Now that the common implementation has been consolidated into libfs we
    can replace all the users of this function with simple_open().

    This replacement was done with the following semantic patch:

    @ open @
    identifier open_f != simple_open;
    identifier i, f;
    @@
    -int open_f(struct inode *i, struct file *f)
    -{
    (
    -if (i->i_private)
    -f->private_data = i->i_private;
    |
    -f->private_data = i->i_private;
    )
    -return 0;
    -}

    @ has_open depends on open @
    identifier fops;
    identifier open.open_f;
    @@
    struct file_operations fops = {
    ...
    -.open = open_f,
    +.open = simple_open,
    ...
    };

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Stephen Boyd
    Cc: Greg Kroah-Hartman
    Cc: Al Viro
    Cc: Julia Lawall
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Boyd
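
For reference, the behavior that each removed open function implemented can be shown in a userspace mock. The two structs below are stand-ins containing only the fields used, and the function body follows the first branch matched by the semantic patch above.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel structures; only the fields used here. */
struct inode { void *i_private; };
struct file  { void *private_data; };

/* Copy the inode's private pointer to the file, if one is set. */
int simple_open(struct inode *inode, struct file *file)
{
    if (inode->i_private)
        file->private_data = inode->i_private;
    return 0;
}
```

After the call, read and write handlers can reach the per-file data through file->private_data without each driver defining its own open function.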
     

22 Mar, 2012

1 commit


21 Mar, 2012

1 commit


09 Mar, 2012

2 commits


11 Jan, 2012

1 commit


04 Jan, 2012

3 commits

  • These new callbacks notify the dlm user about lock recovery.
    GFS2, and possibly others, need to be aware of when the dlm
    will be doing lock recovery for a failed lockspace member.

    In the past, this coordination has been done between dlm and
    file system daemons in userspace, which then direct their
    kernel counterparts. These callbacks allow the same
    coordination directly, and more simply.

    Signed-off-by: David Teigland

    David Teigland
     
  • Slot numbers are assigned to nodes when they join the lockspace.
    The slot number chosen is the minimum unused value starting at 1.
    Once a node is assigned a slot, that slot number will not change
    while the node remains a lockspace member. If the node leaves
    and rejoins it can be assigned a new slot number.

    A new generation number is also added to a lockspace. It is
    set and incremented during each recovery along with the slot
    collection/assignment.

    The slot numbers will be passed to gfs2 which will use them as
    journal id's.

    Signed-off-by: David Teigland

    David Teigland
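
The slot selection rule, the minimum unused value starting at 1, can be sketched as below. The function name and the array representation of the slots in use are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the smallest slot number >= 1 that does not appear in
 * used[0..n-1]. The loop ends by slot n + 1 at the latest, because
 * n entries can occupy at most n distinct slots.
 */
int min_unused_slot(const int *used, size_t n)
{
    assert(used != NULL || n == 0);
    for (int slot = 1; ; slot++) {
        bool taken = false;

        for (size_t i = 0; i < n; i++) {
            if (used[i] == slot) {
                taken = true;
                break;
            }
        }
        if (!taken)
            return slot;
    }
}
```

With slots 1, 2 and 4 in use, a joining node would be given slot 3, which is how a node that leaves and rejoins can end up with a different slot than before.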
     
  • Put all the calls to recovery barriers in the same function
    to clarify where they each happen. Should not change any behavior.
    Also modify some recovery debug lines to make them consistent.

    Signed-off-by: David Teigland

    David Teigland
     

23 Nov, 2011

1 commit


19 Nov, 2011

1 commit

  • Change the linked lists to rb_tree's in the rsb
    hash table to speed up searches. Slow rsb searches
    were having a large impact on gfs2 performance due
    to the large number of dlm locks gfs2 uses.

    Signed-off-by: Bob Peterson
    Signed-off-by: David Teigland

    Bob Peterson
     

26 Jul, 2011

1 commit

  • * 'for-3.1' of git://linux-nfs.org/~bfields/linux:
    nfsd: don't break lease on CLAIM_DELEGATE_CUR
    locks: rename lock-manager ops
    nfsd4: update nfsv4.1 implementation notes
    nfsd: turn on reply cache for NFSv4
    nfsd4: call nfsd4_release_compoundargs from pc_release
    nfsd41: Deny new lock before RECLAIM_COMPLETE done
    fs: locks: remove init_once
    nfsd41: check the size of request
    nfsd41: error out when client sets maxreq_sz or maxresp_sz too small
    nfsd4: fix file leak on open_downgrade
    nfsd4: remember to put RW access on stateid destruction
    NFSD: Added TEST_STATEID operation
    NFSD: added FREE_STATEID operation
    svcrpc: fix list-corrupting race on nfsd shutdown
    rpc: allow autoloading of gss mechanisms
    svcauth_unix.c: quiet sparse noise
    svcsock.c: include sunrpc.h to quiet sparse noise
    nfsd: Remove deprecated nfsctl system call and related code.
    NFSD: allow OP_DESTROY_CLIENTID to be only op in COMPOUND

    Fix up trivial conflicts in Documentation/feature-removal-schedule.txt

    Linus Torvalds
     

21 Jul, 2011

1 commit

  • Both the filesystem and the lock manager can associate operations with a
    lock. Confusingly, one of them (fl_release_private) actually has the
    same name in both operation structures.

    It would save some confusion to give the lock-manager ops different
    names.

    Signed-off-by: J. Bruce Fields

    J. Bruce Fields