05 Mar, 2011

3 commits

  • The standby logic used to be pretty dependent on the work requeueing
    behavior that changed when we switched to WQ_NON_REENTRANT. It was also
    very fragile.

    Restructure things so that:
    - We clear WRITE_PENDING when we set STANDBY. This ensures we will
    requeue work when we wake up later.
    - con_work backs off if STANDBY is set. There is nothing to do if we are
    in standby.
    - clear_standby() helper is called by both con_send() and con_keepalive(),
    the two actions that can wake us up again. Move the connect_seq++
    logic here.
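
    The restructured flow can be sketched in userspace roughly as follows.
    This is a simplified model, not the messenger code: the struct, the flag
    values, enter_standby() and the work_queued counter are assumptions made
    for illustration. Only the flag names and clear_standby(), con_send() and
    con_work() come from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag names mirror the commit text; the bit values are illustrative. */
enum {
    WRITE_PENDING = 1u << 0,
    STANDBY       = 1u << 1,
};

/* Hypothetical, stripped-down connection state. */
struct conn {
    unsigned flags;
    unsigned connect_seq;
    unsigned work_queued;   /* counts how often work was queued */
};

/* Enter standby: drop WRITE_PENDING so a later send requeues work. */
void enter_standby(struct conn *c)
{
    c->flags &= ~WRITE_PENDING;
    c->flags |= STANDBY;
}

/* Shared wake-up helper for the send and keepalive paths. */
void clear_standby(struct conn *c)
{
    if (c->flags & STANDBY) {
        c->flags &= ~STANDBY;
        c->connect_seq++;
    }
}

/* Queue work only when WRITE_PENDING goes from clear to set. */
void con_send(struct conn *c)
{
    clear_standby(c);
    if (!(c->flags & WRITE_PENDING)) {
        c->flags |= WRITE_PENDING;
        c->work_queued++;
    }
}

/* Worker: returns false without doing anything while in standby. */
bool con_work(struct conn *c)
{
    /* In this model STANDBY and WRITE_PENDING are never set together. */
    assert((c->flags & (STANDBY | WRITE_PENDING)) != (STANDBY | WRITE_PENDING));
    if (c->flags & STANDBY)
        return false;
    c->flags &= ~WRITE_PENDING;
    return true;
}
```

    Because WRITE_PENDING is cleared on the way into standby, the next
    con_send() sees the bit clear and queues work again.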

    Signed-off-by: Sage Weil

    Sage Weil
     
  • There was some broken keepalive code using a dead variable. Shift to using
    the proper bit flag.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • With commit f363e45f we replaced a bunch of hacky workqueue mutual
    exclusion logic with the WQ_NON_REENTRANT flag. One piece of fallout is
    that the exponential backoff breaks in certain cases:

    * con_work attempts to connect.
    * we get an immediate failure, and the socket state change handler queues
    immediate work.
    * con_work calls con_fault, we decide to back off, but can't queue delayed
    work.

    In this case, we add a BACKOFF bit to make con_work reschedule delayed work
    next time it runs (which should be immediately).
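
    A minimal userspace model of the sequence above, assuming a work item
    that cannot be queued again while it is already pending. The struct,
    queue_delayed() and the delay arithmetic are hypothetical stand-ins, not
    the kernel workqueue API.

```c
#include <assert.h>
#include <stdbool.h>

enum { BACKOFF = 1u << 0 };   /* bit value is illustrative */

/* Hypothetical connection state for this sketch. */
struct conn {
    unsigned flags;
    bool work_pending;           /* a work item is already queued */
    unsigned long delay;         /* current backoff delay, arbitrary units */
    unsigned long queued_delay;  /* delay the queued item was given */
};

/* Stand-in for queueing delayed work: fails if the item is pending. */
bool queue_delayed(struct conn *c, unsigned long delay)
{
    if (c->work_pending)
        return false;
    c->work_pending = true;
    c->queued_delay = delay;
    return true;
}

/* Fault path: grow the delay; if we cannot queue, remember we owe a backoff. */
void con_fault(struct conn *c)
{
    c->delay = c->delay ? c->delay * 2 : 1;
    if (!queue_delayed(c, c->delay))
        c->flags |= BACKOFF;
}

/* Worker: returns false if it only rescheduled itself for the backoff. */
bool con_work(struct conn *c)
{
    c->work_pending = false;     /* we are running, so no longer pending */
    if (c->flags & BACKOFF) {
        c->flags &= ~BACKOFF;
        assert(!c->work_pending);
        queue_delayed(c, c->delay);
        return false;
    }
    return true;
}
```

    The immediate run of con_work does no connection work; it only converts
    the owed backoff into properly delayed work.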

    Signed-off-by: Sage Weil

    Sage Weil
     

04 Mar, 2011

2 commits

  • If we mark the connection CLOSED we will give up trying to reconnect to
    this server instance. That is appropriate for things like a protocol
    version mismatch that won't change until the server is restarted, at which
    point we'll get a new addr and reconnect. An authorization failure like
    this is probably due to the server not properly rotating its secret keys,
    however, and should be treated as transient so that the normal backoff and
    retry behavior kicks in.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • get_user_pages() can return fewer pages than we ask for. We were returning
    a bogus pointer/error code in that case. Instead, loop until we get all
    the pages we want or get an error we can return to the caller.
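
    The retry loop has roughly the following shape. This is a userspace
    sketch, not the kernel code: pin_fn, pin_all_pages() and the demo
    provider are hypothetical names, and the provider merely imitates a
    call that may return fewer pages than requested.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical provider: pins up to `want` pages starting at page index
 * `start`, returns how many it pinned (possibly fewer) or a negative errno.
 */
typedef int (*pin_fn)(size_t start, int want, void **pages, void *ctx);

/*
 * Call pin() until all `num` pages are obtained or an error occurs.
 * *got always holds the number of pages pinned so far, so the caller
 * knows what to release on failure.
 */
int pin_all_pages(pin_fn pin, void *ctx, int num, void **pages, int *got)
{
    *got = 0;
    while (*got < num) {
        int rc = pin((size_t)*got, num - *got, pages + *got, ctx);
        if (rc < 0)
            return rc;
        if (rc == 0)
            return -EIO;        /* no progress: treated as an error here */
        assert(rc <= num - *got);
        *got += rc;
    }
    return 0;
}

/* Demo provider: hands out at most two 4096-byte pages per call. */
int pin_at_most_two(size_t start, int want, void **pages, void *ctx)
{
    char *base = ctx;
    int n = want < 2 ? want : 2;

    for (int i = 0; i < n; i++)
        pages[i] = base + (start + (size_t)i) * 4096;
    return n;
}
```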

    Signed-off-by: Sage Weil

    Sage Weil
     

22 Feb, 2011

1 commit


26 Jan, 2011

2 commits

  • Pass errors from writing to the socket up the stack. If we get -EAGAIN,
    return 0 from the helper to simplify the callers' checks.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • If we get EAGAIN when trying to read from the socket, it is not an error.
    Return 0 from the helper in this case to simplify the error handling cases
    in the caller (indirectly, try_read).

    Fix try_read to pass any error to its caller (con_work) instead of almost
    always returning 0. This lets us respond to things like socket
    disconnects.
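
    The convention can be illustrated with a small userspace sketch. The
    names filter_eagain(), recv_fn and scripted_recv() are hypothetical, and
    this try_read() is a simplified loop, not the messenger function of the
    same name.

```c
#include <assert.h>
#include <errno.h>

/* Helper convention from the text: "would block" is not an error. */
int filter_eagain(int rc)
{
    return rc == -EAGAIN ? 0 : rc;
}

/* Hypothetical low-level read: bytes read, or a negative errno. */
typedef int (*recv_fn)(void *ctx);

/*
 * Read until the source has nothing more for now (returns 0) or a real
 * error occurs (returns the negative errno for the caller to handle).
 */
int try_read(recv_fn rd, void *ctx, int *total)
{
    assert(rd && total);
    for (;;) {
        int rc = filter_eagain(rd(ctx));

        if (rc <= 0)
            return rc;
        *total += rc;
    }
}

/* Demo source: replays a scripted list of return codes. */
struct script {
    const int *rc;
    int pos;
};

int scripted_recv(void *ctx)
{
    struct script *s = ctx;

    return s->rc[s->pos++];
}
```

    With this shape the caller needs only one check: a negative return is a
    genuine failure such as a disconnect, and anything else means carry on.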

    Signed-off-by: Sage Weil

    Sage Weil
     

14 Jan, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    rbd: fix cleanup when trying to mount inexistent image
    net/ceph: make ceph_msgr_wq non-reentrant
    ceph: fsc->*_wq's aren't used in memory reclaim path
    ceph: Always free allocated memory in osdmap_decode()
    ceph: Makefile: Remove unnessary code
    ceph: associate requests with opening sessions
    ceph: drop redundant r_mds field
    ceph: implement DIRLAYOUTHASH feature to get dir layout from MDS
    ceph: add dir_layout to inode

    Linus Torvalds
     

13 Jan, 2011

3 commits

  • The ceph messenger code does a rather complex dance around the
    multithreaded workqueue to make sure the same work item isn't executed
    concurrently on different CPUs. This restriction can be provided by a
    workqueue with WQ_NON_REENTRANT.

    Make ceph_msgr_wq a non-reentrant workqueue with the default concurrency
    level and remove the QUEUED/BUSY logic.

    * This removes backoff handling in con_work() but it couldn't reliably
    block execution of con_work() to begin with - queue_con() can be
    called after the work started but before BUSY is set. It seems that
    it was an optimization for a rather cold path and can be safely
    removed.

    * The number of concurrent work items is bound by the number of
    connections and connections are independent from each other. With
    the default concurrency level, different connections will be
    executed independently.

    Signed-off-by: Tejun Heo
    Cc: Sage Weil
    Cc: ceph-devel@vger.kernel.org
    Signed-off-by: Sage Weil

    Tejun Heo
     
  • Always free memory allocated to 'pi' in
    net/ceph/osdmap.c::osdmap_decode().

    Signed-off-by: Jesper Juhl
    Signed-off-by: Sage Weil

    Jesper Juhl
     
  • Add a ceph_dir_layout to the inode, and calculate dentry hash values based
    on the parent directory's specified dir_hash function. This is needed
    because the old default Linux dcache hash function is extremely weak and
    leads to a poor distribution of files among dir fragments.
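
    Selecting the hash per directory might look like the sketch below. The
    enum ids and function names are hypothetical. hash_linux() is modeled on
    the classic dcache-style string hash, and FNV-1a is used purely as a
    stand-in for a stronger alternative; it is not claimed to be the function
    this commit introduces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ids for the hash named in a directory's layout. */
enum dir_hash {
    DIR_HASH_LINUX = 1,
    DIR_HASH_FNV1A = 2,
};

/* Modeled on the old dcache-style hash: cheap, but mixes bits poorly. */
uint32_t hash_linux(const char *s, size_t len)
{
    uint32_t h = 0;

    while (len--) {
        unsigned char c = (unsigned char)*s++;

        h = (h + ((uint32_t)c << 4) + (c >> 4)) * 11;
    }
    return h;
}

/* Stand-in stronger hash (32-bit FNV-1a). */
uint32_t hash_fnv1a(const char *s, size_t len)
{
    uint32_t h = 2166136261u;

    while (len--) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Hash a dentry name with the parent's chosen function; -1 if unknown. */
int dentry_hash(enum dir_hash type, const char *name, size_t len,
                uint32_t *out)
{
    assert(name && out);
    switch (type) {
    case DIR_HASH_LINUX:
        *out = hash_linux(name, len);
        return 0;
    case DIR_HASH_FNV1A:
        *out = hash_fnv1a(name, len);
        return 0;
    default:
        return -1;
    }
}
```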

    Signed-off-by: Sage Weil

    Sage Weil
     

27 Dec, 2010

1 commit


21 Dec, 2010

1 commit


18 Dec, 2010

2 commits


14 Dec, 2010

1 commit


09 Dec, 2010

1 commit


30 Nov, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
    af_unix: limit recursion level
    pch_gbe driver: The wrong of initializer entry
    pch_gbe dreiver: chang author
    ucc_geth: fix ucc halt problem in half duplex mode
    inet: Fix __inet_inherit_port() to correctly increment bsockets and num_owners
    ehea: Add some info messages and fix an issue
    hso: fix disable_net
    NET: wan/x25_asy, move lapb_unregister to x25_asy_close_tty
    cxgb4vf: fix setting unicast/multicast addresses ...
    net, ppp: Report correct error code if unit allocation failed
    DECnet: don't leak uninitialized stack byte
    au1000_eth: fix invalid address accessing the MAC enable register
    dccp: fix error in updating the GAR
    tcp: restrict net.ipv4.tcp_adv_win_scale (#20312)
    netns: Don't leak others' openreq-s in proc
    Net: ceph: Makefile: Remove unnessary code
    vhost/net: fix rcu check usage
    econet: fix CVE-2010-3848
    econet: fix CVE-2010-3850
    econet: disallow NULL remote addr for sendmsg(), fixes CVE-2010-3849
    ...

    Linus Torvalds
     

28 Nov, 2010

1 commit


24 Nov, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
    of/phylib: Use device tree properties to initialize Marvell PHYs.
    phylib: Add support for Marvell 88E1149R devices.
    phylib: Use common page register definition for Marvell PHYs.
    qlge: Fix incorrect usage of module parameters and netdev msg level
    ipv6: fix missing in6_ifa_put in addrconf
    SuperH IrDA: correct Baud rate error correction
    atl1c: Fix hardware type check for enabling OTP CLK
    net: allow GFP_HIGHMEM in __vmalloc()
    bonding: change list contact to netdev@vger.kernel.org
    e1000: fix screaming IRQ

    Linus Torvalds
     

23 Nov, 2010

1 commit


22 Nov, 2010

1 commit

  • We forgot to use __GFP_HIGHMEM in several __vmalloc() calls.

    In ceph, add the missing flag.

    In fib_trie.c, xfrm_hash.c and request_sock.c, using vzalloc() is
    cleaner and allows using HIGHMEM pages as well.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Nov, 2010

3 commits

  • The alignment used for reading data into or out of pages used to be taken
    from the data_off field in the message header. This only worked as long
    as the page alignment matched the object offset, breaking direct io to
    non-page aligned offsets.

    Instead, explicitly specify the page alignment next to the page vector
    in the ceph_msg struct, and use that instead of the message header (which
    probably shouldn't be trusted). The alloc_msg callback is responsible for
    filling in this field properly when it sets up the page vector.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • We used to infer alignment of IOs within a page based on the file offset,
    which assumed they matched. This broke with direct IO that was not aligned
    to pages (e.g., 512-byte aligned IO). We were also trusting the alignment
    specified in the OSD reply, which could have been adjusted by the server.

    Explicitly specify the page alignment when setting up OSD IO requests.
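
    The difference between inferring and specifying the alignment can be
    sketched as follows, assuming 4 KiB pages. The struct and function names
    are hypothetical, and taking the alignment from the buffer address for
    direct IO is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

/* Hypothetical request: the alignment travels with the request. */
struct io_req {
    uint64_t file_off;
    uint64_t len;
    unsigned page_align;    /* offset of the first byte within its page */
};

/* Buffered IO: page cache pages line up with file offsets. */
struct io_req buffered_req(uint64_t off, uint64_t len)
{
    struct io_req r = { off, len, (unsigned)(off & (SKETCH_PAGE_SIZE - 1)) };

    return r;
}

/* Direct IO: the alignment comes from the buffer, not the file offset. */
struct io_req direct_req(uint64_t off, uint64_t len, uintptr_t buf_addr)
{
    struct io_req r = { off, len,
                        (unsigned)(buf_addr & (SKETCH_PAGE_SIZE - 1)) };

    return r;
}

/* Number of pages the data spans, given the explicit alignment. */
unsigned pages_for(const struct io_req *r)
{
    assert(r->len > 0);
    return (unsigned)((r->page_align + r->len + SKETCH_PAGE_SIZE - 1)
                      >> SKETCH_PAGE_SHIFT);
}
```

    For a 512-byte direct write at file offset 512 from a page-aligned
    buffer, the file offset suggests an in-page offset of 512 while the data
    actually starts at offset 0 of its page.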

    Signed-off-by: Sage Weil

    Sage Weil
     
  • The offset/length arguments aren't used.

    Signed-off-by: Sage Weil

    Sage Weil
     

02 Nov, 2010

1 commit

  • If the client gets out of sync with the server message sequence number, we
    normally skip low seq messages (ones we already received). The skip code
    was also incrementing the expected seq, such that all subsequent messages
    also appeared old and got skipped, eventually causing a timeout on the osd
    connection. This resulted in some lagging requests and console messages
    like

    [233480.882885] ceph: skipping osd22 10.138.138.13:6804 seq 2016, expected 2017
    [233480.882919] ceph: skipping osd22 10.138.138.13:6804 seq 2017, expected 2018
    [233480.882963] ceph: skipping osd22 10.138.138.13:6804 seq 2018, expected 2019
    [233480.883488] ceph: skipping osd22 10.138.138.13:6804 seq 2019, expected 2020
    [233485.219558] ceph: skipping osd22 10.138.138.13:6804 seq 2020, expected 2021
    [233485.906595] ceph: skipping osd22 10.138.138.13:6804 seq 2021, expected 2022
    [233490.379536] ceph: skipping osd22 10.138.138.13:6804 seq 2022, expected 2023
    [233495.523260] ceph: skipping osd22 10.138.138.13:6804 seq 2023, expected 2024
    [233495.923194] ceph: skipping osd22 10.138.138.13:6804 seq 2024, expected 2025
    [233500.534614] ceph: tid 6023602 timed out on osd22, will reset osd
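
    The intended skip behavior can be modeled as below. The struct, enum and
    function name are hypothetical, and reporting a gap for a too-new seq is
    an assumption of this sketch. The point is that the skip branch leaves
    the expected sequence number alone.

```c
#include <assert.h>
#include <stdint.h>

enum msg_verdict { MSG_ACCEPT, MSG_SKIP, MSG_GAP };

/* Hypothetical connection state: last sequence number accepted. */
struct conn {
    uint64_t in_seq;
};

/* Decide what to do with an incoming message carrying `seq`. */
enum msg_verdict handle_seq(struct conn *c, uint64_t seq)
{
    int64_t ahead;

    assert(c);
    ahead = (int64_t)(seq - c->in_seq);
    if (ahead < 1)
        return MSG_SKIP;    /* already seen: skip, in_seq stays unchanged */
    if (ahead > 1)
        return MSG_GAP;     /* something was lost in between */
    c->in_seq = seq;        /* exactly the next message: accept it */
    return MSG_ACCEPT;
}
```

    With in_seq at 2016, a duplicate 2016 is skipped and the following 2017
    is accepted, instead of every later message being reported as old.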

    Reported-by: Theodore Ts'o
    Signed-off-by: Sage Weil

    Sage Weil
     

21 Oct, 2010

5 commits

  • Decrement the free page counter when removing a page from the free_list.

    Signed-off-by: Sage Weil

    Sage Weil
     
  • This only happened when parse_extra_token was not passed
    to ceph_parse_option() (hence, only happened in rbd).

    Signed-off-by: Yehuda Sadeh

    Yehuda Sadeh
     
  • These facilitate preallocation of pages so that we can encode into the pagelist
    in an atomic context.
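
    The idea is to do all allocation up front, where sleeping is allowed,
    and later only pop from a free list. A userspace sketch with malloc()
    standing in for page allocation; the struct and function names here are
    hypothetical, not the pagelist API.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* A free "page" is linked through its own first bytes. */
struct page_node {
    struct page_node *next;
};

struct pagelist {
    struct page_node *free_list;
    unsigned num_free;
};

/* May allocate: call this before entering the atomic section. */
int pagelist_reserve(struct pagelist *pl, unsigned count)
{
    while (pl->num_free < count) {
        struct page_node *n = malloc(4096);

        if (!n)
            return -ENOMEM;
        n->next = pl->free_list;
        pl->free_list = n;
        pl->num_free++;
    }
    return 0;
}

/* Never allocates: usable where blocking is not allowed. */
struct page_node *pagelist_take(struct pagelist *pl)
{
    struct page_node *n = pl->free_list;

    /* The counter must agree with the list at all times. */
    assert((n == NULL) == (pl->num_free == 0));
    if (!n)
        return NULL;
    pl->free_list = n->next;
    pl->num_free--;
    return n;
}

/* Return any unused reserved pages. */
void pagelist_release(struct pagelist *pl)
{
    struct page_node *n;

    while ((n = pagelist_take(pl)) != NULL)
        free(n);
}
```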

    Signed-off-by: Greg Farnum
    Signed-off-by: Sage Weil

    Greg Farnum
     
  • The rados block device (rbd), based on osdblk, creates a block device
    that is backed by objects stored in the Ceph distributed object storage
    cluster. Each device consists of a single metadata object and data
    striped over many data objects.

    The rbd driver supports read-only snapshots.

    Signed-off-by: Yehuda Sadeh
    Signed-off-by: Sage Weil

    Yehuda Sadeh
     
  • This factors out protocol and low-level storage parts of ceph into a
    separate libceph module living in net/ceph and include/linux/ceph. This
    is mostly a matter of moving files around. However, a few key pieces
    of the interface change as well:

    - ceph_client becomes ceph_fs_client and ceph_client, where the latter
    captures the mon and osd clients, and the fs_client gets the mds client
    and file system specific pieces.
    - Mount option parsing and debugfs setup are correspondingly broken into
    two pieces.
    - The mon client gets a generic handler callback for otherwise unknown
    messages (mds map, in this case).
    - The basic supported/required feature bits can be expanded (and are by
    ceph_fs_client).

    No functional change, aside from some subtle error handling cases that got
    cleaned up in the refactoring process.

    Signed-off-by: Sage Weil

    Yehuda Sadeh