Jepsen Testing

December 19, 2020

The article on linearizability theory covered the theoretical foundations of Jepsen testing. This article shows how to write and run a simple Jepsen test.

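1. Introduction to Clojure

Jepsen is written in Clojure, so being able to read Clojure is a prerequisite for understanding the internals of the Jepsen framework or the Jepsen test code shipped by other open source projects. Clojure is a functional programming language that runs on the JVM and interoperates well with Java; the linked article covers more of its advantages. In the name Clojure, the letters C, L, and J stand for C#, Lisp, and Java, and the spelling is also a play on the word "closure". Besides Jepsen, another well-known open source system built with Clojure is Storm, and there is a write-up explaining why Storm chose Clojure.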

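Aphyr, the author of Jepsen, has also written an introductory article on Clojure.

A few recommended resources for getting started with Clojure:

clojure-by-example: working through it with a Clojure REPL at hand should get you up to speed faster. Section 2 describes how to prepare a Clojure environment.

Reading Clojure Characters: Clojure has a lot of syntactic sugar, and its many symbols tend to confuse beginners. This article summarizes them.

Clojure API documentation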

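2. Setting up the Jepsen environment

Running a Jepsen test requires a Java and a Clojure runtime. Installing lein (Leiningen, the Clojure build tool) takes care of the Clojure side; lein itself needs a JDK to be installed. Following the Dockerfile in the Jepsen source tree, we can build a Docker image that contains every dependency needed to run Jepsen tests and that copies the Jepsen source code into /jepsen. With that image we can start a container directly on the test machine and run Jepsen tests inside it.

Start the container and run the following commands: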

# Start the container
sudo docker run -ti -d --hostname=jepsen_control --name=jepsen_control docker_image /usr/sbin/init
# Enter the container
sudo docker exec -ti jepsen_control bash

# Change to the demo code
cd /jepsen/jepsen.etcdemo

# Start the Clojure REPL
lein repl

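The Clojure REPL lets you run small pieces of sample code, which helps when learning the language.

2.3 Running a Jepsen test

2.3.1 Starting the control node and DB node containers

Running a Jepsen test takes at least two Docker containers: one acts as the control node, the other as the DB node.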

# Start the control node
sudo docker run -ti -d --hostname=jepsen_control --name=jepsen_control docker_image /usr/sbin/init

# Start one DB node
sudo docker run -ti -d --hostname=n1 --name=n1 docker_image /usr/sbin/init

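2.3.2 Environment configuration

A few more settings are needed before the Jepsen test can run:

1) Make sure SSH from the control node to the DB node is configured correctly. For details, see the FAQ section of github.com/jepsen-io/je.

Jepsen's default SSH settings are shown below. To rely on these defaults, the root user's password on the DB node must also be root.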

{:username "root",
  :password "root",
  :strict-host-key-checking false,
  :private-key-path nil}

2) Edit /etc/hosts in both containers and add the hostname and IP address of each container, for example:

#cat /etc/hosts
127.0.0.1       localhost
......
192.168.5.10    jepsen_control
192.168.5.11    n1

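2.3.3 Running the test

The /jepsen/jepsen.etcdemo directory in the container holds a default demo, which at this point is only a bare skeleton. The official Jepsen tutorial starts from this skeleton, so you can follow the tutorial step by step in this directory.

We can run it directly from inside the control node container.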

docker exec -ti jepsen_control bash
cd /jepsen/jepsen.etcdemo
lein run test -n n1

This runs the simplest possible Jepsen test, one that does not do anything yet. The run should end with output like this:

INFO [2018-06-11 11:08:22,258] jepsen results - jepsen.store Wrote /jepsen/jepsen.etcdemo/store/noop/20180611T110821.000+0800/results.edn
INFO [2018-06-11 11:08:22,260] main - jepsen.core {:valid? true}


Everything looks good! ヽ(‘ー`)ノ

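3. Jepsen by example

The official Jepsen tutorial is very detailed. Reading it from start to finish and then trying the steps yourself is strongly recommended and gives a much deeper understanding of Jepsen.

3.1 Implementing DB and Client

Here we simplify the tutorial example to fit our needs. The Jepsen framework consists of the following main parts:

Generator, DB, Client, Model, Checker

When hooking a real system up to Jepsen, the two interfaces that must be implemented are DB and Client. DB deploys and prepares the system under test, and Client issues operations against it. For the others (Generator, Model, Checker), the implementations bundled with Jepsen are usually sufficient. The rest of this section looks at the DB and Client interfaces and how to implement them.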

(defprotocol DB
  (setup!     [db test node] "Set up the database on this particular node.")
  (teardown!  [db test node] "Tear down the database on this particular node."))

(defprotocol Client
  (open! [client test node]
          "Set up the client to work with a particular node. Returns a client
          which is ready to accept operations via invoke! Open *should not*
          affect the logical state of the test; it should not, for instance,
          modify tables or insert records.")
  (close! [client test]
          "Close the client connection when work is completed or an invocation
           crashes the client. Close should not affect the logical state of the
          test.")
  (setup! [client test] [client test node]
          "Called once to set up database state for testing. 3 arity form is
           deprecated and will be removed in a future jepsen version.")
  (invoke! [client test operation]
           "Apply an operation to the client, returning an operation to be
           appended to the history. For multi-stage operations, the client may
           reach into the test and conj onto the history atom directly.")
  (teardown! [client test]
           "Tear down the client when work is complete."))

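Above are the interface definitions of DB and Client.

A real system usually has its own deployment scripts and API, but they may not be written in Clojure. How should DB and Client be implemented in that case? Looking at the Jepsen tests of various open source systems, there are roughly three approaches:

1. For databases, drive the database through clojure.java.jdbc, as tidb/xdb do.

2. Implement a Clojure version of the client library, as etcd/zookeeper do.

3. Call a binary directly. braft, for example, generates its requests by calling a C++ binary.

Exec-ing a binary directly from Clojure code avoids rewriting the deployment scripts or the client in Clojure, because the existing deployment scripts and API implementation are reused as they are.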
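To make the "exec a binary" approach concrete, here is a minimal Python sketch of the contract such a wrapper relies on: run the program, then hand back its exit code and whatever it printed. The helper name `run_binary`, the timeout value, and the stand-in command are assumptions made for illustration; they are not part of Jepsen or of the code in this article.

```python
import subprocess
import sys

def run_binary(argv, timeout_s=5.0):
    """Run a command and return (exit_code, stdout_text).

    Hypothetical helper: a hung process is reported with exit code 2,
    the same code the article's binary uses for "timeout or unknown".
    """
    try:
        done = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return 2, ""
    return done.returncode, done.stdout.strip()

# Stand-in for a real client binary: a Python one-liner that prints a
# register value and exits with status 0.
fake_get = [sys.executable, "-c", "print(0)"]
code, out = run_binary(fake_get)
print(code, out)
```

The wrapper itself knows nothing about the system under test. All it needs from the binary is a stable convention for exit codes and stdout, which is exactly what the Clojure client later in this section depends on.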

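Let us now write a simple Jepsen test this way. Assume deployment is handled by a Python script named control.py and client requests are issued by calling a C++ binary. To keep things simple, both are mocked here; for a real system, replace the mock bodies with the real implementation.

The content of control.py is as follows: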

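#!/usr/bin/env python
# This script starts and stops TestService servers on a single machine. It is
# useful for developers who want to test local code changes.
import sys

def start():
    # Mock: a real script would launch the service here.
    print("start")

def stop():
    # Mock: a real script would shut the service down here.
    print("stop")

def main():
    args = sys.argv[1:]
    # Reject a missing or unknown command instead of crashing or ignoring it.
    if not args or args[0] not in ('start', 'stop'):
        print("usage: control.py start|stop")
        return 1
    if args[0] == 'start':
        start()
    else:
        stop()
    return 0

if __name__ == "__main__":
    sys.exit(main())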

The code of the C++ binary is as follows:

#include <stdint.h>
#include <stdio.h>
#include <iostream>
#include <string>

class Register
{
public:
    // Return codes shared by all methods:
    //   0: success
    //   1: failure
    //   2 or any other value: timeout or unknown outcome
    virtual ~Register() {}
    // Init must set the register's initial value.
    virtual int Init() = 0;
    // Get must print the register's value to stdout.
    virtual int Get() = 0;
    virtual int Set(int64_t value) = 0;
    virtual int Cas(int64_t oldValue, int64_t newValue) = 0;
};

class MockRegister : public Register
{
public:
    virtual int Init()
    {
        return 0;
    }
    virtual int Get()
    {
        printf("0");
        return 0;
    }
    virtual int Set(int64_t value)
    {
        return 0;
    }
    virtual int Cas(int64_t oldValue, int64_t newValue)
    {
        return 0;
    }
};

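int main(int argc, char* argv[])
{
    // Guard against a missing command before reading argv[1].
    if (argc < 2)
    {
        std::cout << "usage: jepsen_test get|set|cas" << std::endl;
        return -1;
    }
    MockRegister reg;
    reg.Init();
    if (std::string(argv[1]) == "get")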
    {
        return reg.Get();
    }
    else if (std::string(argv[1]) == "set")
    {
        return reg.Set(0);
    }
    else if (std::string(argv[1]) == "cas")
    {
        return reg.Cas(0, 0);
    }
    else
    {
        std::cout << "unexpected command " << std::endl;
        return -1;
    }
    return 0;
}

The Clojure implementations of DB and Client are as follows.

DB implementation (calls control.py):

(defn startall!
  ""
  [node]
  (info node "start TestService")
  (c/cd bin-path
        (c/exec "./control.py" "start")
        (c/exec :sleep 1))
)

(defn stopall!
  ""
  [node]
  (info node "stop TestService")
  (c/cd bin-path
        (c/exec "./control.py" "stop")
  )
)

(defn DB
    "TestService for a particular version."
    [version]
    (reify db/DB
          (setup! [_ test node]
                  (info node "installing TestService" version)
                  (doto node (startall!)))

          (teardown! [_ test node]
                  (info node "tearing down TestService")
                  (doto node (stopall!)))
      ))

Client implementation (calls jepsen_test):

(def bin-path "/root/jepsen_work_dir")

(defn reg-get!
    "get a value for id"
    [node id]
        (c/on "jepsen_control"
            (c/su
                (c/cd bin-path
                    (c/exec "./jepsen_test" "get")))))

(defn reg-set!
    "set a value for id"
    [node id value]
        (c/on "jepsen_control"
            (c/su
                (c/cd bin-path
                    (c/exec "./jepsen_test" "set")))))

(defn reg-cas!
    "cas set a value for id"
    [node id value1 value2]
        (c/on "jepsen_control"
            (c/su
                (c/cd bin-path
                    (c/exec "./jepsen_test" "cas")))))

(defrecord Client [k client]
  client/Client
  (open! [this test node]
    (assoc this :client node))
  (setup! [this test])

  (invoke! [this test op]
           (try
             (case (:f op)
               :read  (let [resp (-> client
                                     (reg-get! k))]
                        (assoc op :type :ok :value (parse-long resp)))
               :write (do (->> (:value op)
                               (reg-set! client k))
                        (assoc op :type :ok))

               :cas   (let [[value value'] (:value op)]
                         (reg-cas! client k value value')
                        (assoc op :type :ok )))
             (catch Exception e
               (let [msg (str/trim (.getMessage e))]
                 (cond
                   (str/includes? msg "returned non-zero exit status 1 on ") (assoc op :type :fail, :error :atomic-failed)
                   (str/includes? msg "returned non-zero exit status 2 on ") (assoc op :type (if (= :read (:f op)) :fail :info), :error :timed-out)
                   :else (assoc op :type :info, :error :unknow-error))))))

  (teardown! [_ test])
  (close! [_ test])
)
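The error handling in invoke! above is easy to misread, so here is the same decision table restated as a small Python sketch. The function name `classify` is hypothetical; the mapping follows the exit-code convention of the C++ binary (0 success, 1 failure, 2 or anything else timeout/unknown). A timed-out read is recorded as a failure rather than as indeterminate, which is safe because a read does not change the register.

```python
def classify(f, exit_code):
    """Map a binary's exit code to the completion type to record.

    f is the operation name: "read", "write" or "cas".
    Returns "ok", "fail" (definitely did not happen) or
    "info" (outcome unknown, may or may not have taken effect).
    """
    if exit_code == 0:
        return "ok"
    if exit_code == 1:
        return "fail"
    if exit_code == 2 and f == "read":
        # A read has no side effects, so an unknown read can be dropped.
        return "fail"
    return "info"

for f in ("read", "write", "cas"):
    print(f, [classify(f, code) for code in (0, 1, 2, 3)])
```

Recording a write or CAS with an unknown outcome as "info" matters for correctness: marking it as failed would tell the checker the operation never happened, even though it may have taken effect.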

3.2 Running

Now let us run the Jepsen test above inside the containers. The commands are:

# Create the following directory inside both containers (control node and DB node)
mkdir -p /root/jepsen_work_dir/

# Copy control.py into that directory on the DB node
docker cp control.py n1:/root/jepsen_work_dir/

# Copy the C++ binary jepsen_test into that directory on the control node
docker cp jepsen_test jepsen_control:/root/jepsen_work_dir/

# Replace /jepsen/jepsen.etcdemo/src/jepsen/etcdemo.clj in the control node container with the etcdemo.clj from the appendix
docker cp etcdemo.clj jepsen_control:/jepsen/jepsen.etcdemo/src/jepsen/etcdemo.clj

# Enter the control node and run the test
docker exec -ti jepsen_control bash
cd /jepsen/jepsen.etcdemo
lein run test -n n1

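The mock implementation reports every operation as successful, yet every Get returns 0. Such a history is not linearizable: a read can return 0 right after a successful write or CAS has set the register to a different value. The checker therefore reports an error.

The result of the run looks like this: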

:model {:msg "can't read 0 from register 2"}}]),
 :previous-ok
 {:process 0,
  :type :ok,
  :f :cas,
  :value [0 2],
  :index 5,
  :time 5137773669},
 :last-op
 {:process 0,
  :type :ok,
  :f :cas,
  :value [0 2],
  :index 5,
  :time 5137773669},
 :op
 {:process 0,
  :type :ok,
  :f :read,
  :value 0,
  :index 7,
  :time 6036683751}}


Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
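The failure reported above can be replayed by hand. The following Python sketch steps a compare-and-set register through the two operations from the output (the CAS at index 5 and the read at index 7). It is only an illustration for a sequential history: the names `step` and `first_violation` are hypothetical, and the real checker also has to search over possible orderings of concurrent operations.

```python
def step(value, op):
    """Apply one completed op to a CAS register currently holding `value`.

    Returns (new_value, None) when the op is legal, or
    (value, error_message) when it contradicts the register state.
    """
    f, v = op["f"], op.get("value")
    if f == "write":
        return v, None
    if f == "cas":
        old, new = v
        if value == old:
            return new, None
        return value, "can't CAS register %s from %s to %s" % (value, old, new)
    if f == "read":
        if v is None or v == value:
            return value, None
        return value, "can't read %s from register %s" % (v, value)
    raise ValueError("unknown op %r" % f)

def first_violation(initial, history):
    """Replay ops in order; return the first error message, or None."""
    value = initial
    for op in history:
        value, err = step(value, op)
        if err:
            return err
    return None

history = [
    {"f": "cas", "value": [0, 2]},  # index 5 above, reported as :ok
    {"f": "read", "value": 0},      # index 7, the mock always prints 0
]
print(first_violation(0, history))  # prints: can't read 0 from register 2
```

Because the mock claims the CAS from 0 to 2 succeeded, the register must hold 2 afterwards, so a later read of 0 by the same process cannot be explained by any legal register state.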

With the steps above, you should be able to set up a Jepsen test environment and try it out yourself.

4. Appendix

etcdemo.clj

(ns jepsen.etcdemo
    (:require [clojure.tools.logging :refer :all]
              [clojure.string :as str]
              [knossos.model :as model]
              [jepsen [cli :as cli]
                      [control :as c]
                      [db :as db]
                      [client :as client]
                      [generator :as gen]
                      [nemesis :as nemesis]
                      [checker :as checker]
                      [tests :as tests]]
              [jepsen.control.util :as cu]
              [jepsen.os.debian :as debian]))

(def bin-path "/root/jepsen_work_dir")

(defn parse-long
  "Parses a string to a Long. Passes through `nil`."
  [s]
  (when s (Long/parseLong s)))

(defn startall!
  ""
  [node]
  (info node "start TestService")
  (c/cd bin-path
        (c/exec "./control.py" "start")
        (c/exec :sleep 5))
)

(defn stopall!
  ""
  [node]
  (info node "stop TestService")
  (c/cd bin-path
        (c/exec "./control.py" "stop")
  )
)

(defn DB
    "TestService for a particular version."
    [version]
    (reify db/DB
          (setup! [_ test node]
                  (info node "installing TestService" version)
                  (doto node (startall!)))

          (teardown! [_ test node]
                  (info node "tearing down TestService")
                  (doto node (stopall!)))
      ))

(defn reg-get!
    "get a value for id"
    [node id]
        (c/on "jepsen_control"
            (c/su
                (c/cd bin-path
                    (c/exec "./jepsen_test" "get")))))

(defn reg-set!
    "set a value for id"
    [node id value]
        (c/on "jepsen_control"
            (c/su
                (c/cd bin-path
                    (c/exec "./jepsen_test" "set")))))

(defn reg-cas!
    "cas set a value for id"
    [node id value1 value2]
        (c/on "jepsen_control"
            (c/su
                (c/cd bin-path
                    (c/exec "./jepsen_test" "cas")))))

(defrecord Client [k client]
  client/Client
  (open! [this test node]
    (assoc this :client node))
  (setup! [this test])

  (invoke! [this test op]
           (try
             (case (:f op)
               :read  (let [resp (-> client
                                     (reg-get! k))]
                        (assoc op :type :ok :value (parse-long resp)))
               :write (do (->> (:value op)
                               (reg-set! client k))
                        (assoc op :type :ok))

               :cas   (let [[value value'] (:value op)]
                         (reg-cas! client k value value')
                        (assoc op :type :ok )))
             (catch Exception e
               (let [msg (str/trim (.getMessage e))]
                 (cond
                   (str/includes? msg "returned non-zero exit status 1 on ") (assoc op :type :fail, :error :atomic-failed)
                   (str/includes? msg "returned non-zero exit status 2 on ") (assoc op :type (if (= :read (:f op)) :fail :info), :error :timed-out)
                   :else (assoc op :type :info, :error :unknow-error))))))

  (teardown! [_ test])
  (close! [_ test])
)

(defn r [_ _] {:type :invoke, :f :read})
(defn w [_ _] {:type :invoke, :f :write, :value (rand-int 5)})
(defn cas [_ _] {:type :invoke, :f :cas, :value [(rand-int 5) (rand-int 5)]})

(defn TestService-test
  "
  A basic test
  "
  [name opts]
  (merge tests/noop-test
         {
          :name (str "TestService" name)
          :db (DB "v2.0.2")
          :client (Client. 0 nil)
          :generator (->> (gen/mix [r w cas])
                          (gen/stagger 1)
                          (gen/nemesis nil)
                          (gen/limit 60)
                          (gen/time-limit 60))
          :model (model/cas-register 0)
          :checker (checker/linearizable)
          }
          opts))

(defn TestService-base-test
  [opts]
  (TestService-test ".base" opts)
  )

(defn -main
  "I don't do a whole lot."
  [& args]
  (cli/run! (cli/single-test-cmd {:test-fn TestService-base-test})
                                 args))
