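The article on linearizability theory covered the theoretical basis of Jepsen testing. This article shows how to write and run a simple Jepsen test.
1. Clojure: introduction and getting started
Jepsen itself is written in Clojure, so being able to read Clojure is a prerequisite for understanding the framework's internals and the Jepsen test code of other open-source projects. Clojure is a functional programming language that runs on the JVM and interoperates well with Java; for more on its strengths, see this article. The letters C, L, and J in the name stand for C#, Lisp, and Java, and the spelling is also a pun on "closure". Besides Jepsen, another well-known open-source system built with Clojure is Storm; there is a write-up explaining why Storm chose Clojure.
Aphyr, the author of Jepsen, has also written an introduction to Clojure.
A few recommended introductions to Clojure:
clojure-by-example: working through it alongside a Clojure REPL is probably the quickest way to get started; Section 2 shows how to prepare a Clojure environment.
Reading Clojure Characters: Clojure has a lot of syntactic sugar, and its many symbols tend to confuse beginners; this article summarizes them.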
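2. Setting up the Jepsen environment
Running a Jepsen test requires a Java and a Clojure runtime. Setting up lein (Leiningen, Clojure's build and project automation tool) on top of a JDK provides both, because lein downloads Clojure itself as a dependency. Following the Dockerfile in the Jepsen source tree, we can build a Docker image that contains every dependency needed to run Jepsen tests and copies the Jepsen source code into /jepsen. With that image we can start a container directly on the test machine and run Jepsen tests inside it.
Start the container and run the following commands:
# start the container
sudo docker run -ti -d --hostname=jepsen_control --name=jepsen_control docker_image /usr/sbin/init
# enter the container
docker exec -ti jepsen_control bash
# go to the demo code
cd /jepsen/jepsen.etcdemo
# start the Clojure REPL
lein repl
The REPL can be used to run sample code while learning Clojure.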
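2.3 Running a Jepsen test
2.3.1 Starting the control node and DB node containers
Running a Jepsen test requires at least two Docker containers: one acting as the control node and one as a DB node.
# start the control node (skip this if it is already running from the previous step)
sudo docker run -ti -d --hostname=jepsen_control --name=jepsen_control docker_image /usr/sbin/init
# start a DB node
sudo docker run -ti -d --hostname=n1 --name=n1 docker_image /usr/sbin/init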
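2.3.2 Environment configuration
A few more settings are needed for the Jepsen test to run properly.
1) Make sure SSH from the control node to the DB node is configured correctly; see the FAQ section of https://github.com/jepsen-io/jepsen for details.
Jepsen's default SSH configuration is shown below. To use the defaults, the root user's password on the DB node must be root.
{:username "root",
 :password "root",
 :strict-host-key-checking false,
 :private-key-path nil}
2) Edit /etc/hosts in both containers and add the hostname and IP address of each container, for example: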
#cat /etc/hosts
127.0.0.1 localhost
......
192.168.5.10 jepsen_control
192.168.5.11 n1
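Before starting a test it can be worth checking that both hostnames are really defined. The following is a minimal Python sketch, not part of Jepsen: it parses hosts-file text and reports which required hostnames are missing. The names `parse_hosts` and `missing_hosts` are hypothetical, and the sample text simply mirrors the example above.

```python
def parse_hosts(text):
    """Map each hostname in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # malformed line: an address without a name
        ip, names = fields[0], fields[1:]
        for name in names:
            # Keep the first entry seen for a name.
            mapping.setdefault(name, ip)
    return mapping


def missing_hosts(text, required):
    """Return the required hostnames that the hosts text does not define."""
    known = parse_hosts(text)
    return [name for name in required if name not in known]


# Sample text mirroring the /etc/hosts example above.
SAMPLE = """\
127.0.0.1 localhost
192.168.5.10 jepsen_control
192.168.5.11 n1
"""

if __name__ == "__main__":
    print(missing_hosts(SAMPLE, ["jepsen_control", "n1"]))
```

Inside a container, the same check could be run against the real file by passing the contents of /etc/hosts instead of `SAMPLE`.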
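2.3.3 Running the Jepsen test
The /jepsen/jepsen.etcdemo directory in the container holds a default demo, which is currently just a bare skeleton. The official Jepsen tutorial starts from this skeleton, so you can work through the tutorial step by step in this directory.
We can run it directly from inside the control node container.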
docker exec -ti jepsen_control bash
cd /jepsen/jepsen.etcdemo
lein run test -n n1
This runs the simplest possible Jepsen test, although the program does nothing yet. The run should end with output like the following:
INFO [2018-06-11 11:08:22,258] jepsen results - jepsen.store Wrote /jepsen/jepsen.etcdemo/store/noop/20180611T110821.000+0800/results.edn
INFO [2018-06-11 11:08:22,260] main - jepsen.core {:valid? true}
Everything looks good! ヽ(‘ー`)ノ
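3. Jepsen by example
The official Jepsen tutorial is very detailed. Reading it in full and then trying it out hands-on is strongly recommended for a deeper understanding of Jepsen.
3.1 Implementing DB and Client
Here we simplify the tutorial example to fit our needs. A Jepsen test is built from the following main components:
Generator, DB, Client, Model, Checker
When hooking a real system up to Jepsen, the two interfaces that must be implemented are DB and Client. DB deploys and prepares the system; Client issues operations against it. For Generator, Model, and Checker, the implementations that ship with Jepsen are usually sufficient. Below we look at the DB and Client interfaces and how to implement them.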
(defprotocol DB
  (setup!    [db test node] "Set up the database on this particular node.")
  (teardown! [db test node] "Tear down the database on this particular node."))

(defprotocol Client
  (open! [client test node]
    "Set up the client to work with a particular node. Returns a client
    which is ready to accept operations via invoke! Open *should not*
    affect the logical state of the test; it should not, for instance,
    modify tables or insert records.")
  (close! [client test]
    "Close the client connection when work is completed or an invocation
    crashes the client. Close should not affect the logical state of the
    test.")
  (setup! [client test] [client test node]
    "Called once to set up database state for testing. 3 arity form is
    deprecated and will be removed in a future jepsen version.")
  (invoke! [client test operation]
    "Apply an operation to the client, returning an operation to be
    appended to the history. For multi-stage operations, the client may
    reach into the test and conj onto the history atom directly.")
  (teardown! [client test]
    "Tear down the client when work is complete."))
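Those are the interface definitions of DB and Client.
A real system usually has its own deployment scripts and API, but they may not be written in Clojure. How should DB and Client be implemented in that case? Judging from the Jepsen tests of existing open-source systems, there are a few common approaches:
1. For databases, drive the database through clojure.java.jdbc, as tidb/xdb do;
2. Implement a Clojure version of the client library, as for etcd/zookeeper;
3. Invoke a binary directly; braft, for example, generates its requests by calling a C++ binary.
Calling exec on a binary from Clojure code avoids having to write Clojure versions of the deployment scripts or the client, so the existing deployment scripts and API implementation can be reused as they are.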
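The exec-a-binary pattern itself is language independent. The Python snippet below is only an illustration of it (run a command, capture its exit status and stdout); it is not how Jepsen executes commands. `run_binary` is a hypothetical helper, and the Python interpreter stands in for a real client binary.

```python
import subprocess
import sys


def run_binary(argv, timeout=5.0):
    """Run a command and return (exit_code, trimmed_stdout)."""
    completed = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text
        timeout=timeout,
    )
    return completed.returncode, completed.stdout.strip()


if __name__ == "__main__":
    # Stand-in for a client binary that prints a value and exits with 0.
    print(run_binary([sys.executable, "-c", "print(0)"]))
    # Stand-in for a failing call that exits with status 1.
    print(run_binary([sys.executable, "-c", "import sys; sys.exit(1)"]))
```

The caller only needs two things from the binary: its exit status, which tells it how the operation ended, and its stdout, which carries any value that was read.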
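Let us now write a simple Jepsen test this way. Assume that deployment is done by a Python script, control.py, and that client requests are issued by calling a C++ binary. For simplicity both are mocked here; for a real system, replace the mock bodies with the real implementation.
control.py looks like this:
#!/usr/bin/env python
# This script runs TestService servers on a single machine. It is useful for
# developers to test their local code changes.
import sys


def start():
    # Mock: a real script would launch the TestService processes here.
    print("start")


def stop():
    # Mock: a real script would stop the TestService processes here.
    print("stop")


def main():
    args = sys.argv[1:]
    if not args:
        sys.exit("usage: control.py start|stop")
    cmd = args[0]
    if cmd == 'start':
        start()
    elif cmd == 'stop':
        stop()
    else:
        sys.exit("unexpected command " + cmd)


if __name__ == "__main__":
    main()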
The source of the C++ binary is as follows:
#include <stdint.h>
#include <stdio.h>
#include <iostream>
#include <string>

class Register
{
public:
    // return code:
    // 0 means succeeded
    // 1 means failed
    // 2 or others mean timeout or unknown

    // the register's initial value must be set
    virtual int Init() = 0;
    // the register's value must be printed to stdout
    virtual int Get() = 0;
    virtual int Set(int64_t value) = 0;
    virtual int Cas(int64_t oldValue, int64_t newValue) = 0;
};

// Mock: every operation succeeds and Get always prints 0.
class MockRegister : public Register
{
public:
    virtual int Init()
    {
        return 0;
    }
    virtual int Get()
    {
        printf("0");
        return 0;
    }
    virtual int Set(int64_t value)
    {
        return 0;
    }
    virtual int Cas(int64_t oldValue, int64_t newValue)
    {
        return 0;
    }
};

int main(int argc, char* argv[])
{
    // Guard against a missing command before touching argv[1].
    if (argc < 2)
    {
        std::cout << "usage: jepsen_test get|set|cas" << std::endl;
        return -1;
    }
    MockRegister reg;
    reg.Init();
    const std::string cmd(argv[1]);
    if (cmd == "get")
    {
        return reg.Get();
    }
    else if (cmd == "set")
    {
        return reg.Set(0);
    }
    else if (cmd == "cas")
    {
        return reg.Cas(0, 0);
    }
    else
    {
        std::cout << "unexpected command " << cmd << std::endl;
        return -1;
    }
}
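For comparison with the mock, here is a minimal Python sketch of a register that actually keeps state, using the same return-code convention (0 for success, 1 for failure). `InMemoryRegister` is a hypothetical name; it lives in a single process and only illustrates the get/set/cas semantics that the interface describes, not a real distributed register.

```python
class InMemoryRegister:
    """Single-process register following the 0 = ok, 1 = failed convention."""

    def __init__(self, initial=0):
        self.value = initial

    def get(self):
        # Returns (return_code, value); the C++ mock prints the value instead.
        return 0, self.value

    def set(self, value):
        self.value = value
        return 0

    def cas(self, old_value, new_value):
        # Compare-and-set: only write when the current value matches.
        if self.value != old_value:
            return 1
        self.value = new_value
        return 0


if __name__ == "__main__":
    reg = InMemoryRegister(0)
    print(reg.cas(0, 2), reg.get())
    print(reg.cas(0, 3), reg.get())
```

The important difference from the mock is that a cas whose expected value does not match returns 1 and leaves the stored value untouched.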
The DB and Client implementations in Clojure are as follows.
DB implementation (by calling control.py):
(defn startall!
  "Starts TestService on the node by running control.py."
  [node]
  (info node "start TestService")
  (c/cd bin-path
    (c/exec "./control.py" "start")
    (c/exec :sleep 1)))

(defn stopall!
  "Stops TestService on the node by running control.py."
  [node]
  (info node "stop TestService")
  (c/cd bin-path
    (c/exec "./control.py" "stop")))

(defn DB
  "TestService for a particular version."
  [version]
  (reify db/DB
    (setup! [_ test node]
      (info node "installing TestService" version)
      (doto node (startall!)))

    (teardown! [_ test node]
      (info node "tearing down TestService")
      (doto node (stopall!)))))
Client implementation (by calling jepsen_test):
(def bin-path "/root/jepsen_work_dir")

(defn reg-get!
  "get a value for id"
  [node id]
  (c/on "jepsen_control"
    (c/su
      (c/cd bin-path
        (c/exec "./jepsen_test" "get")))))

(defn reg-set!
  "set a value for id"
  [node id value]
  (c/on "jepsen_control"
    (c/su
      (c/cd bin-path
        (c/exec "./jepsen_test" "set")))))

(defn reg-cas!
  "cas set a value for id"
  [node id value1 value2]
  (c/on "jepsen_control"
    (c/su
      (c/cd bin-path
        (c/exec "./jepsen_test" "cas")))))

(defrecord Client [k client]
  client/Client
  (open! [this test node]
    (assoc this :client node))

  (setup! [this test])

  (invoke! [this test op]
    (try
      (case (:f op)
        :read  (let [resp (-> client
                              (reg-get! k))]
                 (assoc op :type :ok :value (parse-long resp)))
        :write (do (->> (:value op)
                        (reg-set! client k))
                   (assoc op :type :ok))
        :cas   (let [[value value'] (:value op)]
                 (reg-cas! client k value value')
                 (assoc op :type :ok)))
      (catch Exception e
        (let [msg (str/trim (.getMessage e))]
          (cond
            ; exit status 1: the operation definitely failed
            (str/includes? msg "returned non-zero exit status 1 on ")
            (assoc op :type :fail, :error :atomic-failed)

            ; exit status 2: timeout; only a read can safely be marked :fail
            (str/includes? msg "returned non-zero exit status 2 on ")
            (assoc op :type (if (= :read (:f op)) :fail :info), :error :timed-out)

            ; anything else: outcome unknown
            :else
            (assoc op :type :info, :error :unknow-error))))))

  (teardown! [_ test])

  (close! [_ test]))
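The error handling in invoke! boils down to a mapping from the binary's exit status to a Jepsen operation type. The Python sketch below restates that mapping as a pure function for clarity; `classify` is a hypothetical name, and the Clojure code above matches on the exception message rather than on a numeric status.

```python
def classify(exit_status, f):
    """Map a client binary's exit status to (op_type, error).

    f is the operation kind: "read", "write" or "cas".
    """
    if exit_status == 0:
        return "ok", None
    if exit_status == 1:
        # Definite failure: the operation did not take effect.
        return "fail", "atomic-failed"
    if exit_status == 2:
        # Timeout: a read has no side effects, so it can be treated as
        # failed; a write or cas may still have happened, so it stays
        # indeterminate ("info").
        return ("fail" if f == "read" else "info"), "timed-out"
    # Anything else is indeterminate; the error name mirrors the
    # keyword used in the Clojure code.
    return "info", "unknow-error"


if __name__ == "__main__":
    for status in (0, 1, 2, 255):
        print(status, classify(status, "read"), classify(status, "write"))
```

The distinction matters for the checker: a fail operation is known not to have happened, while an info operation may or may not have taken effect and both possibilities have to be considered.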
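3.2 Running
Now let us run the Jepsen test above in the containers. The commands are:
# create the following directory in each container (control node and DB node)
mkdir -p /root/jepsen_work_dir/
# copy control.py into that directory on the DB node
docker cp control.py n1:/root/jepsen_work_dir/
# copy the C++ binary jepsen_test into that directory on the control node
docker cp jepsen_test jepsen_control:/root/jepsen_work_dir/
# replace /jepsen/jepsen.etcdemo/src/jepsen/etcdemo.clj inside the control node container with the etcdemo.clj from the appendix
docker cp etcdemo.clj jepsen_control:/jepsen/jepsen.etcdemo/src/jepsen/etcdemo.clj
# enter the control node and run the following commands
docker exec -ti jepsen_control bash
cd /jepsen/jepsen.etcdemo
lein run test -n n1
In the mock above, every operation reports success and every Get returns 0. Once a write or CAS has "succeeded" in storing a non-zero value, a later read of 0 contradicts it, so the history violates linearizability and the checker reports the run as invalid.
The result looks like this: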
:model {:msg "can't read 0 from register 2"}}]),
:previous-ok
{:process 0,
:type :ok,
:f :cas,
:value [0 2],
:index 5,
:time 5137773669},
:last-op
{:process 0,
:type :ok,
:f :cas,
:value [0 2],
:index 5,
:time 5137773669},
:op
{:process 0,
:type :ok,
:f :read,
:value 0,
:index 7,
:time 6036683751}}
Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
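The model error can be reproduced by hand. The Python sketch below replays a completed, single-process history against a compare-and-set register model and stops at the first operation the model cannot accept. It is a deliberately simplified illustration: a real linearizability checker also has to search over the possible orderings of concurrent operations. The function name `replay`, the starting value of 0, and the message format (modelled on the output above) are assumptions, and the sample history contains only the two operations visible in the excerpt.

```python
def replay(history, initial=0):
    """Apply ops in order to a cas-register; return an error string or None."""
    value = initial
    for op in history:
        f, v = op["f"], op.get("value")
        if f == "write":
            value = v
        elif f == "cas":
            old, new = v
            if value != old:
                return "can't CAS %s from %s to %s" % (value, old, new)
            value = new
        elif f == "read":
            if v != value:
                return "can't read %s from register %s" % (v, value)
        else:
            raise ValueError("unknown operation %r" % (f,))
    return None


# The two completed operations shown in the excerpt above.
HISTORY = [
    {"process": 0, "type": "ok", "f": "cas", "value": [0, 2]},
    {"process": 0, "type": "ok", "f": "read", "value": 0},
]

if __name__ == "__main__":
    print(replay(HISTORY))
```

After the successful cas from 0 to 2 the model holds 2, so the mock's read of 0 is exactly the step the model rejects.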
With the steps above, you should be able to set up a Jepsen test environment and try it out hands-on.
4. Appendix
etcdemo.clj
(ns jepsen.etcdemo
  (:require [clojure.tools.logging :refer :all]
            [clojure.string :as str]
            [knossos.model :as model]
            [jepsen [cli :as cli]
                    [control :as c]
                    [db :as db]
                    [client :as client]
                    [generator :as gen]
                    [nemesis :as nemesis]
                    [checker :as checker]
                    [tests :as tests]]
            [jepsen.control.util :as cu]
            [jepsen.os.debian :as debian]))

(def bin-path "/root/jepsen_work_dir")

(defn parse-long
  "Parses a string to a Long. Passes through `nil`."
  [s]
  (when s (Long/parseLong s)))

(defn startall!
  "Starts TestService on the node by running control.py."
  [node]
  (info node "start TestService")
  (c/cd bin-path
    (c/exec "./control.py" "start")
    (c/exec :sleep 5)))

(defn stopall!
  "Stops TestService on the node by running control.py."
  [node]
  (info node "stop TestService")
  (c/cd bin-path
    (c/exec "./control.py" "stop")))

(defn DB
  "TestService for a particular version."
  [version]
  (reify db/DB
    (setup! [_ test node]
      (info node "installing TestService" version)
      (doto node (startall!)))

    (teardown! [_ test node]
      (info node "tearing down TestService")
      (doto node (stopall!)))))

(defn reg-get!
  "get a value for id"
  [node id]
  (c/on "jepsen_control"
    (c/su
      (c/cd bin-path
        (c/exec "./jepsen_test" "get")))))

(defn reg-set!
  "set a value for id"
  [node id value]
  (c/on "jepsen_control"
    (c/su
      (c/cd bin-path
        (c/exec "./jepsen_test" "set")))))

(defn reg-cas!
  "cas set a value for id"
  [node id value1 value2]
  (c/on "jepsen_control"
    (c/su
      (c/cd bin-path
        (c/exec "./jepsen_test" "cas")))))

(defrecord Client [k client]
  client/Client
  (open! [this test node]
    (assoc this :client node))

  (setup! [this test])

  (invoke! [this test op]
    (try
      (case (:f op)
        :read  (let [resp (-> client
                              (reg-get! k))]
                 (assoc op :type :ok :value (parse-long resp)))
        :write (do (->> (:value op)
                        (reg-set! client k))
                   (assoc op :type :ok))
        :cas   (let [[value value'] (:value op)]
                 (reg-cas! client k value value')
                 (assoc op :type :ok)))
      (catch Exception e
        (let [msg (str/trim (.getMessage e))]
          (cond
            ; exit status 1: the operation definitely failed
            (str/includes? msg "returned non-zero exit status 1 on ")
            (assoc op :type :fail, :error :atomic-failed)

            ; exit status 2: timeout; only a read can safely be marked :fail
            (str/includes? msg "returned non-zero exit status 2 on ")
            (assoc op :type (if (= :read (:f op)) :fail :info), :error :timed-out)

            ; anything else: outcome unknown
            :else
            (assoc op :type :info, :error :unknow-error))))))

  (teardown! [_ test])

  (close! [_ test]))

(defn r   [_ _] {:type :invoke, :f :read})
(defn w   [_ _] {:type :invoke, :f :write, :value (rand-int 5)})
(defn cas [_ _] {:type :invoke, :f :cas, :value [(rand-int 5) (rand-int 5)]})

(defn TestService-test
  "A basic test"
  [name opts]
  (merge tests/noop-test
         {:name      (str "TestService" name)
          :db        (DB "v2.0.2")
          :client    (Client. 0 nil)
          :generator (->> (gen/mix [r w cas])
                          (gen/stagger 1)
                          (gen/nemesis nil)
                          (gen/limit 60)
                          (gen/time-limit 60))
          :model     (model/cas-register 0)
          :checker   (checker/linearizable)}
         opts))

(defn TestService-base-test
  [opts]
  (TestService-test ".base" opts))

(defn -main
  "I don't do a whole lot."
  [& args]
  (cli/run! (cli/single-test-cmd {:test-fn TestService-base-test})
            args))
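What the generator in the listing produces can be pictured with a small Python sketch: a random mix of read, write, and cas invocations with values drawn from 0 to 4, capped at a fixed number of operations. It only mimics the shape of the operations and does not reproduce Jepsen's generator scheduling (staggering, time limits, dispatch to processes). All function names here are hypothetical.

```python
import random


def r(rng):
    return {"type": "invoke", "f": "read"}


def w(rng):
    return {"type": "invoke", "f": "write", "value": rng.randrange(5)}


def cas(rng):
    return {"type": "invoke", "f": "cas",
            "value": [rng.randrange(5), rng.randrange(5)]}


def mixed_ops(limit, seed=None):
    """Yield `limit` operations, each chosen at random from r, w and cas."""
    rng = random.Random(seed)
    for _ in range(limit):
        yield rng.choice([r, w, cas])(rng)


if __name__ == "__main__":
    for op in mixed_ops(5, seed=1):
        print(op)
```

Each cas carries an [expected new] pair, which is the same shape the Client destructures in its :cas branch.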