Space's Blue Sky

Wednesday, October 28, 2009

lightcloud-c as a consistent hash server

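I recently rewrote lightcloud-python in C and turned it into a consistent-hash server. It reaches about 10,000 set and get operations per second. The next step is to implement the socket interface.
That way lightcloud can be used together with tokyotyrant + tokyocabinet, memcached, redis, and so on.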
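As a rough illustration of what a consistent-hash server does, here is a minimal Python sketch of a hash ring with virtual nodes. It is not the lightcloud-c code: the class name `HashRing`, the `replicas` parameter, the use of MD5 and the node names are all assumptions made for this example.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._points = []   # sorted hash positions on the ring
        self._owner = {}    # hash position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        # MD5 is used only to spread keys evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node):
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            if self._owner.get(point) == node:
                del self._owner[point]
                self._points.remove(point)

    def get_node(self, key):
        if not self._points:
            raise ValueError("ring is empty")
        # The first ring position after the key's hash owns the key.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


# Hypothetical backend names, one per storage system mentioned above.
ring = HashRing(["tyrant-1", "memcached-1", "redis-1"])
print(ring.get_node("user:42"))
```

The point of the ring is that removing one backend only remaps the keys that backend owned; keys on the other backends stay where they were.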

Monday, August 31, 2009

php for lightcloud

See it here: http://code.google.com/p/lightcloud-phpclient/
It is simply a PHP port of lightcloud's Python client; more features will be added later.

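Thursday, April 16, 2009

I couldn't be bothered to insert the pictures one by one. A version with pictures is available here:
http://docs.google.com/Doc?id=ddxvh3dd_6gdnrs8f8&hl=zh_CN

A Brief Evaluation of Cassandra (2009/4/16)
1. Installation
a) Environment
i. Windows Vista SP1 (installing on Linux would really be better, but my laptop only has Vista and OS X)
ii. Java JDK 1.7 b50 (http://www.java.net/download/jdk7/binaries/jdk-7-ea-bin-b50-windows-i586-05_mar_2009.exe). Be sure to install this exact build: the program is developed against this JDK, which differs a lot from earlier ones, and with any other JDK many classes cannot be found. The author has apparently said this restriction will be removed, after which other JDKs will work too.
iii. Eclipse 3.3 (any version will do)
iv. Firefox 3.3 (only used to look at the HTTP interface that Cassandra provides)
b) Configuration
i. First check out Cassandra from svn (http://svn.apache.org/repos/asf/incubator/cassandra/trunk) into a directory named cassandra. I checked out the last revision of March 27; I have forgotten the exact revision number.
ii. In Eclipse, create a new project and choose "Java Project from Existing Ant Buildfile".
iii. For the Ant buildfile, select build.xml in the cassandra directory, enter cassandra20090327 as the project name, and import it.
iv. My Eclipse workspace is E:\Program\Java2, so create a folder named "null" under E:\Program\Java2\cassandra20090327, then copy the two files storage-conf.xml and log4j.properties from conf in the cassandra directory into cassandra20090327/null. Open storage-conf.xml; the relevant settings are:
1. Fill in the host names. I have two machines, so I wrote (IP addresses work too):
a) lizhenyu-PC
b) lizhenyu-IBM
2. List the names of the tables you want to create or read, together with their column families and column types. The defaults are fine.
3. This is the port used when accessing the node from Firefox. The default is fine.
v. After configuring, run CassandraServer.java from the service package in Eclipse. If you see output like: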

DEBUG - Submitting a major compaction task ...
DEBUG - Started compaction ...RecycleColumnFamily
DEBUG - Submitting a major compaction task ...
DEBUG - Finished compaction ...RecycleColumnFamily
DEBUG - Started compaction ...Super1
DEBUG - Finished compaction ...Super1
DEBUG - Done reading the block indexes, Index has been created
DEBUG - INDEX LOAD TIME: 13 ms.
DEBUG - Submitting a major compaction task ...
DEBUG - Started compaction ...Standard2
DEBUG - Finished compaction ...Standard2
DEBUG - Submitting a major compaction task ...
DEBUG - Started compaction ...Super2
DEBUG - Submitting a major compaction task ...
DEBUG - Finished compaction ...Super2
DEBUG - Submitting a major compaction task ...
DEBUG - Started compaction ...Standard1
DEBUG - Submitting a major compaction task ...
DEBUG - Finished compaction ...Standard1
DEBUG - Submitting a major compaction task ...
DEBUG - Started compaction ...LocationInfo
DEBUG - Submitting a major compaction task ...
DEBUG - Finished compaction ...LocationInfo
DEBUG - Started compaction ...StandardByTime1
DEBUG - Finished compaction ...StandardByTime1
DEBUG - Submitting a major compaction task ...
DEBUG - Started compaction ...TableMetadata
DEBUG - Finished compaction ...TableMetadata
DEBUG - Started compaction ...StandardByTime2
DEBUG - Finished compaction ...StandardByTime2
DEBUG - Started compaction ...HintsColumnFamily
DEBUG - Finished compaction ...HintsColumnFamily
DEBUG - Starting to listen on lizhenyu-IBM:7001
INFO - Running on stage GMFD
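then it is running successfully. Configure both machines this way in turn; my laptop is named lizhenyu-IBM and my desktop lizhenyu-PC. If both are configured correctly, you will see the messages they exchange in the Eclipse console.
2. Testing
a) First start CassandraServer.java on both lizhenyu-PC and lizhenyu-IBM.
i. Entering http://lizhenyu-ibm:7002/ or http://lizhenyu-pc:7002/ in Firefox shows information about this tiny cluster.
ii. I entered http://lizhenyu-ibm:7002/, so the SQL statements I wrote also act on the IBM machine.
iii. Click SQL to open the data insertion and query page.
iv. Enter the information shown in the screenshot (see the illustrated version linked at the top of this post).
v. Click send; the page shows: The insert was successful : columnfamily=Standard1 key=1 data=xyz
vi. Enter the query in the query box (see the screenshot).
vii. Click send; the result is shown in the screenshot.
viii. This shows that operations on the table work.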
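b) org.apache.cassandra.test contains a lot of code for testing Cassandra. I tested SSTableTest.java (you can try the others by following this example; StressTest.java still has some problems that I am discussing with the author).
i. By default, SSTableTest.java runs this code: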
    // Fill a Bloom filter with the keys "0" .. "1048575".
    BloomFilter bf = new BloomFilter(1024*1024, 15);
    for ( int i = 0; i < 1024*1024; ++i )
    {
        bf.fill(Integer.toString(i));
    }

    // Serialize the filter and append it to a file on disk.
    DataOutputBuffer bufOut = new DataOutputBuffer();
    BloomFilter.serializer().serialize(bf, bufOut);
    FileOutputStream fos = new FileOutputStream("C:\\Engagements\\bf.dat", true);
    fos.write(bufOut.getData(), 0, bufOut.getLength());
    fos.close();

    // Read the file back and deserialize it into a second filter.
    FileInputStream fis = new FileInputStream("C:\\Engagements\\bf.dat");
    byte[] bytes = new byte[fis.available()];
    fis.read(bytes);
    DataInputBuffer bufIn = new DataInputBuffer();
    bufIn.reset(bytes, bytes.length );
    BloomFilter bf2 = BloomFilter.serializer().deserialize(bufIn);

    // Count the keys reported present (note: this queries bf, not bf2).
    int count = 0;
    for ( int i = 0; i < 1024*1024; ++i )
    {
        if ( bf.isPresent(Integer.toString(i)) )
            ++count;
    }
    System.out.println(count);
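This simply serializes and deserializes data: the serialized bytes are stored on disk, and when the data needs to be read the file is deserialized so it can be used normally. Storing and fetching an SSTable is the same kind of serialize/deserialize process.
The correct output is 1048576; if you get something else, check your setup.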
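To make the round trip concrete without the Cassandra classes, here is a small self-contained Python sketch of a Bloom filter that is serialized and read back. It is only an analogy: the class, the byte layout and the SHA-256 based hashing are assumptions of this sketch, not Cassandra's implementation.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: `num_bits` bits and `num_hashes` hash functions."""

    def __init__(self, num_bits, num_hashes, bits=None):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(bits) if bits is not None else bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive all positions from two halves of one digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def fill(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def is_present(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

    def serialize(self):
        # Hypothetical layout: two 4-byte big-endian integers, then the bit array.
        header = self.num_bits.to_bytes(4, "big") + self.num_hashes.to_bytes(4, "big")
        return header + bytes(self.bits)

    @classmethod
    def deserialize(cls, data):
        num_bits = int.from_bytes(data[:4], "big")
        num_hashes = int.from_bytes(data[4:8], "big")
        return cls(num_bits, num_hashes, data[8:])


bf = BloomFilter(16 * 1024, 5)
for i in range(1000):
    bf.fill(str(i))

bf2 = BloomFilter.deserialize(bf.serialize())
count = sum(bf2.is_present(str(i)) for i in range(1000))
print(count)
```

A Bloom filter never reports a stored key as absent, so the count equals the number of inserted keys. That is the same reason the Java test prints 1048576.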
ii. Comment out that code and uncomment LogUtil.init(), readSSTable() and rawSSTableWrite(), so that it becomes:
    LogUtil.init();
    //DatabaseDescriptor.init();
    //hashSSTableWrite();
    rawSSTableWrite();
    readSSTable();
Change the body of rawSSTableWrite() to:
    private static void rawSSTableWrite() throws Throwable
    {
        SSTable ssTable = new SSTable("C:\\Engagements\\Cassandra", "Table-Test-2");
        DataOutputBuffer bufOut = new DataOutputBuffer();
        BloomFilter bf = new BloomFilter(1000, 8);
        byte[] bytes = new byte[64*1024];
        Random random = new Random();
        // Write keys "100" .. "999", each with a single column "C".
        for ( int i = 100; i < 1000; ++i )
        {
            String key = Integer.toString(i);
            ColumnFamily cf = new ColumnFamily("Standard1", "Standard");
            bufOut.reset();
            // random.nextBytes(bytes);
            cf.addColumn("C", "Avinash Lakshman is a good man".getBytes(), i);
            ColumnFamily.serializerWithIndexes().serialize(cf, bufOut);
            ssTable.append(key, bufOut);
            bf.fill(key);
        }
        ssTable.close(bf);
    }

The output looks like:
KEY:100
Standard1
C
KEY:101
Standard1
C
KEY:102
Standard1
C
……
……
……

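This version of the code has a problem. For some reason the author assumes that because 101 is smaller than 102, the MD5 value of 101 will also be smaller than that of 102. That is a strange assumption; no such relation holds. If you get an exception like "key is not written by ascending.", you can change the decorateKey(String key) function in the org.apache.cassandra.dht package:
Original code:
    public String decorateKey(String key)
    {
        return hash(key).toString() + ":" + key;
    }
Modified code:
    public String decorateKey(String key)
    {
        //return hash(key).toString() + ":" + key;
        return key + ":" + hash(key);
    }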
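The ordering issue can be reproduced outside Cassandra. The Python sketch below uses MD5 as a stand-in for `hash(key)` (the helper names are made up for this example) and checks whether keys appended in the order 100..999 are still ascending after each style of decoration.

```python
import hashlib


def md5_int(key):
    # Stand-in for hash(key) in the Java code; MD5 is an assumption here.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def decorate_hash_first(key):
    return f"{md5_int(key)}:{key}"


def decorate_key_first(key):
    return f"{key}:{md5_int(key)}"


def is_ascending(items):
    return all(a <= b for a, b in zip(items, items[1:]))


# Same key range as rawSSTableWrite(), appended in numeric order.
keys = [str(i) for i in range(100, 1000)]
hash_first = [decorate_hash_first(k) for k in keys]
key_first = [decorate_key_first(k) for k in keys]

print(is_ascending(hash_first))  # hash order is unrelated to key order
print(is_ascending(key_first))   # three-digit keys sort the same as their prefixes
```

With the hash in front, appending keys in numeric order produces decorated keys in effectively random order, which is what trips the ascending-order check. Putting the key in front keeps the append order, at the cost of no longer sorting by hash.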
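c) You can try running the other test files yourself; I won't write them up here. They test the memtable, a simple stress test of Cassandra, data import and export, and so on.
3. How to use Cassandra, or rather its interface
a) Cassandra's interface is actually very simple. It all lives in Cassandra.java in org.apache.cassandra.service and mainly consists of the following calls:
get_slice, get_slice_by_names, get_column, get_column_count, insert, batch_insert, batch_insert_blocking, remove, get_columns_since, get_slice_super, get_slice_super_by_names, get_superColumn, batch_insert_superColumn, batch_insert_superColumn_blocking, touch, getStringProperty, getStringListProperty, describeTable, executeQuery. For how to use them, see the examples in the test package, which cover them all; the names should also make it clear what each one does.
In a couple of days I will write a complete program that uses these interfaces, but I am busy with my graduation project at the moment ^)^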
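4. Performance testing
a) These are all development builds with lots of println statements, so having one machine insert 100,000 rows into the other took more than 100 seconds. No Cassandra benchmark can be found online yet. Avinash Lakshman says Facebook has an internal benchmark of Cassandra against MySQL, but he does not know whether it can be published and probably has to ask his boss. I am waiting to hear back.
5. Open questions
a) The SSTable format: at one point the source comments say the index is placed after all the blocks, and later they say there is an index every 128 bytes (one block). Confusing; I will reread the source.
b) There is still a problem with message serialization. Jonathan Ellis says the source code is fine, but my program never gets through it.
c) I have not yet understood Cassandra's failure detection mechanism. Neophytos Demetriou suggested I read the paper "The φ accrual failure detector" (http://ddg.jaist.ac.jp/pub/HDY+04.pdf). If you are interested, have a look; it helps explain why the source code is written the way it is.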
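As a reading aid for the accrual idea, here is a very small Python sketch. Instead of a yes/no verdict, the detector outputs a suspicion level phi = -log10(P(no heartbeat yet after t seconds)). As far as I understand, the paper estimates that probability from a window of observed arrival intervals; this sketch swaps in a simpler exponential model, and the function names and the threshold of 8 are assumptions of the example, not taken from Cassandra.

```python
import math


def phi(elapsed, mean_interval):
    """Suspicion level after `elapsed` seconds without a heartbeat.

    Assumes exponentially distributed heartbeat gaps, so
    P(gap > t) = exp(-t / mean) and phi = -log10(P) = t / (mean * ln 10).
    """
    if mean_interval <= 0:
        raise ValueError("mean_interval must be positive")
    return elapsed / (mean_interval * math.log(10))


def is_suspected(elapsed, mean_interval, threshold=8.0):
    # Hypothetical threshold: higher means fewer false suspicions, slower detection.
    return phi(elapsed, mean_interval) >= threshold


print(phi(1.0, 1.0), phi(10.0, 1.0), is_suspected(30.0, 1.0))
```

The application picks the threshold, which is how the same detector can serve both cautious and aggressive consumers.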
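6. Follow-up work
a) Design the task Qingxiang gave me last time on top of Cassandra and see how it works out (although it is not yet possible to tell whether performance beats MySQL).
b) Keep digging into the source code. If we are to use Cassandra, many parameter changes can affect the performance of the whole server, for example the block size in SSTable, which defaults to 128 bytes and has no external interface for changing it. Deploying Cassandra still takes a lot of work: we need to define our own interfaces and analyze our own user numbers and workload to see what suits Cassandra.
c) Thrift: since I have been driving Cassandra from Java, I skipped the Thrift step entirely. Next I need to learn Thrift in detail, which will make it easy to use Cassandra from other languages.
7. As for whether it can be deployed, I cannot conclude yet. On one hand I have not obtained Facebook's benchmark; on the other, Facebook surely has other measures it has not published. I believe Cassandra itself is sound. What matters is understanding its internals, how to use it, how to modify it to fit our conditions, and how to build things that work alongside it. I think those are what we need to care about.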

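Sunday, April 12, 2009

It turns out 10,000 yuan is not that much. [Reposted from Xiaonei]

Here is my analysis of an ordinary life in Shanghai on a monthly salary of 10,000:

(1) Monthly take-home income (RMB)

Income: 10,000 yuan
Social insurance deductions: pension 8%, medical 2%, unemployment 1%
Based on three times the latest (2008) average wage, the contribution base is capped at 9,876 yuan.
So the deduction is 9876*11%=1086.36
Housing fund: the deduction is capped at 607 yuan
(at 7% this implies a contribution base cap of 8,671.5 yuan; it will soon be raised in 2009)
Taxable wage: 10000-1086.36-607=8306.64 yuan
Income tax: 886.328 yuan
Take-home income: 8306.64-886.328=7420.312 yuan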
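The post does not show how the 886.328 yuan of tax is obtained. The figure is consistent with a 2,000 yuan monthly exemption and a 20% bracket with a 375 yuan quick deduction, which I believe were the rules in force at the time; treat those three parameters as my assumption. A short Python check of the arithmetic:

```python
def take_home(gross=10000.0, insurance_base_cap=9876.0, insurance_rate=0.11,
              housing_fund=607.0, exemption=2000.0, rate=0.20, quick_deduction=375.0):
    """Return (insurance, taxable_wage, tax, net) for one month, in yuan.

    exemption, rate and quick_deduction are assumed values, not from the post.
    """
    insurance = min(gross, insurance_base_cap) * insurance_rate
    taxable_wage = gross - insurance - housing_fund
    tax = (taxable_wage - exemption) * rate - quick_deduction
    return insurance, taxable_wage, tax, taxable_wage - tax


insurance, taxable_wage, tax, net = take_home()
print(round(insurance, 2), round(taxable_wage, 2), round(tax, 3), round(net, 3))
```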
　　
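(2) Monthly living costs, for someone with no house and no car, living at a just-about-acceptable level.

1. Housing: renting a fully furnished one-bedroom flat in Xuhui District costs at least 1,500 yuan
(if you have bought a home and pay the mortgage yourself, it depends, but the average is probably 2,000 yuan)

2. Water, electricity, gas, broadband, cable TV, sanitation fee:
Electricity: about 110 yuan
(air conditioner, refrigerator, TV, water heater, washing machine, range hood, computer, phone; electric blankets, water dispensers, rice cookers and the like are also heavy consumers)
Water: showers and cooking use a lot, 50 yuan
Gas: 20 yuan if you cook often
Broadband: 120 yuan
Cable TV: 18 yuan
Sanitation fee: 5 yuan
Total: 110+50+20+120+18+5=323 yuan

3. Transport: mostly by bicycle,
but allowing for the occasional subway, bus or taxi, for example on weekends, in the rain, or after working late, call it 200 yuan on average. More if you drive.

4. Food and drink:
Meals, for 22 working days:
Breakfast 5 yuan
Lunch 15 yuan
Dinner 15 yuan if you cook, otherwise 20-30 yuan eating out; take 20 yuan as a compromise
For the four weekends, eating out, drinks and movies at 100 yuan a day: call it 600 yuan.
Fruit and supermarket snacks: at least 80 yuan a week, given the current price of fruit and yogurt; call it 300 yuan.
Subtotal: (5+15+20)*22+600+300=1780 yuan

5. Daily necessities: books, household items, washing powder, toothpaste, shampoo, toilet paper and so on: 100 yuan.

6. Clothes and shoes: I'll count 200 yuan a month, which is very low.

7. Mobile phone: 100 yuan

8. Socializing: without a girlfriend, at least 200 yuan a month with classmates and colleagues; with a girlfriend, at least 500 yuan. I'll take the middle, 350 yuan.

9. Special occasions: gifts for birthdays, Valentine's Day, Christmas, weddings, births and so on. At least 300 yuan per person, and 500 or 800 for a boss. At about 2,000 yuan a year that is 200 yuan a month, which is low enough.

10. For parents: the minimum of 600 yuan, a token gesture.

11. Medical: even a cold costs over 200 now; spread out, 100 yuan a month. Hopefully none.

12. Travel: say three short trips a year, each at least 500 for lodging, fares, food and shopping; call it 100 yuan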
　　
Total of the above: 1500+323+200+1780+100+200+100+350+200+600+100+100=5553 yuan

After fixed expenses, what is left each month is 7420.312-5553=1867.312 yuan
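Re-adding the twelve line items is a quick way to check the total. The Python sketch below only repeats the figures listed above; the English category names are my own labels.

```python
expenses = {
    "housing": 1500,
    "utilities": 110 + 50 + 20 + 120 + 18 + 5,
    "transport": 200,
    "food": (5 + 15 + 20) * 22 + 600 + 300,
    "daily necessities": 100,
    "clothes and shoes": 200,
    "mobile phone": 100,
    "socializing": 350,
    "special occasions": 200,
    "parents": 600,
    "medical": 100,
    "travel": 100,
}

total = sum(expenses.values())
remaining = 7420.312 - total   # take-home income from part (1)
print(total, round(remaining, 3))
```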
　　
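Anyone who has lived in Shanghai will recognize my standard: a lower-middle standard of living,
since it includes no entertainment or study expenses.

If you have bought a home or a car, have a child, make other investments, socialize more, travel more, carry a heavier family burden, plan to buy appliances or valuables, smoke and drink, have a girlfriend who spends a lot, or get cheated or robbed, please adjust the numbers yourself; applying for a few more credit cards becomes quite necessary :-)

If none of that applies, congratulations: in theory you can keep roughly 2,000 yuan a month, haha.
But be sure to treat it as a reserve for risks and surprises and never spend it casually;
what if you need to buy something special this month?

Housing in Shanghai: around Xuhui, second-hand homes go for about 12,000 yuan. Think that over yourself.

Best to go straight home after work every day.
Ajisen Ramen and KFC are luxuries; rice with toppings and Lanzhou noodles are the staples when eating out...

Otherwise you will find that roughly 2,000 yuan is far too easy to spend in Shanghai. It averages under 70 yuan a day:
without noticing, without feeling it, one impulsive order at dinner and it's gone!!
The month isn't even over, and a look at your card... the balance has somehow dropped to single digits!!!! Broke before payday. Howl!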

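Friday, April 3, 2009

Sharing files between host and guest in VirtualBox (host is Vista)

I went through many web pages and tried many times before it worked. A lot of what is written online is wrong: either the spaces are missing or single quotes are used.
Command format on the host:
VBoxManage sharedfolder add "vmname" -name "sharename" -hostpath "sharefolderpath"
For example, my VirtualBox is installed under Program Files on drive D, so first change to the d:\program files\sun\xvm virtualbox directory and enter:
vboxmanage.exe sharedfolder add "ubuntu" -name "cassandra" -hostpath "e:\cassandra\cassandra20090403"

On success it prints all rights reserved...

The guest is Ubuntu, so in a terminal enter:
mount -t vboxsf "sharename" "mountpoint"
where mountpoint is a location that is already mounted.
I used sudo mount -t vboxsf "cassandra" "cdrom"
which mounts it straight onto the CD-ROM drive.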

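Wednesday, March 25, 2009

memcached performance test

I found a machine to test memcached on. The test data was 30,000 records, about 100 MB in total. I gave memcached 300 MB of memory, and in practice it used about 200 MB. In other words each 1 MB of data takes about 2 MB once held in memcached, which presumably includes the space for indexes and so on.

The program accesses the database at random. Without memcached, 1,000,000 iterations take 3,000 seconds; with memcached, it takes 10,000,000 iterations to reach 3,000 seconds. So when memory is more than twice the size of the database, and all data is held in memory and set never to expire, memcached can be 10 times as fast as going without. In practice, though, the data on disk is far larger than memory, and the expiry time can hurt the hit rate; especially when traffic is low, using memcached can actually make the program slower overall.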
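The trade-off can be put into a back-of-the-envelope model. The Python sketch below derives per-lookup times from the two measurements above and assumes that a cache miss costs one cache probe plus one database query. That cost model and the function names are my assumptions, not something measured in the test.

```python
DB_MS = 3000 * 1000 / 1_000_000       # 3.0 ms per lookup without memcached
CACHE_MS = 3000 * 1000 / 10_000_000   # 0.3 ms per lookup when every lookup hits


def expected_latency_ms(hit_rate, cache_ms=CACHE_MS, db_ms=DB_MS):
    """Average lookup time when a miss pays for the cache probe and the database."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return cache_ms + (1.0 - hit_rate) * db_ms


def break_even_hit_rate(cache_ms=CACHE_MS, db_ms=DB_MS):
    """Hit rate below which caching is slower than querying the database directly."""
    return cache_ms / db_ms


for rate in (1.0, 0.5, 0.05):
    print(rate, expected_latency_ms(rate))
print(break_even_hit_rate())
```

Under this model the cache only pays off once the hit rate exceeds about 10%, which matches the observation that a low hit rate can make the whole program slower.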

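When using memcached on Windows, I found that whatever I configured, memcached always got a fixed 64 MB (its default). Strange, and I don't know of a fix, so I ran the test on Mac OS X 10.5.6 instead (hardware: AMD 3000+, 2 GB DDR667 memory, 500 GB SATA2 7200 rpm disk).

So for a small website it is best not to use memcached, especially when memory is much smaller than the hot data (data that users read frequently). Of course, for tens of millions of visits a day I think memcached can work very well, and going distributed makes the shared memory larger and the effect better. One more point: when memory is smaller than the data on disk, CPU usage suddenly rises, possibly because the LRU algorithm has to run (replacing data already in memory). Many things only become clear in practice; it was only through this test that I found memcached is not suitable for the small website I wanted to improve (only about 1,000,000 visits a day).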

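Monday, March 16, 2009

An introduction to memcached

Today I translated a small part of the memcached protocol document; I will work through the rest gradually. The original text follows, with my notes inline.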

protocol
-----------

Clients of memcached communicate with server through TCP connections.
(A UDP interface is also available; details are below under "UDP
protocol.") A given running memcached server listens on some
(configurable) port; clients connect to that port, send commands to
the server, read responses, and eventually close the connection.

There is no need to send any command to end the session. A client may
just close the connection at any moment it no longer needs it. Note,
however, that clients are encouraged to cache their connections rather
than reopen them every time they need to store or retrieve data. This
is because memcached is especially designed to work very efficiently
with a very large number (many hundreds, more than a thousand if
necessary) of open connections. Caching connections will eliminate the
overhead associated with establishing a TCP connection (the overhead
of preparing for a new connection on the server side is insignificant
compared to this).

There are two kinds of data sent in the memcache protocol: text lines
and unstructured data. Text lines are used for commands from clients
and responses from servers. Unstructured data is sent when a client
wants to store or retrieve data. The server will transmit back
unstructured data in exactly the same way it received it, as a byte
stream. The server doesn't care about byte order issues in
unstructured data and isn't aware of them. There are no limitations on
characters that may appear in unstructured data; however, the reader
of such data (either a client or a server) will always know, from a
preceding text line, the exact length of the data block being
transmitted.
(Note: the length travels in the text line itself, as the byte count field of a storage command line or of a VALUE response line, both described below.)

Text lines are always terminated by \r\n. Unstructured data is _also_
terminated by \r\n, even though \r, \n or any other 8-bit characters
may also appear inside the data. Therefore, when a client retrieves
data from a server, it must use the length of the data block (which it
will be provided with) to determine where the data block ends, and not
the fact that \r\n follows the end of the data block, even though it
does.
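Why the length matters can be shown in a few lines. The Python sketch below (the function name is made up for this example) cuts a data block out of a buffer by its declared length and only verifies the trailing CRLF instead of searching for it.

```python
def read_data_block(buf, start, length):
    """Return (data, next_offset) for a block of `length` bytes at `start`.

    The block is delimited by its declared length; the CRLF terminator
    is verified afterwards, never searched for.
    """
    end = start + length
    if len(buf) < end + 2:
        raise ValueError("incomplete data block")
    if buf[end:end + 2] != b"\r\n":
        raise ValueError("data block is not followed by CRLF")
    return buf[start:end], end + 2


payload = b"ab\r\ncd"              # the value itself contains a CRLF
wire = payload + b"\r\n"           # what the server sends after the text line

print(read_data_block(wire, 0, len(payload)))
print(wire.split(b"\r\n")[0])      # naive splitting truncates the value
```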

Keys
----
Data stored by memcached is identified with the help of a key. A key
is a text string which should uniquely identify the data for clients
that are interested in storing and retrieving it. Currently the
length limit of a key is set at 250 characters (of course, normally
clients wouldn't need to use such long keys); the key must not include
control characters or whitespace.
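A client can reject bad keys before sending anything. The Python sketch below encodes the two rules stated here; treating the 250-character limit as 250 bytes and rejecting the empty key are assumptions of the sketch, and `is_valid_key` is a hypothetical helper name.

```python
MAX_KEY_LENGTH = 250


def is_valid_key(key):
    """True if `key` (bytes) satisfies the limits stated in the protocol text."""
    if not 0 < len(key) <= MAX_KEY_LENGTH:
        return False
    # Bytes 0-32 cover the ASCII control characters and space; 127 is DEL.
    return all(b > 32 and b != 127 for b in key)


print(is_valid_key(b"user:42"), is_valid_key(b"user 42"), is_valid_key(b"k" * 251))
```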

Commands
--------

There are three types of commands.

Storage commands (there are six: "set", "add", "replace", "append",
"prepend" and "cas") ask the server to store some data identified by a
key. The client sends a command line, and then a data block; after
that the client expects one line of response, which will indicate
success or failure.

Retrieval commands (there are two: "get" and "gets") ask the server to
retrieve data corresponding to a set of keys (one or more keys in one
request). The client sends a command line, which includes all the
requested keys; after that for each item the server finds it sends to
the client one response line with information about the item, and one
data block with the item's data; this continues until the server
finished with the "END" response line.

All other commands don't involve unstructured data. In all of them,
the client sends one command line, and expects (depending on the
command) either one line of response, or several lines of response
ending with "END" on the last line.

A command line always starts with the name of the command, followed by
parameters (if any) delimited by whitespace. Command names are
lower-case and are case-sensitive.

Expiration times
----------------

Some commands involve a client sending some kind of expiration time
(relative to an item or to an operation requested by the client) to
the server. In all such cases, the actual value sent may either be
Unix time (number of seconds since January 1, 1970, as a 32-bit
value), or a number of seconds starting from current time. In the
latter case, this number of seconds may not exceed 60*60*24*30 (number
of seconds in 30 days); if the number sent by a client is larger than
that, the server will consider it to be real Unix time value rather
than an offset from current time.
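The 30-day rule is easy to get wrong, so here is a small Python sketch of how a value sent as an expiration time would be interpreted. The function name is hypothetical, and the treatment of 0 as "never expires" comes from the description of the storage commands further down.

```python
THIRTY_DAYS = 60 * 60 * 24 * 30


def absolute_expiry(exptime, now):
    """Unix time at which an item expires, or None if it never expires."""
    if exptime == 0:
        return None                # 0 means the item never expires
    if exptime <= THIRTY_DAYS:
        return now + exptime       # small values are offsets from the current time
    return exptime                 # larger values are absolute Unix times


now = 1_237_000_000                # an arbitrary "current time" for the example
print(absolute_expiry(60, now), absolute_expiry(THIRTY_DAYS + 1, now))
```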

Error strings
-------------

Each command sent by a client may be answered with an error string
from the server. These error strings come in three types:

- "ERROR\r\n"

means the client sent a nonexistent command name.

- "CLIENT_ERROR <error>\r\n"

means some sort of client error in the input line, i.e. the input
doesn't conform to the protocol in some way. <error> is a
human-readable error string.

- "SERVER_ERROR <error>\r\n"

means some sort of server error prevents the server from carrying
out the command. <error> is a human-readable error string. In cases
of severe server errors, which make it impossible to continue
serving the client (this shouldn't normally happen), the server will
close the connection after sending the error line. This is the only
case in which the server closes a connection to a client.

In the descriptions of individual commands below, these error lines
are not again specifically mentioned, but clients must allow for their
possibility.
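Since any command can be answered with one of these three lines, a client typically checks for them before interpreting a reply. A minimal Python sketch follows; the function name, the returned labels and the sample error message are choices of this example.

```python
def classify_reply(line):
    """Split a reply line (bytes, CRLF included) into (kind, message)."""
    text = line.rstrip(b"\r\n").decode("ascii", errors="replace")
    if text == "ERROR":
        return "unknown_command", ""
    for prefix, kind in (("CLIENT_ERROR", "client_error"),
                         ("SERVER_ERROR", "server_error")):
        if text == prefix or text.startswith(prefix + " "):
            # Everything after the prefix is the human-readable message.
            return kind, text[len(prefix):].strip()
    return "ok", text


print(classify_reply(b"ERROR\r\n"))
print(classify_reply(b"CLIENT_ERROR bad data chunk\r\n"))
print(classify_reply(b"STORED\r\n"))
```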

Storage commands
----------------
First, the client sends a command line which looks like this:

<command name> <key> <flags> <exptime> <bytes> [noreply]\r\n
cas <key> <flags> <exptime> <bytes> <cas unique> [noreply]\r\n

- <command name> is "set", "add", "replace", "append" or "prepend"

"set" means "store this data".

"add" means "store this data, but only if the server *doesn't* already
hold data for this key".

"replace" means "store this data, but only if the server *does*
already hold data for this key".

"append" means "add this data to an existing key after existing data".

"prepend" means "add this data to an existing key before existing data".

The append and prepend commands do not accept flags or exptime.
They update existing data portions, and ignore new flag and exptime
settings.

"cas" is a check and set operation which means "store this data but
only if no one else has updated since I last fetched it."
(In other words, if A fetches the item and B then updates it, A's cas does not store.)

- <key> is the key under which the client asks to store the data

- <flags> is an arbitrary 16-bit unsigned integer (written out in
decimal) that the server stores along with the data and sends back
when the item is retrieved. Clients may use this as a bit field to
store data-specific information; this field is opaque to the server.
Note that in memcached 1.2.1 and higher, flags may be 32-bits, instead
of 16, but you might want to restrict yourself to 16 bits for
compatibility with older versions.

- <exptime> is expiration time. If it's 0, the item never expires
(although it may be deleted from the cache to make place for other
items). If it's non-zero (either Unix time or offset in seconds from
current time), it is guaranteed that clients will not be able to
retrieve this item after the expiration time arrives (measured by
server time).

- <bytes> is the number of bytes in the data block to follow, *not*
including the delimiting \r\n. <bytes> may be zero (in which case
it's followed by an empty data block).

- <cas unique> is a unique 64-bit value of an existing entry.
Clients should use the value returned from the "gets" command
when issuing "cas" updates.

- "noreply" optional parameter instructs the server to not send the
reply. NOTE: if the request line is malformed, the server can't
parse "noreply" option reliably. In this case it may send the error
to the client, and not reading it on the client side will break
things. Client should construct only valid requests.

After this line, the client sends the data block:

<data block>\r\n

- <data block> is a chunk of arbitrary 8-bit data of length <bytes>
from the previous line.

After sending the command line and the data block, the client awaits
the reply, which may be:

- "STORED\r\n", to indicate success.

- "NOT_STORED\r\n" to indicate the data was not stored, but not
because of an error. This normally means that either that the
condition for an "add" or a "replace" command wasn't met, or that the
item is in a delete queue (see the "delete" command below).

- "EXISTS\r\n" to indicate that the item you are trying to store with
a "cas" command has been modified since you last fetched it.

- "NOT_FOUND\r\n" to indicate that the item you are trying to store
with a "cas" command did not exist or has been deleted.
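Putting the pieces of this section together, a storage request is one command line followed by the data block. The Python sketch below builds those bytes and maps the four possible replies to short descriptions. The helper name, its argument order and the wording of the descriptions are assumptions of this example, and no network I/O is done.

```python
def build_storage_command(command, key, flags, exptime, data,
                          cas_unique=None, noreply=False):
    """Return the bytes for one storage request: command line + data block."""
    if command not in ("set", "add", "replace", "append", "prepend", "cas"):
        raise ValueError("unknown storage command: " + command)
    if (command == "cas") != (cas_unique is not None):
        raise ValueError("cas_unique is required for cas and only for cas")
    # The byte count is taken from the data itself, never typed by hand.
    parts = [command, key, str(flags), str(exptime), str(len(data))]
    if cas_unique is not None:
        parts.append(str(cas_unique))
    if noreply:
        parts.append("noreply")
    return " ".join(parts).encode("ascii") + b"\r\n" + data + b"\r\n"


STORAGE_REPLIES = {
    b"STORED\r\n": "stored",
    b"NOT_STORED\r\n": "not stored (condition not met)",
    b"EXISTS\r\n": "modified since last fetch",
    b"NOT_FOUND\r\n": "no such item",
}

print(build_storage_command("set", "greeting", 0, 0, b"hello"))
```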

Retrieval command:
------------------

The retrieval commands "get" and "gets" operate like this:

get <key>*\r\n
gets <key>*\r\n

- <key>* means one or more key strings separated by whitespace.

After this command, the client expects zero or more items, each of
which is received as a text line followed by a data block. After all
the items have been transmitted, the server sends the string

"END\r\n"

to indicate the end of response.

Each item sent by the server looks like this:

VALUE <key> <flags> <bytes> [<cas unique>]\r\n
<data block>\r\n

- <key> is the key for the item being sent

- <flags> is the flags value set by the storage command

- <bytes> is the length of the data block to follow, *not* including
its delimiting \r\n

- <cas unique> is a unique 64-bit integer that uniquely identifies
this specific item.

- <data block> is the data for this item.

If some of the keys appearing in a retrieval request are not sent back
by the server in the item list this means that the server does not
hold items with such keys (because they were never stored, or stored
but deleted to make space for more items, or expired, or explicitly
deleted by a client).
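A complete retrieval response can be parsed with a short loop: read a text line, and if it is a VALUE line, cut the data block by its byte count. The Python sketch below works on a buffer that already holds the whole response; the function name and the shape of the returned dictionary are choices of this example.

```python
def parse_get_response(buf):
    """Parse a complete "get"/"gets" response into {key: (flags, data, cas_unique)}."""
    items = {}
    pos = 0
    while True:
        eol = buf.index(b"\r\n", pos)          # text lines end at the first CRLF
        fields = buf[pos:eol].split()
        pos = eol + 2
        if fields == [b"END"]:
            return items
        if len(fields) not in (4, 5) or fields[0] != b"VALUE":
            raise ValueError("unexpected line: %r" % b" ".join(fields))
        key, flags, length = fields[1].decode("ascii"), int(fields[2]), int(fields[3])
        cas_unique = int(fields[4]) if len(fields) == 5 else None
        data = buf[pos:pos + length]           # data blocks are cut by length
        if len(data) != length or buf[pos + length:pos + length + 2] != b"\r\n":
            raise ValueError("truncated data block")
        items[key] = (flags, data, cas_unique)
        pos += length + 2


response = b"VALUE a 0 6\r\nab\r\ncd\r\nVALUE b 5 1 42\r\nx\r\nEND\r\n"
print(parse_get_response(response))
```

Keys that were requested but are missing from the result are simply cache misses, as the paragraph above explains.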

Deletion
--------

The command "delete" allows for explicit deletion of items:

delete <key> [<time>] [noreply]\r\n

- <key> is the key of the item the client wishes the server to delete

- <time> is the amount of time in seconds (or Unix time until which)
the client wishes the server to refuse "add" and "replace" commands
with this key. For this amount of time, the item is put into a
delete queue, which means that it won't be possible to retrieve it by
the "get" command, but "add" and "replace" commands with this key
will also fail (the "set" command will succeed, however). After the
time passes, the item is finally deleted from server memory.

The parameter <time> is optional, and, if absent, defaults to 0
(which means that the item will be deleted immediately and further
storage commands with this key will succeed).

- "noreply" optional parameter instructs the server to not send the
reply. See the note in Storage commands regarding malformed
requests.

The response line to this command can be one of:

- "DELETED\r\n" to indicate success

- "NOT_FOUND\r\n" to indicate that the item with this key was not
found.

See the "flush_all" command below for immediate invalidation
of all existing items.

Increment/Decrement
-------------------

Commands "incr" and "decr" are used to change data for some item
in-place, incrementing or decrementing it. The data for the item is
treated as decimal representation of a 64-bit unsigned integer. If the
current data value does not conform to such a representation, the
commands behave as if the value were 0. Also, the item must already
exist for incr/decr to work; these commands won't pretend that a
non-existent key exists with value 0; instead, they will fail.

The client sends the command line:

incr <key> <value> [noreply]\r\n

or

decr <key> <value> [noreply]\r\n

- <key> is the key of the item the client wishes to change

- <value> is the amount by which the client wants to increase/decrease
the item. It is a decimal representation of a 64-bit unsigned integer.

- "noreply" optional parameter instructs the server to not send the
reply. See the note in Storage commands regarding malformed
requests.

The response will be one of:

- "NOT_FOUND\r\n" to indicate the item with this value was not found

- <value>\r\n , where <value> is the new value of the item's data,
after the increment/decrement operation was carried out.

Note that underflow in the "decr" command is caught: if a client tries
to decrease the value below 0, the new value will be 0. Overflow in
the "incr" command will wrap around the 64 bit mark.

Note also that decrementing a number such that it loses length isn't
guaranteed to decrement its returned length. The number MAY be
space-padded at the end, but this is purely an implementation
optimization, so you also shouldn't rely on that.
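The overflow and underflow rules can be modelled with two one-line functions. This Python sketch only covers the arithmetic described above, on plain integers; it does not talk to a server, and the function names are made up.

```python
MASK_64 = (1 << 64) - 1


def incr(current, delta):
    """New value after "incr": overflow wraps around the 64-bit mark."""
    return (current + delta) & MASK_64


def decr(current, delta):
    """New value after "decr": underflow is caught and the value stops at 0."""
    return max(current - delta, 0)


print(incr(41, 1), incr(MASK_64, 1), decr(5, 10))
```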

Statistics
----------

The command "stats" is used to query the server about statistics it
maintains and other internal data. It has two forms. Without
arguments:

stats\r\n

it causes the server to output general-purpose statistics and
settings, documented below. In the other form it has some arguments:

stats <args>\r\n

Depending on <args>, various internal data is sent by the server. The
kinds of arguments and the data sent are not documented in this version
of the protocol, and are subject to change for the convenience of
memcache developers.

General-purpose statistics
--------------------------

Upon receiving the "stats" command without arguments, the server sends
a number of lines which look like this:

STAT <name> <value>\r\n

The server terminates this list with the line

END\r\n

In each line of statistics, <name> is the name of this statistic, and
<value> is the data. The following is the list of all names sent in
response to the "stats" command, together with the type of the value
sent for this name, and the meaning of the value.

In the type column below, "32u" means a 32-bit unsigned integer, "64u"
means a 64-bit unsigned integer. '32u:32u' means two 32-bit unsigned
integers separated by a colon.

Name                   Type     Meaning
----------------------------------------------------------------------
pid                    32u      Process id of this server process
uptime                 32u      Number of seconds this server has been running
time                   32u      current UNIX time according to the server
version                string   Version string of this server
pointer_size           32       Default size of pointers on the host OS
                                (generally 32 or 64)
rusage_user            32u:32u  Accumulated user time for this process
                                (seconds:microseconds)
rusage_system          32u:32u  Accumulated system time for this process
                                (seconds:microseconds)
curr_items             32u      Current number of items stored by the server
total_items            32u      Total number of items stored by this server
                                ever since it started
bytes                  64u      Current number of bytes used by this server
                                to store items
curr_connections       32u      Number of open connections
total_connections      32u      Total number of connections opened since
                                the server started running
connection_structures  32u      Number of connection structures allocated
                                by the server
cmd_get                64u      Cumulative number of retrieval requests
cmd_set                64u      Cumulative number of storage requests
get_hits               64u      Number of keys that have been requested and
                                found present
get_misses             64u      Number of items that have been requested
                                and not found
evictions              64u      Number of valid items removed from cache
                                to free memory for new items
bytes_read             64u      Total number of bytes read by this server
                                from network
bytes_written          64u      Total number of bytes sent by this server to
                                network
limit_maxbytes         32u      Number of bytes this server is allowed to
                                use for storage.
threads                32u      Number of worker threads requested.
                                (see doc/threads.txt)
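Because every statistics line has the same STAT name value shape, the whole response maps naturally onto a dictionary. The Python sketch below parses a complete response and derives a hit ratio from get_hits and get_misses. The function names and the sample numbers are invented for the example.

```python
def parse_stats(buf):
    """Turn a complete "stats" response into a dict of name -> value string."""
    stats = {}
    for line in buf.split(b"\r\n"):
        if line == b"END":
            return stats
        tag, name, value = line.decode("ascii").split(" ", 2)
        if tag != "STAT":
            raise ValueError("unexpected line: " + line.decode("ascii"))
        stats[name] = value
    raise ValueError("response did not end with END")


def hit_ratio(stats):
    """Fraction of requested keys that were found; 0.0 if nothing was requested."""
    hits, misses = int(stats["get_hits"]), int(stats["get_misses"])
    total = hits + misses
    return hits / total if total else 0.0


sample = (b"STAT pid 123\r\nSTAT get_hits 90\r\nSTAT get_misses 10\r\n"
          b"STAT rusage_user 0:250000\r\nEND\r\n")
stats = parse_stats(sample)
print(stats["pid"], hit_ratio(stats))
```

Values are kept as strings because types differ per statistic (for example the colon-separated rusage values).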

Item statistics
---------------
CAVEAT: This section describes statistics which are subject to change in the
future.

The "stats" command with the argument of "items" returns information about
item storage per slab class. The data is returned in the format:

STAT items:<slabclass>:<stat> <value>\r\n

The server terminates this list with the line

END\r\n

The slabclass aligns with class ids used by the "stats slabs" command. Where
"stats slabs" describes size and memory usage, "stats items" shows higher
level information.

The following item values are defined as of writing.

Name         Meaning
------------------------------------------------------------------
number       Number of items presently stored in this class. Expired
             items are not automatically excluded.
age          Age of the oldest item in the LRU.
evicted      Number of times an item had to be evicted from the LRU
             before it expired.
outofmemory  Number of times the underlying slab class was unable to
             store a new item. This means you are running with -M or
             an eviction failed.

Note this will only display information about slabs which exist, so an empty
cache will return an empty set.

Item size statistics
--------------------
CAVEAT: This section describes statistics which are subject to change in the
future.

The "stats" command with the argument of "sizes" returns information about the
general size and count of all items stored in the cache.
WARNING: This command WILL lock up your cache! It iterates over *every item*
and examines the size. While the operation is fast, if you have many items
you could prevent memcached from serving requests for several seconds.

The data is returned in the following format:

<size> <count>\r\n

The server terminates this list with the line

END\r\n

'size' is an approximate size of the item, within 32 bytes.
'count' is the amount of items that exist within that 32-byte range.

This is essentially a display of all of your items if there was a slab class
for every 32 bytes. You can use this to determine if adjusting the slab growth
factor would save memory overhead. For example: generating more classes in the
lower range could allow items to fit more snugly into their slab classes, if
most of your items are less than 200 bytes in size.

Slab statistics
---------------
CAVEAT: This section describes statistics which are subject to change in the
future.

The "stats" command with the argument of "slabs" returns information about
each of the slabs created by memcached during runtime. This includes per-slab
information along with some totals. The data is returned in the format:

STAT <slabclass>:<stat> <value>\r\n
STAT <stat> <value>\r\n

The server terminates this list with the line

END\r\n

Name             Meaning
----------------------------------------------------------------------
chunk_size       The amount of space each chunk uses. One item will use
                 one chunk of the appropriate size.
chunks_per_page  How many chunks exist within one page. A page by
                 default is one megabyte in size. Slabs are allocated per
                 page, then broken into chunks.
total_pages      Total number of pages allocated to the slab class.
total_chunks     Total number of chunks allocated to the slab class.
used_chunks      How many chunks have been allocated to items.
free_chunks      Chunks not yet allocated to items, or freed via delete.
free_chunks_end  Number of free chunks at the end of the last allocated
                 page.
active_slabs     Total number of slab classes allocated.
total_malloced   Total amount of memory allocated to slab pages.

Other commands
--------------

"flush_all" is a command with an optional numeric argument. It always
succeeds, and the server sends "OK\r\n" in response (unless "noreply"
is given as the last parameter). Its effect is to invalidate all
existing items immediately (by default) or after the expiration
specified. After invalidation none of the items will be returned in
response to a retrieval command (unless it's stored again under the
same key *after* flush_all has invalidated the items). flush_all
doesn't actually free all the memory taken up by existing items; that
will happen gradually as new items are stored. The most precise
definition of what flush_all does is the following: it causes all
items whose update time is earlier than the time at which flush_all
was set to be executed to be ignored for retrieval purposes.

The intent of flush_all with a delay, was that in a setting where you
have a pool of memcached servers, and you need to flush all content,
you have the option of not resetting all memcached servers at the
same time (which could e.g. cause a spike in database load with all
clients suddenly needing to recreate content that would otherwise
have been found in the memcached daemon).

The delay option allows you to have them reset in e.g. 10 second
intervals (by passing 0 to the first, 10 to the second, 20 to the
third, etc. etc.).
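The staggered reset described above amounts to sending each server a different delay. A minimal Python sketch that only builds the command lines (the function name and server names are hypothetical, and sending them over a connection is left out):

```python
def staggered_flush_commands(servers, interval=10):
    """Pair each server with a flush_all line whose delay grows by `interval` seconds."""
    return [(server, b"flush_all %d\r\n" % (i * interval))
            for i, server in enumerate(servers)]


for server, command in staggered_flush_commands(["cache-a", "cache-b", "cache-c"]):
    print(server, command)
```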

"version" is a command with no arguments:

version\r\n

In response, the server sends

"VERSION <version>\r\n", where <version> is the version string for the
server.

"verbosity" is a command with a numeric argument. It always succeeds,
and the server sends "OK\r\n" in response (unless "noreply" is given
as the last parameter). Its effect is to set the verbosity level of
the logging output.

"quit" is a command with no arguments:

quit\r\n

Upon receiving this command, the server closes the
connection. However, the client may also simply close the connection
when it no longer needs it, without issuing this command.

UDP protocol
------------

For very large installations where the number of clients is high enough
that the number of TCP connections causes scaling difficulties, there is
also a UDP-based interface. The UDP interface does not provide guaranteed
delivery, so should only be used for operations that aren't required to
succeed; typically it is used for "get" requests where a missing or
incomplete response can simply be treated as a cache miss.

Each UDP datagram contains a simple frame header, followed by data in the
same format as the TCP protocol described above. In the current
implementation, requests must be contained in a single UDP datagram, but
responses may span several datagrams. (The only common requests that would
span multiple datagrams are huge multi-key "get" requests and "set"
requests, both of which are more suitable to TCP transport for reliability
reasons anyway.)

The frame header is 8 bytes long, as follows (all values are 16-bit integers
in network byte order, high byte first):

0-1 Request ID
2-3 Sequence number
4-5 Total number of datagrams in this message
6-7 Reserved for future use; must be 0

The request ID is supplied by the client. Typically it will be a
monotonically increasing value starting from a random seed, but the client
is free to use whatever request IDs it likes. The server's response will
contain the same ID as the incoming request. The client uses the request ID
to differentiate between responses to outstanding requests if there are
several pending from the same server; any datagrams with an unknown request
ID are probably delayed responses to an earlier request and should be
discarded.

The sequence number ranges from 0 to n-1, where n is the total number of
datagrams in the message. The client should concatenate the payloads of the
datagrams for a given response in sequence number order; the resulting byte
stream will contain a complete response in the same format as the TCP
protocol (including terminating \r\n sequences).
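The 8-byte frame header maps directly onto four unsigned 16-bit integers in network byte order. The Python sketch below packs and unpacks the header with the standard `struct` module and reassembles a multi-datagram response. The function names are made up, and real code would also need sockets and timeouts.

```python
import struct

HEADER = struct.Struct("!HHHH")   # four 16-bit unsigned integers, high byte first


def pack_frame(request_id, seq, total, payload):
    # The last header field is reserved and must be 0.
    return HEADER.pack(request_id, seq, total, 0) + payload


def unpack_frame(datagram):
    request_id, seq, total, _reserved = HEADER.unpack_from(datagram)
    return request_id, seq, total, datagram[HEADER.size:]


def reassemble(datagrams, request_id):
    """Concatenate payloads for one request in sequence order; None if incomplete."""
    parts, total = {}, None
    for datagram in datagrams:
        rid, seq, count, payload = unpack_frame(datagram)
        if rid != request_id:
            continue              # unknown IDs are stale responses: discard them
        parts[seq], total = payload, count
    if total is None or set(parts) != set(range(total)):
        return None
    return b"".join(parts[i] for i in range(total))


frames = [
    pack_frame(7, 1, 2, b"END\r\n"),               # arrives out of order
    pack_frame(9, 0, 1, b"junk"),                  # belongs to another request
    pack_frame(7, 0, 2, b"VALUE k 0 1\r\nx\r\n"),
]
print(reassemble(frames, 7))
```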
