Understanding Ceph Logs Series: RocksDB
Preface
松鼠哥's professional Ceph course is now online!
Aimed at newcomers, it is hands-on from zero and covers Ceph installation, deployment, and operations end to end. If you are interested, contact 松鼠哥 to sign up:
This post is the RocksDB installment of the "Understanding Ceph Logs" series. I have written two earlier posts introducing and tuning RocksDB; understanding RocksDB's mechanisms and behavior is crucial for Ceph performance tuning. Those two posts stayed fairly shallow and did not dig into the underlying principles, so this post also serves as a supplement to them.
The RocksDB code itself is quite readable, Facebook's documentation is thorough, and the community is active, so learning to use it is not especially hard. It has its innate strengths and weaknesses; a deeper code walkthrough and discussion of its internals will come in later posts. Here we focus on reading the Ceph logs.
Phase 1 - flushing the immutable memtables
The first phase of Ceph's use of RocksDB is flushing immutable memtables down to level0. The relevant parameters are:

max_write_buffer_number - the maximum number of memtables; this can usually be set a bit larger, e.g. 16
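These are the upstream RocksDB option names; in a Ceph deployment they are normally passed through as a bluestore_rocksdb_options string rather than set in code. Below is a minimal sketch using the RocksDB C++ API with illustrative values: max_write_buffer_number and min_write_buffer_number_to_merge mirror the environment described in this article, while write_buffer_size is an assumed value for illustration only.

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;

  // Illustrative values; in Ceph these usually arrive via bluestore_rocksdb_options.
  opts.write_buffer_size = 256 << 20;          // size of one memtable (assumed value)
  opts.max_write_buffer_number = 16;           // active + immutable memtables kept in RAM
  opts.min_write_buffer_number_to_merge = 8;   // immutables merged together before one flush

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/rocksdb_flush_demo", &db);
  if (s.ok()) delete db;
  return s.ok() ? 0 : 1;
}
```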
Since this post is a supplement on RocksDB, let me spell out the OSD's RocksDB write path once more:

1. The OSD submits the write content to RocksDB. The submitted content falls into two cases:
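In either case, the commit itself is a RocksDB write batch that goes through the WAL before it is acknowledged. A rough sketch of that step follows; the keys and values are made-up placeholders, not the real BlueStore encoding.

```cpp
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

// Sketch of the commit step: mutations are grouped into a WriteBatch and
// submitted atomically; with sync=true the batch is persisted to the WAL
// before the call returns.
void commit_txn(rocksdb::DB* db) {
  rocksdb::WriteBatch batch;
  batch.Put("example_meta_key", "onode-like metadata ...");   // placeholder record
  batch.Put("example_data_key", "deferred write payload ...");// placeholder record
  rocksdb::WriteOptions wopts;
  wopts.sync = true;  // fsync the WAL so the commit survives a crash
  rocksdb::Status s = db->Write(wopts, &batch);
  (void)s;  // a real caller would handle the status
}

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  if (rocksdb::DB::Open(opts, "/tmp/rocksdb_write_demo", &db).ok()) {
    commit_txn(db);
    delete db;
  }
}
```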
Let's observe RocksDB's flush process through the logs:
```
2018-12-19 04:06:21.908053 7fa0d7bd9700 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/db_impl_write.cc:684] reusing log 357225 from recycle list
```
The log shows that the first memtable started filling at 04:06:21.908053; by 04:26:33.561907 there were already 8 immutable memtables, a span of 20 minutes and 12 seconds. In this environment min_write_buffer_number_to_merge=8, so the flush condition was met. Along the way an existing WAL file was reused, saving the cost of creating a new one.
```
2018-12-19 04:26:33.562723 7fa0e23ee700 4 rocksdb: (Original Log Time 2018/12/19-04:26:33.562387) [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:1158] Calling FlushMemTableToOutputFile with column family [default], flush slots available 8, compaction slots allowed 1, compaction slots scheduled 1
```
RocksDB merges the immutable memtables in memory in order; once the merge is done, the corresponding log (WAL) files are put on the recycle list and, after a while, they get recycled.
At this point the flush of the immutable memtables is complete. The log shows the lsm_state as [1, 12, 118, 1117, 2993, 0, 0], the number of sst files at level0, level1, ..., level6 respectively. Level0 now holds exactly 1 file, 61443510 bytes in size, numbered 357351.
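If you want to watch the same state without grepping the log, RocksDB exposes it through DB::GetProperty. A small sketch, assuming an already-open rocksdb::DB* handle; the property names are standard RocksDB ones:

```cpp
#include <rocksdb/db.h>
#include <iostream>
#include <string>

// Reads the same counters the log prints: how many immutable memtables are
// waiting to be flushed, and how many sst files each level holds (the
// lsm_state array in the log).
void dump_lsm_state(rocksdb::DB* db) {
  std::string v;
  db->GetProperty("rocksdb.num-immutable-mem-table", &v);
  std::cout << "immutable memtables: " << v << "\n";
  for (int level = 0; level <= 6; ++level) {
    db->GetProperty("rocksdb.num-files-at-level" + std::to_string(level), &v);
    std::cout << "L" << level << ": " << v << " files\n";
  }
}
```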
Phase 2 - compaction between levels
After an sst is flushed from the memtable to level0, RocksDB immediately checks every level's score, picks the level with the highest score, and decides whether that level needs compaction; if the score qualifies, the compaction of that level starts.
First, an erratum. In ceph性能调优历程-rocksdb篇(2) I discussed the trigger condition for the level0 -> level1 compaction and claimed that only meeting level0_file_num_compaction_trigger would trigger it. In fact there is a second condition: the compaction is also triggered when level0's total size exceeds max_bytes_for_level_base. That post has since been corrected. Facebook's official description is as follows:
Compaction Picking
When multiple levels trigger the compaction condition, RocksDB needs to pick which level to compact first. A score is generated for each level:
For non-zero levels, the score is the total size of the level divided by the target size. If there are already files picked that are being compacted into the next level, the size of those files is not included into the total size, because they will soon go away.
For level-0, the score is the total number of files, divided by level0_file_num_compaction_trigger, or total size over max_bytes_for_level_base, whichever is larger. (if the number of files is smaller than level0_file_num_compaction_trigger, compaction won't trigger from level 0, no matter how big the score is)
We compare the score of each level, and the level with highest score takes the priority to compact.
Source: Leveled Compaction
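To make the rules above concrete, here is a small transcription of that scoring logic. This is an illustrative sketch, not the actual RocksDB source; the parameter names match the RocksDB options referenced in the quote.

```cpp
#include <cstdint>
#include <algorithm>

// Score of one level, per the picking rules quoted above.
// Note: the separate gate that level-0 only compacts once its file count
// reaches level0_file_num_compaction_trigger is applied outside this score.
double level_score(int level,
                   int num_files, uint64_t total_bytes,
                   uint64_t bytes_being_compacted,        // already picked for Ln -> Ln+1
                   int level0_file_num_compaction_trigger,
                   uint64_t max_bytes_for_level_base,
                   uint64_t target_bytes_for_level) {
  if (level == 0) {
    double by_count = double(num_files) / level0_file_num_compaction_trigger;
    double by_size  = double(total_bytes) / max_bytes_for_level_base;
    return std::max(by_count, by_size);
  }
  // Non-zero levels: remaining size over target size.
  return double(total_bytes - bytes_being_compacted) / target_bytes_for_level;
}
```

The level with the highest score is compacted first, which is exactly what the "max score" field in the log below reports.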
Picking up from the logs above:
```
2018-12-19 04:26:35.185618 7fa0e23ee700 4 rocksdb: (Original Log Time 2018/12/19-04:26:35.185372) [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:132] [default] Level summary: base level 1 max bytes base 375809638 files[1 12 118 1117 2993 0 0] max score 1.83
```
This is the summary for the compaction decision. The highest score, 1.83, belongs to level0, so it is compacted first. "Compacting 1@0 + 12@1 files to L1" means 1 file is taken from level0 and 12 files from level1 and compacted into L1, and the "Compaction start summary" then lists the sizes of the participating files in detail. In this environment each sst file is configured at 32MB. Note that the level0 input file is exactly the sst numbered 357351 that was flushed a moment ago.
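For reference, the numbers in that Level summary map onto the following upstream options. This is only a sketch: max_bytes_for_level_base and the 32MB file size are taken from the log and the article's environment, while level0_file_num_compaction_trigger is shown at its upstream default because the log does not reveal it.

```cpp
#include <rocksdb/options.h>

// Options visible in the log line above (values copied from the log and the
// article's environment, not a recommendation).
rocksdb::Options make_compaction_opts() {
  rocksdb::Options opts;
  opts.max_bytes_for_level_base = 375809638;    // "max bytes base 375809638": target size of the base level (L1)
  opts.target_file_size_base    = 32 << 20;     // per-sst target size, ~32MB as described in this article
  opts.level0_file_num_compaction_trigger = 4;  // assumed upstream default; works together with the size rule
  return opts;
}
```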
```
2018-12-19 04:26:35.836924 7fa0bab9f700 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1116] [default] [JOB 4780] Generated table #357353: 97370 keys, 34467695 bytes
```
Reading the sst files of level0 and level1, then merging and sorting them, produced new sst files numbered 357352 - 357365, 14 in total, so level1 ends up with 14 files and the lsm_state becomes **[0, 14, 118, 1117, 2993, 0, 0]**; once the compaction finishes the old level1 files are deleted. The input and output sizes for this compaction are in(58.6, 312.4) out(371.0), giving read-write-amplify(12.7) write-amplify(6.3). One more aside: the log for this level0/level1 compaction shows "num_subcompactions": 4; this value can be nudged up a bit to speed this step up.
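The two amplification figures can be reproduced from the in/out numbers in the compaction summary. A quick check, using the formula that fits these numbers (total bytes read and written over the bytes coming from the upper level, and bytes written over the bytes coming from the upper level, respectively):

```cpp
#include <cstdio>

// Reproducing the two amplification figures from the compaction summary above.
// in(58.6, 312.4) out(371.0): MB read from the upper level (L0), MB read from
// the lower level (L1), and MB written out.
int main() {
  double in_upper = 58.6, in_lower = 312.4, out = 371.0;
  double rw_amp = (in_upper + in_lower + out) / in_upper;  // ~12.7
  double w_amp  = out / in_upper;                          // ~6.3
  std::printf("read-write-amplify %.1f, write-amplify %.1f\n", rw_amp, w_amp);
}
```

As for the subcompaction hint, the upstream option behind the "num_subcompactions" figure is max_subcompactions, which in Ceph would likewise be passed through the rocksdb options string.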
```
2018-12-19 04:26:37.255854 7fa0e3bf1700 4 rocksdb: (Original Log Time 2018/12/19-04:26:37.250949) [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:1509] [default] Moving #357352 to level-2 37746293 bytes
```
After the level0/level1 compaction, RocksDB finds an sst file that can go straight into level2, so instead of compacting it simply moves the file down a level. This commonly happens when the next level is empty, and can occasionally happen when the next level is not empty as well. After the move, level1 is down to 13 files and level2 is up to 119; level2's score rises above 1, to 1.15, which will trigger a compaction between level2 and level3.
```
2018-12-19 04:26:37.257102 7fa0e3bf1700 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1403] [default] [JOB 4782] Compacting 1@2 + 6@3 files to L3, score 1.15
```
Similarly, RocksDB picked one file from level2 and six files from level3 for compaction, producing 7 files, so level3 gains one file after the compaction and, once the old sst files are deleted, level2 loses one. The lsm_state is now **[0, 13, 118, 1118, 2993, 0, 0]**, and this compaction's amplification is read-write-amplify(6.1) write-amplify(3.0).
```
2018-12-19 04:26:38.926154 7fa0e2bef700 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1403] [default] [JOB 4783] Compacting 1@1 + 11@2 files to L2, score 1.04
```
Next, level1's score comes out at 1.04, triggering another compaction. Worth noting: this time the amplification is read-write-amplify(132.7) write-amplify(66.4), which starts to look dramatic. Since the denominator is the data coming from level1, a single small level1 file that overlaps 11 level2 files inflates the ratio badly. After the compaction the lsm_state is [0, 12, 119, 1118, 2993, 0, 0].
```
2018-12-19 04:26:40.815612 7fa0e2bef700 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1403] [default] [JOB 4784] Compacting 1@3 + 4@4 files to L4, score 1.01
```
Then a level3 -> level4 compaction, with a score of 1.01.
Summary
RocksDB's compaction process is, in itself, fairly simple; at its core it is an LSM tree. Once you understand exactly what triggers each compaction and what drawbacks each stage introduces, you can, to a reasonable extent, adapt to your actual environment through hardware choices and parameter tuning. For example, compaction generates very high IO; as long as WAL and memtable writes stay smooth during that window, client writes are not affected.
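If compaction IO does threaten WAL and memtable latency, one knob worth knowing is RocksDB's rate limiter, which caps the write rate of flush and compaction. This is only a sketch; the 100MB/s figure is an illustration, not a recommendation for any particular hardware.

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/rate_limiter.h>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  // Cap the bytes/second flush and compaction may write, leaving headroom
  // for foreground WAL/memtable traffic. Value below is illustrative only.
  opts.rate_limiter.reset(rocksdb::NewGenericRateLimiter(100 << 20 /* bytes per second */));

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/rocksdb_ratelimit_demo", &db);
  if (s.ok()) delete db;
  return s.ok() ? 0 : 1;
}
```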