What Is fio Actually Doing?

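fio is a common I/O performance testing tool. It is powerful, covering everything from file systems down to raw disks, but it also has a huge number of parameters: the latest version at the time of writing (3.27) defines at least 237 top-level options, and once you factor in the possible values of each one, the combinations can leave you thoroughly lost. Fortunately we don't need to know them all. In practice it is enough to find a ready-made command in a script library and Ctrl+C / Ctrl+V it (at least that's what I do).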

For example, here is a command template I use often:

fio --directory=/mnt/fio/ --ramp_time=60 --nrfiles=10 --size=2560m --thread --group_reporting --direct=1 --ioengine=libaio --bs=4k --rw=randrw --iodepth=64 --iodepth_batch_submit=8 --iodepth_batch_complete=8 --name='randrw_perfile[256m]_bs[4k]_iodepth[64]' --time_based=1 --runtime=$((10*60)) --continue_on_error=none
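
The job name in this template mentions 256m per file, which follows from the other options: fio's documentation says the job's --size is divided among its files unless --filesize is given. A minimal sketch of that arithmetic, assuming binary units (fio's default kb_base) and no --filesize override; the helper names parse_size and per_file_bytes are made up for illustration:

```python
# Sketch: how --size=2560m is shared across --nrfiles=10.
# Assumption: sizes use binary units (k = 1024 bytes) and no --filesize is set.

UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}

def parse_size(text: str) -> int:
    """Parse a size such as '2560m' or '4k' into bytes."""
    text = text.strip().lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)

def per_file_bytes(size: str, nrfiles: int) -> int:
    """Bytes per file when the job size is split evenly among its files."""
    return parse_size(size) // nrfiles

per_file = per_file_bytes("2560m", 10)
print(per_file // 1024**2, "MiB per file")            # 256 MiB per file
print(per_file // parse_size("4k"), "blocks of 4k")   # 65536 blocks of 4k
```

Every one of those blocks has to be laid out before the benchmark starts, which is why the preparation phase discussed next can take so long.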

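But over repeated performance tests I noticed fio being unreasonably slow: it often gets stuck in a phase called Laying out IO file(s), which sometimes takes even longer than the benchmark itself. I can't stand that kind of waste, so I got curious: what is fio actually doing?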

Laying out IO file(s)

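When you mount a brand-new directory and start fio, it prints Laying out IO file(s) and then appears to hang for several minutes. It looks like it is preparing some files. Exactly what it does can be observed with strace: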

strace -f -o /tmp/fio.strace fio ...

2998863 fallocate(3, 0, 0, 1048576)     = -1 EOPNOTSUPP (Operation not supported)
2998863 pwrite64(3, "\0", 1, 4095)      = 1
2998863 pwrite64(3, "\0", 1, 8191)      = 1
2998863 pwrite64(3, "\0", 1, 12287)     = 1
// ... 256 pwrite64 calls in total
2998863 pwrite64(3, "\0", 1, 1048575)   = 1
2998863 ftruncate(3, 1048576)           = 0
2998863 write(3, "\303\214\344\234\371\216v\v\230\21;\355l\3\207\0363b\333\24\320iB\27F\354\201Q\314B\245\f"..., 4096) = 4096
2998863 write(3, "P\246\256 ]\255\"D\312\324\355PT\305\5\n\231\272\270\2373\220\234\5S\227\252\26*\302\210\0"..., 4096) = 4096
2998863 write(3, "\331K\250\225w\224xe{\211\310\314\272G\"\23/\221\333\24\223\342#\36%\362\3\232\344\361\3\v"..., 4096) = 4096
// ... 256 write calls in total
2998863 write(3, "F\364\251\310\340\266-Z\210>\362\36\207R?\4\321G\232\304\327Z\374\n\372\310\252\324\r\251j\n"..., 4096) = 4096

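Note that -f is required here: fio creates worker threads and runs the benchmark in them. Without -f we would only see what the main thread does and miss the I/O that actually happens.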

Open the strace log and you can see several key I/O calls before the real benchmark starts:

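1. fallocate is called and fails with EOPNOTSUPP, because the underlying file system does not support it yet.
2. Each block is then written in order, one byte per block, until the target size is reached.
3. Once that pass is finished, ftruncate is called.
4. The file is then written again from the start, but this time every block is filled completely with random content (how random depends on fio's buffer-related options).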
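
The sequence of calls described above can be pulled out of a long strace log mechanically. The following is a rough sketch, not a general strace parser: it assumes the `PID syscall(args) = ret` line shape shown in the excerpt, ignores `<... resumed>` continuation lines, and the function name phases is made up for illustration.

```python
import re
from itertools import groupby

# Matches "PID syscall(" at the start of a line, as in the excerpt above.
SYSCALL_RE = re.compile(r"^\d+\s+(\w+)\(")

def phases(lines):
    """Collapse consecutive identical syscalls into (name, count) runs."""
    names = [m.group(1) for m in map(SYSCALL_RE.match, lines) if m]
    return [(name, sum(1 for _ in run)) for name, run in groupby(names)]

# Shortened sample in the same shape as the log above (payloads abbreviated).
SAMPLE = r'''
2998863 fallocate(3, 0, 0, 1048576)     = -1 EOPNOTSUPP (Operation not supported)
2998863 pwrite64(3, "\0", 1, 4095)      = 1
2998863 pwrite64(3, "\0", 1, 8191)      = 1
2998863 ftruncate(3, 1048576)           = 0
2998863 write(3, "..."..., 4096) = 4096
2998863 write(3, "..."..., 4096) = 4096
'''.splitlines()

for name, count in phases(SAMPLE):
    print(f"{name:10s} x{count}")
```

On the full log this would be expected to show one fallocate, a run of 256 pwrite64 calls, one ftruncate, and a run of 256 write calls, assuming no other syscalls are interleaved.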

So the so-called Laying out IO files is just creating files filled with random data. That's it?

But why is the file written twice? Step 4 alone would be enough, so why bother with step 2 and make us wait twice as long?

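This puzzled me for a long time, until I stumbled on a likely explanation: it is probably glibc's doing. When the underlying file system does not support fallocate, glibc's posix_fallocate appears to emulate it by writing a \0 into every block, so that the file system actually reserves the requested amount of space.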

// https://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html#96
/* Write a null byte to every block */
for (offset += (len - 1) % increment; len > 0; offset += increment)
{
    len -= increment;
    // ...
    if (__pwrite (fd, "", 1, offset) != 1)
        return errno;
}
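
To check that this loop really explains the pwrite64 offsets in the strace log, here is a small Python re-enactment of its arithmetic. It is a sketch under assumptions: increment is taken to be 4096 (the block size implied by the trace), and the elided `// ...` part of the loop is ignored. The name fallback_offsets is made up for illustration.

```python
def fallback_offsets(offset: int, length: int, increment: int = 4096):
    """Offsets that the quoted loop writes one null byte to."""
    offsets = []
    offset += (length - 1) % increment   # land on the last byte of the first block
    while length > 0:
        length -= increment
        offsets.append(offset)           # corresponds to __pwrite(fd, "", 1, offset)
        offset += increment
    return offsets

# Same arguments as the failed call in the trace: fallocate(3, 0, 0, 1048576)
offs = fallback_offsets(0, 1048576)
print(len(offs), offs[:3], offs[-1])     # 256 [4095, 8191, 12287] 1048575
```

The result lines up with the trace: 256 one-byte writes at 4095, 8191, 12287, and so on up to 1048575.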

So to speed up Laying out IO file(s), there are roughly two options:

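Either keep the test files around instead of deleting them, so that fio can reuse them.
Or pass --fallocate=none to tell fio not to preallocate with fallocate at all, which avoids the extra sequential write pass triggered in glibc and saves about half the time.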
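
If the command line is generated by a script, the second option is a one-line change. A minimal sketch, assuming the options are kept in a dict; the helper fio_argv and the shortened job name randrw are hypothetical, and nothing here actually runs fio:

```python
import shlex

def fio_argv(options: dict) -> list:
    """Turn {'bs': '4k', 'thread': True} into ['fio', '--bs=4k', '--thread']."""
    argv = ["fio"]
    for key, value in options.items():
        # Flags stored as True become bare switches; everything else is key=value.
        argv.append(f"--{key}" if value is True else f"--{key}={value}")
    return argv

base = {
    "directory": "/mnt/fio/", "nrfiles": 10, "size": "2560m", "thread": True,
    "direct": 1, "ioengine": "libaio", "bs": "4k", "rw": "randrw",
    "iodepth": 64, "name": "randrw",  # shortened, hypothetical job name
}

# Add --fallocate=none on top of the usual template.
argv = fio_argv({**base, "fallocate": "none"})
print(shlex.join(argv))
```

The resulting list could be handed to subprocess.run as-is. Keeping the files under the --directory path between runs covers the first option.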

IO Pattern

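fio's --rw option supports several I/O modes, such as the usual sequential and random reads and writes. But what do the I/O requests issued during the actual benchmark look like? How random are they, and do they cover the file's entire address space?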

strace can answer this question too. Here is an excerpt from the latter part of the log:

2998936 io_submit(0x7f24bdae7000, 1, [{pread, fildes=3, buf=0x7f2484010000, nbytes=4096, offset=61440}]) = 1
2998936 io_submit(0x7f24bdae7000, 1, [{pwrite, fildes=3, str="4\371\3152\217F\34\212&\277\277\351j\357g\1\344\367\244}\r\211\1\21\374\236\3023Or\325\33"..., nbytes=4096, offset=774144}]) = 1
2998936 io_submit(0x7f24bdae7000, 1, [{pwrite, fildes=3, str="\0\32\310\207^>\33\\@\3\371\303\347#T\2h \337v\200\242\26\25\r\344\247~\340\230:\32"..., nbytes=4096, offset=880640}]) = 1
2998936 io_submit(0x7f24bdae7000, 1, [{pread, fildes=3, buf=0x7f248400d000, nbytes=4096, offset=491520}]) = 1
2998936 io_submit(0x7f24bdae7000, 1, [{pwrite, fildes=3, str="|\263\316\244`\5\255Io\326\333\272DN\351\37\315\372#\254Z\314#\nY\377\35Xy#r\7"..., nbytes=4096, offset=417792}]) = 1
2998936 io_submit(0x7f24bdae7000, 1, [{pwrite, fildes=3, str="X4o\263\207U\221x\213\346A\\\371P\266\34\0\0\f\0\0\0\0\0\232\327\237h^\307\274\33"..., nbytes=4096, offset=786432}]) = 1
2998936 io_submit(0x7f24bdae7000, 1, [{pwrite, fildes=3, str="\\\376HS=\224\353\17\313\37;+\343\310b\5\371\343\201\325~\303\7\4\177\274\263\350\256\215P\20"..., nbytes=4096, offset=397312}]) = 1
2998936 io_submit(0x7f24bdae7000, 1, [{pread, fildes=3, buf=0x7f2484009000, nbytes=4096, offset=368640}]) = 1
2998936 io_submit(0x7f24bdae7000, 1, [{pwrite, fildes=3, str="\23\211R`R\246Rh\"\321\200\7\241$\246\20$\32_\270\323@2\7D\343\371\351\352kx\36"..., nbytes=4096, offset=954368}]) = 1
2998936 io_submit(0x7f24bdae7000, 1, [{pwrite, fildes=3, str="\364\264\206\255\246\21\3\247\236\3266{q\v\271\21\323\332\27\304\322cP\vZ{\31kn\252\205\22"..., nbytes=4096, offset=69632}]) = 1
2998936 io_submit(0x7f24bdae7000, 1, [{pwrite, fildes=3, str="|\331\327\237\375\332'\215/\373<\307s+\212\7e\37P\373\317a\300\t\354\203\267\357\221\16\323\26"..., nbytes=4096, offset=892928}]) = 1
2998936 io_submit(0x7f24bdae7000, 1, [{pwrite, fildes=3, str="\24\234#l8SyK\202sz?U\364^\17pN.n\315>\317\27\316\311\215\206\3020\243\f"..., nbytes=4096, offset=679936}]) = 1
2998936 io_submit(0x7f24bdae7000, 1, [{pwrite, fildes=3, str="\223\214'\242\326\251\345L\222q\373\r'd\366\0032n\366(\207E\351\f\306\315\5\2565\224\267\35"..., nbytes=4096, offset=970752}]) = 1
2998936 io_submit(0x7f24bdae7000, 1, [{pwrite, fildes=3, str=",%!\364JNk\224\245$\356\313\270\357\233\24\224D+\7\0\30\372\5\222h\233^j?\313\37"..., nbytes=4096, offset=258048}]) = 1
2998936 io_submit(0x7f24bdae7000, 1, [{pread, fildes=3, buf=0x7f2484002000, nbytes=4096, offset=409600}]) = 1
2998936 io_submit(0x7f24bdae7000, 1, [{pread, fildes=3, buf=0x7f2484001000, nbytes=4096, offset=585728}]) = 1
2998936 io_getevents(0x7f24bdae7000, 1, 1, [{data=0, obj=0x7f24840153e0, res=4096, res2=0}], NULL) = 1
2998936 io_submit(0x7f24bdae7000, 1, [{pread, fildes=3, buf=0x7f248400f000, nbytes=4096, offset=458752}]) = 1
2998936 io_getevents(0x7f24bdae7000, 1, 1, [{data=0, obj=0x7f2484015120, res=4096, res2=0}], NULL) = 1
2998936 io_submit(0x7f24bdae7000, 1, [{pread, fildes=3, buf=0x7f248400e000, nbytes=4096, offset=839680}]) = 1

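Since our ioengine is libaio, I/O requests are issued through io_submit. We can see fio first issue a burst of io_submit calls (16 of them in this excerpt), then call io_getevents to wait for just one completion at a time, and immediately issue another io_submit after each completion. This keeps exactly 16 requests in flight, matching the --iodepth=16 set for this run 👏🏻.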
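
The in-flight count can also be computed from the strace log instead of counted by eye. The sketch below adds the return value of each io_submit and subtracts the return value of each io_getevents. It assumes a single AIO context, a trace captured from the start of the job, and the line shape shown above; max_in_flight is a made-up helper name.

```python
import re

# Captures the syscall name and its return value at the end of the line.
CALL_RE = re.compile(r"^\d+\s+(io_submit|io_getevents)\(.*\)\s+=\s+(-?\d+)")

def max_in_flight(lines):
    """Peak number of submitted-but-not-yet-reaped requests."""
    in_flight = peak = 0
    for line in lines:
        m = CALL_RE.match(line)
        if not m:
            continue
        name, ret = m.group(1), int(m.group(2))
        if ret <= 0:          # failed call or nothing submitted/reaped
            continue
        in_flight += ret if name == "io_submit" else -ret
        peak = max(peak, in_flight)
    return peak

# Synthetic trace in the same shape as the excerpt: 16 submits, then reap/submit pairs.
submit = "2998936 io_submit(0x7f24bdae7000, 1, [{pread, fildes=3, nbytes=4096, offset=0}]) = 1"
reap = "2998936 io_getevents(0x7f24bdae7000, 1, 1, [{data=0, res=4096, res2=0}], NULL) = 1"
trace = [submit] * 16 + [reap, submit, reap, submit]
print(max_in_flight(trace))   # 16
```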

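But this log is a bit awkward for studying the I/O pattern, because the strace output is interleaved with other system calls. I then found another fio option, --write_iolog, which dumps the I/O operations directly and gives a quantitative picture: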

Sequential read: read

/mnt/fio/out.fio open
/mnt/fio/out.fio read 0 4096
/mnt/fio/out.fio read 4096 4096
/mnt/fio/out.fio read 8192 4096
// ... 256 reads in total
/mnt/fio/out.fio read 1044480 4096
/mnt/fio/out.fio close

This looks as expected: a sequential read reads every block once, in order.
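
Rather than eyeballing hundreds of lines, the "in order" claim can be checked with a few lines of Python. This is a sketch that assumes the iolog record shape shown above (`filename action offset length`) and file names without spaces; parse_iolog and is_sequential are made-up helper names.

```python
def parse_iolog(lines):
    """Yield (action, offset, length) for each read/write record."""
    for line in lines:
        parts = line.split()
        if len(parts) == 4 and parts[1] in ("read", "write"):
            yield parts[1], int(parts[2]), int(parts[3])

def is_sequential(records):
    """True if every record starts exactly where the previous one ended."""
    expected = None
    for _, offset, length in records:
        if expected is not None and offset != expected:
            return False
        expected = offset + length
    return True

seq_log = """\
/mnt/fio/out.fio open
/mnt/fio/out.fio read 0 4096
/mnt/fio/out.fio read 4096 4096
/mnt/fio/out.fio read 8192 4096
/mnt/fio/out.fio close
""".splitlines()

print(is_sequential(parse_iolog(seq_log)))   # True
```

The same check returns False as soon as two consecutive offsets are not adjacent, which is what the random modes in the following sections produce.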

Sequential write: write

/mnt/fio/out.fio open
/mnt/fio/out.fio write 0 4096
/mnt/fio/out.fio write 4096 4096
/mnt/fio/out.fio write 8192 4096
// ... 256 writes in total
/mnt/fio/out.fio write 1044480 4096
/mnt/fio/out.fio close

Same behavior as above.

Random read: randread

/mnt/fio/out.fio add
/mnt/fio/out.fio open
/mnt/fio/out.fio read 61440 4096
/mnt/fio/out.fio read 774144 4096
/mnt/fio/out.fio read 880640 4096
// ... 256 reads in total
/mnt/fio/out.fio read 782336 4096
/mnt/fio/out.fio close

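A random read picks a random offset each time and reads one block. Tallying the log shows that fio visits every block of the file exactly once, no more and no less. This matches fio's documented default: unless norandommap is set, random I/O covers every block of the file.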
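
The tally mentioned above can be reproduced with a short script. It is a sketch assuming the same `filename action offset length` record shape, block-aligned offsets, and a known file size; the name coverage is made up for illustration.

```python
from collections import Counter

def coverage(lines, file_size, block_size=4096):
    """Return (missed_blocks, repeated_blocks) for the read/write records."""
    hits = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) == 4 and parts[1] in ("read", "write"):
            hits[int(parts[2]) // block_size] += 1
    missed = [b for b in range(file_size // block_size) if hits[b] == 0]
    repeated = sorted(b for b, n in hits.items() if n > 1)
    return missed, repeated

# Toy log for a 4-block file, offsets in shuffled order.
toy_log = [
    "/mnt/fio/out.fio read 8192 4096",
    "/mnt/fio/out.fio read 0 4096",
    "/mnt/fio/out.fio read 12288 4096",
    "/mnt/fio/out.fio read 4096 4096",
]
print(coverage(toy_log, file_size=16384))   # ([], [])
```

Two empty lists mean every block was touched exactly once. For the 1 MiB test file in this post, file_size would be 1048576.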

Random write: randwrite

/mnt/fio/out.fio add
/mnt/fio/out.fio open
/mnt/fio/out.fio write 61440 4096
/mnt/fio/out.fio write 774144 4096
/mnt/fio/out.fio write 880640 4096
// ... 256 writes in total
/mnt/fio/out.fio write 782336 4096
/mnt/fio/out.fio close

Same behavior as above.

Mixed random read/write: randrw

/mnt/fio/out.fio add
/mnt/fio/out.fio open
/mnt/fio/out.fio read 61440 4096
/mnt/fio/out.fio write 774144 4096
/mnt/fio/out.fio write 880640 4096
/mnt/fio/out.fio read 491520 4096
/mnt/fio/out.fio write 417792 4096
/mnt/fio/out.fio write 786432 4096
/mnt/fio/out.fio write 397312 4096
/mnt/fio/out.fio read 368640 4096
// ...

/mnt/fio/out.fio write 745472 4096
/mnt/fio/out.fio read 749568 4096
/mnt/fio/out.fio write 765952 4096
/mnt/fio/out.fio read 782336 4096
/mnt/fio/out.fio close

Reads and writes are interleaved, and each offset is read or written exactly once.
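
How the reads and writes are mixed can be quantified from the same log. The sketch below counts both kinds of record and reports their shares. It assumes the record shape shown above, and rw_mix is a made-up helper name. fio's documentation lists rwmixread (50 by default) as the knob for this ratio, so a full log would be expected to come out near an even split, while a short excerpt can deviate.

```python
from collections import Counter

def rw_mix(lines):
    """Fraction of read and write records in an iolog excerpt."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) == 4 and parts[1] in ("read", "write"):
            counts[parts[1]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {op: counts[op] / total for op in ("read", "write")}

# The first eight records of the randrw log above.
head = """\
/mnt/fio/out.fio read 61440 4096
/mnt/fio/out.fio write 774144 4096
/mnt/fio/out.fio write 880640 4096
/mnt/fio/out.fio read 491520 4096
/mnt/fio/out.fio write 417792 4096
/mnt/fio/out.fio write 786432 4096
/mnt/fio/out.fio write 397312 4096
/mnt/fio/out.fio read 368640 4096
""".splitlines()

print(rw_mix(head))   # {'read': 0.375, 'write': 0.625}
```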