My big archiving project is slow. It’s hard to iterate on solutions when each meaningful test can take 10+ hours, and in some cases, multiple days.
The most important bottleneck in the process is reading and writing to disks: the task is I/O-bound. The table below shows the read and write speeds according to the Blackmagic Disk Speed Test. All tests were run on macOS 10.14.
disk | read | write | type | interface |
---|---|---|---|---|
MBP2018-internal | 2533 MB/s | 2675 MB/s | SSD | NVMe |
archives-2019 | 82 MB/s | 86 MB/s | 5400 rpm HDD | USB-C |
photos | 80 MB/s | 88 MB/s | SSD (Samsung T5) | USB-C |
archives-2018 | 81 MB/s | 86 MB/s | 5400 rpm HDD | USB 3.1 |
MBP2013-internal | 400 MB/s | 459 MB/s | SSD | SATA |
GDrive ext | 189 MB/s | 187 MB/s | 7200 rpm HDD | Thunderbolt 2 + RAID 1 |
GDrive ext | 184 MB/s | 173 MB/s | 7200 rpm HDD | USB-C/3.0 |
The external speeds are consistent across both machines, and this test shows the best case. In real-world copying, the speed falls dramatically, sometimes to less than 1 MB/s, which I attribute to a combination of lots of hardlinks (see below) and, in some directories, hundreds of thousands of tiny files. I'm working on those two issues, but even so, these raw, best-case speeds seem inadequate to me, and I'm not sure why these disks are so slow.
Update: I think the limiting factor is input/output operations per second (IOPS, reported as tps in iostat). This Wikipedia article suggests that spinny disks (as opposed to SSDs) can sustain about 100-200 IOPS. Seeking to the sector where a file lives costs at least one operation, so this effectively limits a disk to reading 100-200 files per second, even if the files are tiny. That is really slow when there are millions of files. SSDs are ridiculously better at this kind of task.
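To put a number on it, here's the back-of-envelope arithmetic (the 150 IOPS and 2 million files are illustrative guesses, not measurements from my disks):

```python
# Rough lower bound: time spent just locating files on a spinning disk,
# ignoring the time to actually transfer their contents.
iops = 150                # assumed sustained IOPS for a 5400 rpm drive
n_files = 2_000_000       # illustrative archive size
hours = n_files / iops / 3600
print(f"~{hours:.1f} hours of pure seek time")   # ~3.7 hours
```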
That said, my 500GB Samsung T5 isn’t doing any better at the Blackmagic test, so I’m still a little vague about what’s going on.
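Blackmagic only measures big sequential transfers, so it doesn't really probe the IOPS theory. A more direct check is to time the creation of a pile of tiny files on each disk. A rough sketch (the file count, file size, and /Volumes/... paths are illustrative assumptions):

```python
import os
import shutil
import tempfile
import time

def tiny_file_rate(target_dir, count=10_000, size=4096):
    """Create `count` small files on the target disk, fsyncing each one so the
    write actually reaches the disk. Files/second is a rough proxy for write IOPS."""
    payload = os.urandom(size)
    workdir = tempfile.mkdtemp(dir=target_dir)
    start = time.monotonic()
    for i in range(count):
        with open(os.path.join(workdir, f"f{i:06d}"), "wb") as f:
            f.write(payload)
            os.fsync(f.fileno())
    elapsed = time.monotonic() - start
    shutil.rmtree(workdir)
    return count / elapsed

# e.g. tiny_file_rate("/Volumes/archives-2019") vs. tiny_file_rate("/Volumes/photos")
```

If the T5 comes out an order of magnitude faster here despite the similar Blackmagic numbers, the IOPS explanation holds.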
In re hardlinks: when I'm copying, I'm using something like `rsync -rtlOWSH`, which copies hard-linked files as hard links on the destination filesystem. `rsync` has a tough time with hardlinks because it needs to keep a table of all the inodes it has seen during the run; it eats up RAM and slows down as that table grows. I am writing a copy-by-inode tool to work around this problem.
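The core idea, sketched in Python (this is just the idea, not the finished tool; a real version needs to handle symlinks, permissions, and restarts):

```python
import os
import shutil

def copy_by_inode(src_root, dst_root):
    """Walk the source once; copy each inode's contents the first time it is seen,
    and recreate later paths to the same inode with os.link on the destination.
    Only hard-linked files (st_nlink > 1) get an entry in the inode table."""
    seen = {}  # (st_dev, st_ino) -> destination path already written
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, src_root)
        os.makedirs(os.path.join(dst_root, rel_dir), exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            if os.path.islink(src) or not os.path.isfile(src):
                continue                      # sketch: regular files only
            st = os.lstat(src)
            dst = os.path.join(dst_root, rel_dir, name)
            key = (st.st_dev, st.st_ino)
            if st.st_nlink > 1 and key in seen:
                os.link(seen[key], dst)       # later path: just link, don't re-copy
            else:
                shutil.copy2(src, dst)        # first sighting: copy data + metadata
                if st.st_nlink > 1:
                    seen[key] = dst           # remember where this inode landed
```

`shutil.copy2` preserves modification times, which is what the `-t` flag was doing in the `rsync` invocation above; the `seen` table only holds inodes that actually have multiple links.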