My big archiving project is slow. It’s hard to iterate on solutions when each meaningful test can take 10+ hours, and in some cases, multiple days.

The most important bottleneck in the process is reading and writing to disks: this task is I/O-bound. The table below shows the read and write speeds according to the Blackmagic speed test. All tests were run on macOS 10.14.

| disk | read | write | type | interface |
|------------------|-----------|-----------|------------------|-----------------------|
| MBP2018-internal | 2533 MB/s | 2675 MB/s | SSD | NVMe |
| archives-2019 | 82 MB/s | 86 MB/s | 5400 rpm | USB-C |
| photos | 80 MB/s | 88 MB/s | SSD (Samsung T5) | USB-C |
| archives-2018 | 81 MB/s | 86 MB/s | 5400 rpm | USB 3.1 |
| MBP2013-internal | 400 MB/s | 459 MB/s | SSD | SATA |
| GDrive ext | 189 MB/s | 187 MB/s | 7200 rpm | Thunderbolt 2 + RAID1 |
| GDrive ext | 184 MB/s | 173 MB/s | 7200 rpm | USB-C/3.0 |

The external speeds are consistent across both machines, and this test shows the very best case. In real-world copying, throughput drops dramatically, sometimes to less than 1 MB/s, which I attribute to a combination of lots of hardlinks (see below) and, in some directories, hundreds of thousands of tiny files. I'm working on these latter two questions (a rough census script is sketched below), but still, these raw, best-case speeds seem inadequate to me. I'm not sure why these disks are so slow.
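To put numbers on "lots of hardlinks" and "hundreds of thousands of tiny files", a small script can walk a tree and count files versus distinct inodes (hardlinked copies share an inode, so they count once). This is just an illustrative sketch, not part of my tooling:

```python
#!/usr/bin/env python3
"""Rough census of a directory tree: total files, distinct inodes, total bytes."""
import os
import sys

def census(root):
    files = 0
    total_bytes = 0
    inodes = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # unreadable entry; skip it
            files += 1
            total_bytes += st.st_size
            # hardlinked files share (device, inode), so the set deduplicates them
            inodes.add((st.st_dev, st.st_ino))
    return files, len(inodes), total_bytes

if __name__ == "__main__":
    f, i, b = census(sys.argv[1])
    print(f"{f} files, {i} distinct inodes, {b / 1e9:.1f} GB")
```

The gap between the file count and the inode count is a direct measure of how much hardlinking the copy tool has to preserve.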

Update: I think the limiting factor is input/output operations per second (IOPS, reported as tps in iostat). This Wikipedia article suggests that spinning disks (as opposed to SSDs) can sustain 100-200 IOPS. Seeking to the sector where a file lives costs at least one operation, so this effectively limits a disk to reading 100-200 files/s, even if the files are tiny. That is really slow when there are millions of files. SSDs are ridiculously better at this kind of task.
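The back-of-the-envelope arithmetic makes the scale obvious. Assuming roughly 150 IOPS and at least one seek per file (both assumptions, not measurements), a million small files takes hours just to locate, regardless of how few bytes they hold:

```python
# Rough IOPS arithmetic: assumes ~150 IOPS on a 5400 rpm disk and at least
# one seek per file; real numbers depend on layout, caching, and metadata writes.
files = 1_000_000
iops = 150
seconds = files / iops
print(f"{seconds / 3600:.1f} hours just to seek to each file")  # ~1.9 hours
```

And that is only the reads; writing each file on the destination costs its own seeks plus directory and metadata updates, so real copy times are considerably worse.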

That said, my 500GB Samsung T5 isn’t doing any better at the Blackmagic test, so I’m still a little vague about what’s going on.

Regarding hardlinks: when I'm copying, I'm using something like rsync -rtlOWSH, which recreates hard-linked files as hard links on the destination filesystem. rsync has a tough time with hardlinks because it needs to keep a table of every inode it has seen during the run, so it eats up RAM and slows down as that table grows. I am writing a copy-by-inode tool to work around this problem.
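For what it's worth, here is a minimal sketch of the copy-by-inode idea, not the actual tool I'm writing: copy each inode's data exactly once, then recreate every other path that shares that inode as a hardlink on the destination. It omits much of what rsync handles (ownership, sparse files, incremental updates), so treat it as an illustration of the approach only.

```python
#!/usr/bin/env python3
"""Sketch: copy a tree while preserving hardlinks by copying each inode once."""
import os
import shutil
import sys

def copy_by_inode(src_root, dst_root):
    seen = {}  # (st_dev, st_ino) -> destination path already written
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = dst_root if rel == "." else os.path.join(dst_root, rel)
        os.makedirs(dst_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_dir, name)
            st = os.lstat(src)
            if os.path.islink(src):
                # recreate symlinks as symlinks rather than following them
                os.symlink(os.readlink(src), dst)
                continue
            key = (st.st_dev, st.st_ino)
            if key in seen:
                # data for this inode already exists at the destination: just link it
                os.link(seen[key], dst)
            else:
                shutil.copy2(src, dst)  # copies contents plus timestamps/permissions
                seen[key] = dst

if __name__ == "__main__":
    copy_by_inode(sys.argv[1], sys.argv[2])
```

Note that this still keeps a table of seen inodes in memory, just like rsync does; the practical difference is that it only stores one destination path per inode and never has to hold rsync's full per-file metadata list.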