Archives for the month of: January, 2019

My big archiving project is slow. It’s hard to iterate on solutions when each meaningful test can take 10+ hours, and in some cases, multiple days.

The most important bottleneck in the process is reading and writing to disks: the task is I/O-bound. The table below shows the read and write speeds according to the Blackmagic speed test. All tests were run on macOS 10.14.

disk               read        write       type               interface
MBP2018-internal   2533 MB/s   2675 MB/s   SSD                NVMe
archives-2019        82 MB/s     86 MB/s   5400 rpm HDD       USB-C
photos               80 MB/s     88 MB/s   SSD (Samsung T5)   USB-C
archives-2018         81 MB/s    86 MB/s   5400 rpm HDD       USB 3.1
MBP2013-internal     400 MB/s   459 MB/s   SSD                SATA
GDrive ext           189 MB/s   187 MB/s   7200 rpm HDD       Thunderbolt 2 + RAID1
GDrive ext           184 MB/s   173 MB/s   7200 rpm HDD       USB-C/3.0

The external speeds are consistent across both machines, and this test shows the best case. In real-world copying, throughput sometimes falls below 1MB/s, which I attribute to a combination of lots of hardlinks (see below) and, in some directories, hundreds of thousands of tiny files. I'm working on those two problems, but even these raw, best-case speeds seem inadequate to me. I'm not sure why these disks are so slow.

Update: I think the limiting factor is input/output operations per second (IOPS, reported as tps in iostat). This Wikipedia article suggests that spinning disks (as opposed to SSDs) can sustain 100-200 IOPS. Finding the sector where a file lives costs one operation, so this effectively limits a disk to reading 100-200 files per second, even if the files are very small. That is really slow when there are millions of files. SSDs are ridiculously better at this kind of task.
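A rough back-of-envelope, using the ~8 million files mentioned below and assuming one seek per file (both numbers are loose):

files = 8_000_000             # roughly the file count in this archive (see below)
iops = 150                    # middle of the 100-200 IOPS range for spinning disks
hours = files / iops / 3600
print(f"{hours:.0f} hours")   # ~15 hours just to locate every file once

On that arithmetic, a full pass over the archive spends most of a day on seeks alone, before a single byte of file data moves.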

That said, my 500GB Samsung T5 isn’t doing any better at the Blackmagic test, so I’m still a little vague about what’s going on.

In re hardlinks: when I'm copying, I'm using something like rsync -rtlOWSH, which recreates hard-linked files as hard links on the destination filesystem. rsync has a tough time with hardlinks because it has to keep a table of every inode it has seen during the run; as that table grows, rsync eats RAM and slows down. I am writing a copy-by-inode tool to work around this problem.
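The tool isn't done, but the core idea is simple enough to sketch (hypothetical names, no error handling, and it assumes everything under src is a regular file): walk the source once, group paths by inode, copy each inode's data exactly once, then recreate the remaining names as hardlinks on the destination.

import os
import shutil
from collections import defaultdict

def copy_by_inode(src, dst):
    # Group every path under src by the inode it points at.
    by_inode = defaultdict(list)                     # inode -> [relative paths]
    for root, _, files in os.walk(src):
        for name in files:
            path = os.path.join(root, name)
            by_inode[os.lstat(path).st_ino].append(os.path.relpath(path, src))

    # Copy each inode's data once, then hardlink the other names to it.
    for paths in by_inode.values():
        first, *rest = paths
        target = os.path.join(dst, first)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy2(os.path.join(src, first), target)
        for other in rest:
            link = os.path.join(dst, other)
            os.makedirs(os.path.dirname(link), exist_ok=True)
            os.link(target, link)

At least in principle, the bookkeeping becomes one dict built in a single pass, rather than a table maintained alongside the transfer.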

One of the steps in my massive file-archiving project requires that I save all the paths, with their associated inodes and sizes, from each filesystem I intend to integrate. I've decided to save the info to a sqlite database (the link is a great tutorial: if you already know SQL but need the specific sqlite idioms, it's the page you want).

The table below shows several approaches to getting the filesystem data into the database. I’ll list the winning command here, then explain the alternatives:

find "$SRC" -type f -printf "%p|%i|%s\n" |\
	pv --line-mode -bta |\
	sqlite3 -bail "$FSLOCATION" ".import /dev/stdin paths"

This is GNU find; I'm not sure whether the BSD find that ships with macOS has the same options. You can install GNU find with Homebrew, and this link shows how to use the default name (i.e., find rather than Homebrew's gfind) to override the BSD find.

Anyway, find prints a pipe-delimited list of path, inode, and size to stdout; pv writes a nice progress message; and then sqlite imports directly from stdin. Note that you need to create the table (paths) before this step.
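For reference, the table just needs one column per field in find's output. Here's a minimal sketch of the setup, using Python's sqlite3 module and my guesses for the column names (the same thing could be done from the sqlite3 shell):

import sqlite3

con = sqlite3.connect("filesystem.db")   # whatever $FSLOCATION points to
con.execute("""
    CREATE TABLE IF NOT EXISTS paths (
        path  TEXT,
        inode INTEGER,
        size  INTEGER
    )
""")
con.commit()
con.close()

If memory serves, .import uses sqlite3's current separator, which defaults to the same | that find is emitting, so no .separator fiddling is needed (paths containing a literal | would break this, of course).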

test                                  time    comment
base                                  0.01s   just setup
find | sqlite                         0.17s   very simple
find -> tmp; sqlite .import           0.26s   not as clean, but simple
find -> tmp; python + sqlite import   0.22s   python buffers better?
os.walk over dirs + sqlite import     0.35s   find is much faster

This table shows results on a test directory of about 21GB containing about 30K files. Piping find straight into sqlite is considerably faster than the other options: redirecting find to a temporary file and then importing it is slower; having python read the temporary file and insert the values into sqlite is a little better than sqlite's own import (I think python parses the file faster than sqlite's import does; I'm using pandas to parse the file); and using python's os.walk instead of find is much slower.
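For what it's worth, the python+sqlite variant from the table is roughly this (a sketch with made-up filenames, not the exact script):

import sqlite3
import pandas as pd

# Parse the pipe-delimited dump that find wrote to a temporary file...
df = pd.read_csv("paths.tmp", sep="|", names=["path", "inode", "size"])

# ...and bulk-insert it into the existing paths table.
con = sqlite3.connect("filesystem.db")
df.to_sql("paths", con, if_exists="append", index=False)
con.close()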

My guess is that the find | sqlite option benefits from a bit of concurrency and buffering. The shell (zsh, in this case) just sets up the pipe; the pipe's buffer holds a chunk of find's output while sqlite imports it, letting find keep running while sqlite works. On a much bigger directory, I can see both find and sqlite using CPU time. Eventually everything slows to the speed of the slowest process, but the buffer is big enough that both mostly run at once.

This is a big help for my upcoming tool, a mass file-copy script that doesn't choke on tons of hardlinks (which cp and rsync most definitely do).

Among the files I want to organize in this giant archiving project are photos. These could be scanned images of old paper photos, jpgs from my phone or shared with me, or jpgs and raw files from a couple of decades of electronic photography.

The problem is that the files are scattered across backup systems that go back decades. To collect all the images, I wrote a little python script called getpix.py. (Note that the filename is a hyperlink to the GitHub gist, which I'm double-linking because WordPress doesn't format a code literal that is also a hyperlink in an intuitive way.)

Anyway: the script recursively descends a source directory and moves every image it finds to a destination directory in the format bydate/YYYY/MM/DD.

In every directory, the script runs exiv2 on every file (this could be improved by making the subprocess call to find smarter). Files that carry a timestamp are sorted by it; if not, and there's a timestamp in the path (which Apple photo directories often keep), that timestamp is used. One could add a final fallback to the file's ctime, but at least for me, the file metadata is so badly mangled that it provokes more confusion than enlightenment.

Files with no dates are sorted into no_date.
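The gist has the real code; the core logic is roughly this (a simplified sketch: the regex, the exiv2 parsing, and the no_date layout are all my approximations):

import os
import re
import shutil
import subprocess

DATE_RE = re.compile(r"(\d{4})[-:/](\d{2})[-:/](\d{2})")

def guess_date(path):
    # Ask exiv2 for the image's timestamp; fall back to a date embedded in the path.
    try:
        out = subprocess.run(["exiv2", path], capture_output=True, text=True).stdout
    except FileNotFoundError:            # exiv2 not installed
        out = ""
    return DATE_RE.search(out) or DATE_RE.search(path)

def sort_images(src, dst):
    # dst is the bydate root; undated files land in dst/no_date.
    for root, _, files in os.walk(src):
        for name in files:
            path = os.path.join(root, name)
            match = guess_date(path)
            subdir = os.path.join(*match.groups()) if match else "no_date"
            target_dir = os.path.join(dst, subdir)
            os.makedirs(target_dir, exist_ok=True)
            shutil.move(path, os.path.join(target_dir, name))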

The resulting bydate structure can be dragged into the rest of the archiving process. There will be lots and lots and lots of duplicate images, and this is a gigantic PITA. There is a Right Way To Do It: use PhotoSweeper. This app will review all the images, link duplicates, and delete extras using configurable and intelligent defaults.

Note to self: do not try to do this inside Lightroom, what a mess that is.

I’m left with about 30K images, which Lightroom can handle without even turning on the CPU fan. This is a step forward.

I have a lot of data from a lot of years: about 5TB and around 8 million files. It's very redundant, with lots of copies of the same stuff, and many of the files are tiny, e.g., 100,000 1-2KB files in a Maildir.

Most of the data are now on medium-sized external disks (2-8TB each) accessed via USB or Thunderbolt. It’s time to get everything onto a small set of usable disks (I’ve tried this before and I didn’t get very far).

One of the things that slows me down is that no matter how I set up the copy (cp, rsync, Finder), after a few minutes it slows to a crawl. These are reasonably fast disks on USB 3.0 or Thunderbolt 2. The r/w speed of the disks should be around 150MB/s and the connection is 5Gb/s, but I'd often see read speeds around 0.5MB/s. Ooof. You're not going to move a terabyte at that speed.

And now I think I know what's happening: the directories get disorganized. I'm frustrated that I can't figure out exactly what that means, but I discovered that after running DiskWarrior on the offending drive, it now copies at 50-100MB/s (I'm using iostat to watch the r/w speeds). A big, big win.

Of course, APFS makes this useful knowledge nearly obsolete. Ah, the story of my life, learning useful stuff just as it becomes a kind of vintage affectation.

Gotta go, I’m going to write some shell scripts to make my terminal prompt look cool.

You thought I was kidding?