One of the steps in my massive file-archiving project requires that I save all the paths with their associated inodes and sizes from each filesystem I intend to integrate. I’ve decided to save the info to a sqlite database (the link has a great tutorial: if you already know how to use SQL but need the specific sqlite idioms, this is a great page).

The table below shows several approaches to getting the filesystem data into the database. I’ll list the winning command here, then explain the alternatives:

find "$SRC" -type f -printf "%p|%i|%s\n" |\
	pv --line-mode -bta |\
	sqlite3 -bail "$FSLOCATION" ".import /dev/stdin paths"

This is gnu find, I’m not sure if the BSD find that ships with MacOS has the same options. You can install gnu find with homebrew, and this link shows you how to use the default name (i.e., find rather than homebrew’s gfind) to override the BSD find.

Anyway, find prints a pipe-delimited list of path, inode, and size to stdout; pv writes a nice progress message; and then sqlite imports directly from stdin. Note that you need to create the table (paths) before this step.

test speed comment
base 0.01s just setup
find | sqlite 0.17s very simple
find->tmp; sqlite import 0.26s not as clean but simple
find->tmp; python+sqlite import 0.22s python buffers better?
os.walk over dirs +sqlite import 0.35s find is much faster

This table shows results on a test directory of about 21G including about 30K files. The find-piped to-sqlite is considerably faster than the other options: it’s slower to redirect find to a temporary file, then import it; it’s a little better to have python read the temporary file then insert the values into sqlite (I think python parses the file faster than sqlite’s import does: I’m using pandas to parse the file); and then using python’s os.walk instead of find is much slower.

My guess is that the find | sqlite option benefits from a bit of concurrency and smart buffering. The shell (zsh, in this case) is getting a chunk of data from find and passing it to sqlite, letting find run while sqlite does the import. On a much bigger directory, I can see both find and sqlite using CPU time. Eventually everything slows to the speed of the slowest process, but the buffer is big and both can happen mostly at once.

This is a big help for my coming tool which is a mass file-copy script that doesn’t choke on tons of hardlinks (which cp and rsync most definitely do).