One of the steps in my massive file-archiving project requires that I save all the paths, with their associated inodes and sizes, from each filesystem I intend to integrate. I’ve decided to save the info to a sqlite database (the link is a great tutorial: if you already know SQL but need the specific sqlite idioms, it’s the page you want).
The table below shows several approaches to getting the filesystem data into the database. I’ll list the winning command here, then explain the alternatives:
find "$SRC" -type f -printf "%p|%i|%s\n" |\
pv --line-mode -bta |\
sqlite3 -bail "$FSLOCATION" ".import /dev/stdin paths"
This is GNU find; I’m not sure whether the BSD find that ships with macOS has the same options. You can install GNU find with Homebrew, and this link shows you how to use the default name (i.e., `find` rather than Homebrew’s `gfind`) to override the BSD find.
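For what it’s worth, the install step is a one-liner (assuming Homebrew’s `findutils` formula, which is what provides the `g`-prefixed tools):

```sh
# findutils installs GNU find as gfind (alongside gxargs, glocate, etc.)
brew install findutils
```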
Anyway, find prints a pipe-delimited list of path, inode, and size to stdout; pv writes a nice progress message; and then sqlite imports directly from stdin. The pipe-delimited output matches the sqlite3 shell’s default list-mode separator (`|`), so `.import` needs no extra configuration. Note that you need to create the table (`paths`) before this step.
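Here’s a minimal sketch of that setup step. Only the table name (`paths`) and the three fields find emits are fixed by the pipeline above; the column names and types are my assumptions:

```sh
# path, inode, size -- matching find's "%p|%i|%s" output (names are assumed)
sqlite3 "$FSLOCATION" "CREATE TABLE paths (path TEXT, inode INTEGER, size INTEGER);"
```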
| test | speed | comment |
|---|---|---|
| base | 0.01s | just setup |
| find \| sqlite | 0.17s | very simple |
| find->tmp; sqlite import | 0.26s | not as clean but simple |
| find->tmp; python+sqlite import | 0.22s | python buffers better? |
| os.walk over dirs + sqlite import | 0.35s | find is much faster |
This table shows results on a test directory of about 21 GB containing roughly 30K files. Piping find straight into sqlite is considerably faster than the other options: redirecting find to a temporary file and then importing it is slower; having Python read the temporary file and insert the values into sqlite is a little better (I think Python parses the file faster than sqlite’s import does; I’m using pandas to parse the file); and using Python’s `os.walk` instead of find is much slower.
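For reference, the temp-file variant from the table looks something like this (my reconstruction; the file name is arbitrary):

```sh
# Two steps instead of one pipeline: find writes everything to disk first,
# then sqlite reads it back -- simple, but measurably slower in the table above.
find "$SRC" -type f -printf "%p|%i|%s\n" > /tmp/paths.txt
sqlite3 -bail "$FSLOCATION" ".import /tmp/paths.txt paths"
```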
My guess is that the `find | sqlite` option benefits from a bit of concurrency and smart buffering. The shell (zsh, in this case) gets a chunk of data from find and passes it to sqlite, letting find keep running while sqlite does the import. On a much bigger directory, I can see both find and sqlite using CPU time. Eventually everything slows to the speed of the slowest process, but the buffer is big and both can run mostly at once.
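If you want to watch that concurrency yourself, here’s a rough sketch (my own, not from the post; `ps -C` is procps/Linux syntax, and pv is dropped to keep the pipeline bare):

```sh
# Run the pipeline in the background, then sample CPU usage of both ends.
find "$SRC" -type f -printf "%p|%i|%s\n" |
    sqlite3 -bail "$FSLOCATION" ".import /dev/stdin paths" &
pipeline=$!
while kill -0 "$pipeline" 2>/dev/null; do
    ps -C find,sqlite3 -o pid=,%cpu=,comm=
    sleep 1
done
```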
This is a big help for my upcoming tool: a mass file-copy script that doesn’t choke on tons of hardlinks (which cp and rsync most definitely do).