Among the files I want to organize in this giant archiving project are photos. These could be scanned images of old paper photos, jpgs from my phone or shared with me, or jpgs and raw files from a couple of decades of electronic photography.
The problem is that the files are scattered across backup systems that go back decades. To collect all the images, I wrote a little Python script called getpix.py. (Note that the filename is a hyperlink to the GitHub gist, which I'm double-linking because WordPress doesn't format a code literal + hyperlink in an intuitive way.)
Anyway: the script recursively descends a source directory and moves every image it finds to a destination directory in the format bydate/YYYY/MM/DD.
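For flavor, here is a minimal sketch of that descent in Python (not the gist itself). The extension list and the get_date callable, which should return a (yyyy, mm, dd) tuple of strings or None, are my assumptions; a sketch of get_date follows the date-lookup discussion below.

```python
import shutil
from pathlib import Path

# Assumed extension list; the real script may filter differently.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".heic", ".cr2", ".nef", ".dng"}

def sort_images(src: Path, dest: Path, get_date) -> None:
    """Move every image under src into dest/YYYY/MM/DD, or dest/no_date."""
    for path in src.rglob("*"):
        if not path.is_file() or path.suffix.lower() not in IMAGE_EXTS:
            continue
        date = get_date(path)  # expected: ("2014", "06", "21") or None
        subdir = dest.joinpath(*date) if date else dest / "no_date"
        subdir.mkdir(parents=True, exist_ok=True)
        # Note: name collisions silently overwrite; a real run should rename.
        shutil.move(str(path), str(subdir / path.name))
```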
At every directory, the script runs exiv2 on every file (this could be improved by making the subprocess call to find smarter). Files that have an EXIF timestamp use it for the directory sorting. If not, and there's a timestamp in the path (which Apple photo directories often keep), that timestamp will be used. One could add a final fallback to the file's ctime, but at least for me, the file metadata is so badly mangled that it provokes more confusion than enlightenment.
Files with no dates are sorted into no_date.
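Here is roughly what that fallback chain could look like, again as a hedged sketch rather than the gist's actual code: the exiv2 invocation uses its -g (grep by key) and -Pv (print value only) flags, and the path regex is my guess at what an Apple-style date path looks like.

```python
import re
import subprocess
from pathlib import Path

# Guess at Apple-style date paths, e.g. .../2014/06/21/...
PATH_DATE = re.compile(r"((?:19|20)\d{2})[/\-_](\d{2})[/\-_](\d{2})")

def exif_date(path: Path):
    """Ask exiv2 for the capture timestamp; returns (yyyy, mm, dd) or None."""
    try:
        out = subprocess.run(
            ["exiv2", "-g", "Exif.Photo.DateTimeOriginal", "-Pv", str(path)],
            capture_output=True, text=True, timeout=10,
        ).stdout
    except (OSError, subprocess.TimeoutExpired):
        return None
    m = re.match(r"(\d{4}):(\d{2}):(\d{2})", out.strip())  # "YYYY:MM:DD HH:MM:SS"
    return m.groups() if m else None

def get_date(path: Path):
    """EXIF timestamp first, then a date embedded in the path, else None."""
    d = exif_date(path)
    if d:
        return d
    m = PATH_DATE.search(str(path))
    return m.groups() if m else None
```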
The resulting bydate structure can be dragged into the rest of the archiving process. There will be lots and lots and lots of duplicate images, and this is a gigantic PITA. There is a Right Way To Do It: use PhotoSweeper. This app will review all the images, link duplicates, and delete extras using configurable and intelligent defaults.
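(Deduplication is PhotoSweeper's job, not something I scripted, but a crude byte-identical pass is easy to sketch, and it shows why a dedicated tool earns its keep: a content hash only catches exact copies, not resized or re-encoded variants.)

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def exact_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under root by the SHA-256 of their bytes."""
    groups: defaultdict[str, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            # Reads whole files into memory; fine for a one-off pass,
            # but stream in chunks if your raws are huge.
            groups[hashlib.sha256(path.read_bytes()).hexdigest()].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```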
Note to self: do not try to do this inside Lightroom, what a mess that is.
I’m left with about 30K images, which Lightroom can handle without even turning on the CPU fan. This is a step forward.