RAG SoftDoc: Notes on Practical Computing: Problem Solving: An Example of Using Tools To Enhance Software Usability

The context for this post is dealing with duplicate files. One of the add-ons I use for the Thunderbird email client is "remove duplicate messages". I've occasionally had hiccups, e.g., in downloading my core gmail account messages, to Thunderbird where the same messages apparently weren't purged on the server and get downloaded again. What the add-on does is checking for duplicate sets of messages and then provides a default filter which retains 1 email from each set and deletes the duplicates. (I've also discovered vendors may resend the same emails over time, so I also occasionally run the add-on in my other email folders.)

Now over the past several years I had retained over 200 CD's and later DVD's of various files I have saved over the years: family photos, pdf files, licensed software (e.g., my discontinued AskSam software), various audio files ripped from my CD collection, emails, etc, In many cases, I did not rename files, say 1.jpg downloaded from the Internet. Trying to manage my collections of files was difficult: I often had to use a tool like search everything to hunt for files. For example, after I had to migrate from a failed PC, I needed to search for my license information to reinstall AskSam plus my backup copies of the software itself. (I maintain a couple of external backup drives which held individual media image contents.) (Yes, I'm aware I could maintain license information in a spreadsheet. In this case, the license number is maintained in an asksam ini file, which is what I actually stored.)

It took time and effort to consolidate the contents of the media images into two primary folders with subfolders. In the process, I ran into thousands of file name conflicts, which I resolved by a filefrogger rename utility on source files (e.g., on disk 200, I renamed 1.jpg to 1_200.jpg). I knew that I had perhaps hundreds, if not thousands of duplicate files (not just names, but content).

The tool I chose to use in dedupping these massive folders is duplicate finder free. Now keep in mind that it's very possible there is functionality I may not know about in the free or paid versions, but I'm looking at what the design seems to suggest. I basically point to a file folder to do a search and initiate the scan/search. Eventually what pops up is a list of sets of duplicate sets, alternately color coded. There are boxes to mark files for deletion. There didn't seem to be what I considered obvious default functionality, say, to retain the first record in each set, and if you are dealing with thousands of file sets, this is very tedious busy work. (After you mark files for deletion, you then trigger the delete functionality.)

In the interim, I was interested in how I could save in-process duplicate lists, e.g., if the generated duplicate list did not appear after a system bounce. I then discovered that in the File options there is export/import function for *.csv files. This was my aha moment, because I knew I could use spreadsheet functionality to manipulate the first (delete file mark) column.

The basic solution starts with the observation that the CSV file was in sequence by the set id column (I). This means that the problem is a variation of a control break program. For the purposes of this example, I'm using Open Office Calc, but similar steps can be adapted for your preferred spreadsheet product:

Open the exported CSV file in Calc by choosing the comma delimiter (which oddly enough is not default)
In cell A2 (the first data mark cell), enter a formula, e.g., =if (i2=i3; 1;0) . Let me explain the logic: I'm intending to retain the last file of each set. So suppose record 4 (record 3 if you don't count the header record) is the last of the first set (ID 1). Record 5 would be the first of the second set (say, ID 2). So I4 != I5, which means A4 has value 0, while A2,A3=1.
Copy A2 and then paste it in A3:An where n is the number of duplicate file records (including the header record). Suppose n=2000. Enter A3:A2000 in the selection box, right-click on highlighted cells and select paste.
Convert A2:An into values: using the same example, select A2:A2000, right-click Copy; right-click Paste Special (ensure the NUMBER and only NUMBER column format option is selected).
Save the revised spreadsheet in the original CSV format. Exit Calc.
Open up Duplicate Finder Free, import the revised spreadsheet. Now invoke the delete files option.

Monday, July 11, 2016

Problem Solving: An Example of Using Tools To Enhance Software Usability