Thursday, May 4, 2023

File Systems and Archives: An Anecdotal Usability Note

In my day job as a database administrator, objects like tables are stored in extents (blocks of contiguous bytes), typically grouped with those of related objects in tablespace datafiles. If and when an object needs to grow to accommodate more records it tries to find available space in already allocated resources; if not, it tries to find big enough free spaces in preallocated existing datafiles. Traditionally we had to resize or add datafiles to accommodate database growth. Oracle made things more user-friendly for DBA's by allowing existing datafiles to autoextend on storage devices subject to available capacity/quota constraints, subject to certain design constraints, or add new datafiles across storage devices. By design constraints, I'm referring to classic structures like up to 4 million or so datafile blocks (lowest unit of bytes), with a default 8K block size. This led to de facto maximum 32 GB datafiles, even if there was additional storage capacity on relevant devices. So typically to avoid a showstopper failure of an object not being able to grow, I might define new extendible datafiles within or across accessible storage devices. I still needed to monitor storage utilization and fragmentation of free spaces in tablespace datafiles, but less "busy work" of manually maintaining datafiles than in earlier times, not to mention reorganizing datafiles across devices.

There are multiple approaches to securing sensitive files; for example, you can password-protect Word documents and pdf files.  There are system volume tools like Bitlocker. And then there are virtual encryption volume software tools like VeraCrypt and SafeHouse Explorer. Now there are a couple of analogous concepts at play, the volume, roughly comparable to concept of a tablespace datafile, and file system support for very large files vs. the 32 GB limit. In the case of the former, you create a volume or logical disk of a specified size, e.g., 10G, which you can mount on a drive letter by supplying a predefined password. In the latter case you may need a file system like NTFS if you're trying to store very big (>4 GB) on the volume. Some media, e.g., flash drives, may by default use a legacy file system NOT supporting very large files. Usually, I try to store redundant backup and/or volumes. I noticed VeraCrypt explicitly during volume creation asked if I needed very large file support .SafeHouse Explorer didn't.

So, here's the setup for my usability incident. I use a very nice freeware product, MailStore Home as my email archive. My principal email client is Thunderbird which uses an MBOX format for email folders. There are Thunderbird add-ons which enable import and export MBOX files. Mailstore has an archiving solution which is proprietary but can archive newer emails.  Usually, I'll do a daily archival into Mailstore and manually consolidate them into categories monthly. Mailstore also allows me to export/restore email folders back to email clients (including Outlook and its similar PST file construct).

So, I recently exported some of my consolidated folders for dedupping and trimming, saving the results in MBOX format. It turned out, and I didn't notice, that 2 of those files were over 6 GB .

I created a 40 GB VeraCrypt volume, mounted it , and copied my relevant MBOX and PST files to it without incident. I created a slightly larger SafeHouse volume. So, I was puzzled why Windows argued  my finance file was "too big" for my virtual SafeHouse  volume; I had more than enough slack space on the volume but the file was "too big" for the file system; I then checked the file systems under the two volumes; indeed the SafeHouse volume was using a legacy MS file system without very large file. Support.

There are some easy workarounds. For example, I could use the Thunderbird export/import add-on to dump folder eml files and run a simple bash script assigning even and odd numbered emails to separate directories and then importing the directories and exporting them as separate MBOX files.

But is there a way to change the volume to a different, nondestructive, accommodative file system? Yup.

C:\Windows\system32>convert J: /fs:ntfs
The type of the file system is FAT32.
Enter current volume label for drive J: MBOX23A
Volume MBOX23A created 5/3/2023 6:23 PM
Volume Serial Number is 2CCC-EAD4
Windows is verifying files and folders...
File and folder verification is complete.
Windows has scanned the file system and found no problems.
No further action is required.
   52,415,984 KB total disk space.
   29,836,240 KB in 66 files.
   22,579,728 KB are available.
       16,384 bytes in each allocation unit.
    3,275,999 total allocation units on disk.
    1,411,233 allocation units available on disk.
Determining disk space required for file system conversion...
Total disk space:              52428800 KB
Free space on volume:          22579728 KB
Space required for conversion:   131576 KB
Converting file system
Conversion complete

I was then able to copy over my two 6 GB MBOX files without issue;

Saturday, April 8, 2023

Notes on Dedupping a VLC Playlist on Android

 I have an extensive library of licensed music tracks stored on my Android SD card. I connect my phone to my car by Bluetooth, and I use VLC to play custom playlists, added to from searched track options. I usually play my core playlist in random order. Somehow my core playlist had ballooned to over 500 tracks, many of which I suspected to be duplicates. I later confirmed that there were only 188 unique tracks. I was annoyed this had happened. (Oddly, when I recently added some Carpenters tracks onto the playlist, I did get prompted in trying to add a duplicate track from my card to the same playlist, which I would expect. However, the popup query seems to suggest duplicates are allowed. I don't recall seeing popup warnings adding tracks in the past.)

I was more familiar with specific playlist files on my Windows PC. For example, I have a backlog of over 3000 Cato Institute podcast episodes and I have a playlist arranged in a certain order, which I can update after episodes are played. On Android, however, VLC stores playlists in a sqlite database.

Now SQL (structured query language) is the lingua franca for relational databases (originally developed by IBM); I took a couple of graduate school database classes at UH, taught undergrad DBMS at UTEP, and wrote a minimal manual on SQL for a research project I did with Minnie Yen (Alaska-Anchorage). In addition, I've spent almost all my post-academic career as a DBA, predominantly Oracle but some exposure to SQL Server. I really haven't dealt with sqlite but I was confident I could figure out to get what I wanted.

The basic solution was:

  • dump the media database on Android and transfer it to my PC
  • download sqlite tools to my PC
  • load/mount the database, identify the target table and columns, and execute a relevant query to identify duplicated tracks.
Note this still left a manual task of doing identified duplicate track searches on the playlist in VLC Android and deleting the duplicate references. This is largely a workaround because more direct access would likely involve rooting the phone.

In VLC Android you can dump a copy of VLC database by going to app settings, advanced settings and select the dump media DB option. On my phone, the generated file was vlc_media.db. I transferred the file to my PC. in Windows.

If your PC doesn't have the sqlite3 command, you may need to download it, e.g., sqlite-tools-win32-x86-3410200.zip from https://www.sqlite.org/download.html (pan down to Precompiled Binaries for Windows) [note the link is valid at post publication].

I unzipped the file and navigated to the folder at the command prompt.

sqlite3 [path]\vlc_media.db

where [path] is the location for your dumped media database 

SQLite version 3.41.2 2023-03-22 11:56:21
Enter ".help" for usage hints.
sqlite>

To get a list of tables:

sqlite> .tables
Album                   FolderFts_segments      Movie
AlbumFts                Genre                   Playlist
AlbumFts_content        GenreFts                PlaylistFts
AlbumFts_segdir         GenreFts_content        PlaylistFts_content
AlbumFts_segments       GenreFts_segdir         PlaylistFts_segdir
Artist                  GenreFts_segments       PlaylistFts_segments
ArtistFts               Label                   PlaylistMediaRelation
ArtistFts_content       LabelFileRelation       Settings
ArtistFts_segdir        Media                   Show
ArtistFts_segments      MediaArtistRelation     ShowEpisode
AudioTrack              MediaFts                ShowFts
Bookmark                MediaFts_content        ShowFts_content
Chapter                 MediaFts_segdir         ShowFts_segdir
Device                  MediaFts_segments       ShowFts_segments
DeviceMountpoint        MediaGroup              SubtitleTrack
File                    MediaGroupFts           Task
Folder                  MediaGroupFts_content   Thumbnail
FolderFts               MediaGroupFts_segdir    ThumbnailCleanup
FolderFts_content       MediaGroupFts_segments  ThumbnailLinking
FolderFts_segdir        Metadata                VideoTrack

To describe a table, I use: pragma table_info(my table name);

There are 3 tables I'm particularly interested in: Playlist; Media; PlaylistMediaRelation.

sqlite> pragma table_info(playlist);
0|id_playlist|INTEGER|0||1
1|name|TEXT|0||0
2|creation_date|UNSIGNED INT|1||0
3|artwork_mrl|TEXT|0||0
4|nb_video|UNSIGNED INT|1|0|0
5|nb_audio|UNSIGNED INT|1|0|0
6|nb_unknown|UNSIGNED INT|1|0|0
7|nb_present_video|UNSIGNED INT|1|0|0
8|nb_present_audio|UNSIGNED INT|1|0|0
9|nb_present_unknown|UNSIGNED INT|1|0|0
10|duration|UNSIGNED INT|1|0|0
11|nb_duration_unknown|UNSIGNED INT|1|0|0

sqlite> pragma table_info(media);
0|id_media|INTEGER|0||1
1|type|INTEGER|0||0
2|subtype|INTEGER|1|0|0
3|duration|INTEGER|0|-1|0
4|last_position|REAL|0|-1|0
5|last_time|INTEGER|0|-1|0
6|play_count|UNSIGNED INTEGER|1|0|0
7|last_played_date|UNSIGNED INTEGER|0||0
8|insertion_date|UNSIGNED INTEGER|0||0
9|release_date|UNSIGNED INTEGER|0||0
10|title|TEXT|0||0
11|filename|TEXT|0||0
12|is_favorite|BOOLEAN|1|0|0
13|is_present|BOOLEAN|1|1|0
14|device_id|INTEGER|0||0
15|nb_playlists|UNSIGNED INTEGER|1|0|0
16|folder_id|UNSIGNED INTEGER|0||0
17|import_type|UNSIGNED INTEGER|1||0
18|group_id|UNSIGNED INTEGER|0||0
19|forced_title|BOOLEAN|1|0|0
20|artist_id|UNSIGNED INTEGER|0||0
21|genre_id|UNSIGNED INTEGER|0||0
22|track_number|UNSIGEND INTEGER|0||0
23|album_id|UNSIGNED INTEGER|0||0
24|disc_number|UNSIGNED INTEGER|0||0
25|lyrics|TEXT|0||0


sqlite> pragma table_info(PlaylistMediaRelation);
0|media_id|INTEGER|0||0
1|playlist_id|INTEGER|0||0
2|position|INTEGER|0||0

The last table is where I'll find the duplicates; I use the media table to identify the track, and I use the playlist table to get the id for what I call the playlist in VLC:

sqlite> select id_playlist, name from playlist;
1|rag001
2|rag003
3|rag002
4|xmas
 
RAG001 is what I call the long playlist in VLC, so 1 is the id I what I want to use in querying PlaylistMediaRelation.

So here is the query I'm using to identify the duplicate tracks in my playlist:

sqlite> select filename from media where id_media in (select media_id from playlistmediarelation
(x1...> where playlist_id=1 group by media_id having count(*) >1);
03 If I Thought You'd Ever Change Yo.m4a
001 Feels So Right.m4a
002 Love in the First Degree.m4a
001 Rock Me Gently.m4a
01 (Where Do I Begin) Love Story.mp3
06 Speak Softly Love (Love Theme fro.mp3
11 A Song For You.mp3
03 Don't Worry Baby.mp3
001 Woman in Love.m4a
002 Somewhere [From West Side Story].m4a
[and the list goes on...)

Thursday, March 16, 2023

Some Notes on Migrating Thunderbird

It just occurred to me I haven't posted here in over a year. Just a related note: I'll often post shorter usability segments in my signature blog (rguillem.blogspot.com) journal posts (#xxxx J)

To begin this post, the context is a workhorse laptop that has seen better days. On a trip to Texas, the power button dislodged and I now have to use a narrow-tip screwdriver to manually restart the system. I started running into baffling issues with USB devices, and my touchpad and touchscreen stopped working. (I thought initially they were driver issues one day when I had disconnected devices in dealing with reboot issues, functionality returned). I had started using a new computer for browsing and already designated it for my new workhouse. One of the first steps was migrating my email client, Thunderbird, with various local folders and related filters.

Looking over past posts. one particular focus has been emails. For the longest time I used email clients as a bridge solution to an email archive. I used Outlook Express during its 1996-2008 lifetime and developed my own corruption/archiving/migration tools. Mozilla Thunderbird started in 2003. When Microsoft transitioned to Vista and Outlook Express' successor, Microsoft Mail, I ran into chronic usability issues starting MS Mail with any nontrivial number of emails, and I soon thereafter migrated to Thunderbird, which provided startup times comparable to what I was used to with Outlook 

My archiving solution used to be AskSam, a freestyle database software I first licensed in the late 1980's. Probably my biggest use was to store my emails and write some powerful and flexible queries on contents, sender, dates, etc. It's last version (7) was released in 2008 and I think its websites had stopped to function by some time in 2013 (by a user observation I found via Google) I know I had an operational copy on one of my PC's and had exported some of my askSam files as of 2016 in text format. I had tried to reinstall on this PC I'm drafting this post and failed, even though I was sure I had my legitimate serial number. It wasn't clear why it was failing. I hypothesized it was doing some validation through the Internet to some server online no longer operable. I couldn't find a copy of the free viewer--in my files or via Google. [MyInfo implied it had an add-on that could read *.ASK files. Very misleading: you have to export data in a specific format, which means chicken-or-the-egg: you have to have a functioning version of AskSam to generate the data in the required format. Not very helpful,]

It had struck me I hadn't tried to install AskSam on the existing laptop. I didn't expect any difference since both PCs run essentially the same version of Windows. To my surprise, I was able to install and get it to accept my license number; there are some quirks, e.g., I got a popup to register online, which didn't work, of course, but it allows you to register offline. I'm not going back to resume archive at least the bulk of my emails, but back around 2016, I had exported only a fraction of my files, so I may export the remaining files which I haven't done to date

Now, of course, Thunderbird has rudimentary search utilities plus useful add-ons. For example, occasionally I have seen redundant emails downloaded, and so one add-on allows us to dedup emails. Another is an export-import tool which in theory would allow me to bridge restoring groups of emails back up to Gmail.

But in a manner similar to how I used to archive emails in AskSam, I started using free (for personal use) MailStore Home, which supports Thunderbird emails to local directory clones, typically on external drives (and these are periodically replicated elsewhere). So, I'll typically update Mailstore daily. The MailStore local folders are monthly transferred to perm folders. Hence, if my PC dies, I shouldn't lose more than a day of emails. This does not control geographical risk or home loss, but I do maintain some cloud archives.

So how did I migrate? Well, first, you need to install at least one email address for the Thunderbird install, typically my core gmail account. [Note: I use the POP vs. IMAP download messages to PC option.] Typically, I store my local folders on a cloud-mirrored mount point and override the local default. in Thunderbird. The message filter file exists but doesn't seem to work until I replace the target Appdata\Roaming\Thunderbird\Profiles\xxxxx-default-release with a copy of the corresponding contents from the source. After a restart you should see all your email accounts from the source and you may need to update passwords.

I ran into two practical problems: configuration of my Microsoft (live.com) email account and a problem with saving drafts in my core gmail account.

I followed an analogous procedure to this University of Texas manual configuration for my outlook.com account. I think the default configuration gives you a separate popup of your emails.

In a past post I discussed Enigmail, a PGP add-on to Thunderbird. More recently, Thunderbird has built in support for OpenPGP. Almost no one I correspond with has set up puhlic/private key pairs, so I've mostly used it among my own accounts. more for a proof of concept/demo case. Anyway, I think I first got a hint it was encryption-related here. At some point I got the a more explicit key message and ended manually deleting referenced lines in prefs.js in the profile release directory. In hindsight, it may have been all that was needed was to disable the option in the end-to-end encryption panel for the gmail account and/or export/copy the certs in question from the source and import them on the target.