One of the key concepts in the discussion of usable technology is its natural or intuitive design: we immediately realize how to use it from the context of our own knowledge and experience. I know how to use a new toaster out of the box, without reading an accompanying instruction pamphlet or asking for someone's help. If I come to a door with a flat vertical metal panel, I know to push; if I see a vertical handle, I know to pull. Perhaps as infants or toddlers, we learned shapes using shape sorting cubes or similar toys: we know how to connect two- or three-pronged plugs to electrical outlets, or we know where and how to attach a network or monitor cable, power plug, or USB device on a laptop computer.
If and When Technology Fails, It Should Provide Usable Feedback For Restoring Functionality
Error codes or messages from the human-computer interface can be misleading, demeaning, unintuitive or obscure. Computer programmers and administrators, in particular, deal with complex software and systems; when the computer's feedback fits the technologist's background and way of thinking, the target task can be resolved quickly and effectively.
Quite often, the interface is poorly designed and implemented and provides unusable feedback. Harry McCracken has compiled, from an IT professional's perspective, a tongue-in-cheek "The Thirteen Greatest Error Messages of All Time" (and a sequel, "The 13 Other..."); perhaps my favorite, from a reader contribution, is: "Keyboard failure. Press F1 to continue."
Inadequate Error Trapping: A Real-World Example
I have a relevant example from my own professional experience. A large defense conglomerate had failed twice in attempts to install a new version of Oracle Application Server. There was a legacy GUI-based application built with an older version of Oracle Developer Server. The subcontractor's Oracle developers told the military installation that they had already upgraded the application for migration to Application Server. As a former Oracle Consulting consultant, I was assigned to the project (with the client manager's approval) and successfully installed a functional Application Server on the target test server. The vendors reported that my implementation was defective because the application wouldn't run on it.
Treating the application as a black box, I knew the Application Server itself was functioning properly. So I asked the principal developer to design a simple "hello world"-type application. It took the principal developer some time to do this; he would later admit that he had to scan Google for sample code. (I already had a suspicion about what was going wrong: the application hadn't been upgraded, but I needed evidence for the military project manager, who trusted the vendors and placed the burden of proof on me. I later discovered an Oracle-supplied migration script used to check whether a candidate application was Application Server-ready. The script's results listed several no-longer-supported function calls in the code.)
The developer was convinced that there was something wrong with the application, but all he knew was that when he pressed an application form button to generate a report, a general error exception flashed at the bottom of the screen. I worked with him to install mileposts (e.g., display "I got here") through the code and finally narrowed the problem down to the call to print the report. The call was incompatible with changed Oracle functionality. The old reports had relied on the use of custom print drivers; Oracle no longer supported custom print drivers but required developers to specify a valid output type in their function calls (e.g., rtf or pdf). The military project manager was in a state of denial: she didn't want pdf files. The administrative assistants had been trained to adjust Microsoft Word for handling the custom-driver outputs, and the use of pdf files constituted, in her judgment, a material breach of her overriding project requirement that the upgraded system be consistent with existing operations.
(Remember the dictum I mentioned in my introductory post? "It's not a bug, it's a feature." In the military mindset, eliminating the assistants' busy work of adjusting a word processing document to accommodate application output was not a benefit: it was a problem. The military project manager was more upset about the revised Oracle functionality than she was by the facts that her prime vendor had lied about upgrading the application and had failed to advise her what the upgrade meant in practical terms. In front of the defense conglomerate's employees, she lashed out at me for "incompetence" in failing to accommodate her prime objective: the fact that I'm a former senior principal with Oracle Consulting didn't impress her. "This is not your area of expertise; I told your employers I wanted them to subcontract the task out to Oracle Consulting." I calmly explained that neither I nor any other Oracle Consulting consultant had any direct sway over Oracle product decisions: they weren't going to change their desupport of custom print drivers at my request. "We're the US military!" the project manager snapped back at me. "Oracle will do what we tell them to do." I then pointed out that I had been tasked with installing a functional Oracle Application Server, which I had done. It was the developer vendor tasked with delivering the compatible application system. "No! I DECIDE what your tasks are and when your assignment is over, and you will continue to work with the other vendor.")
Usable Feedback Enables Efficient, Effective, Satisfying Problem Resolution
When IT professionals troubleshoot a problem, they want to identify the problem type as quickly as possible: is it a hardware problem? Software bug? Networking issue? Data problem? Server resource issue (e.g., insufficient storage, computer jobs competing for memory or users trying to access and update the same database record)? User error? If they are running a diagnostic test, what conditions does it test? Do I have to scan through a multi-page document to identify any flagged exceptions, or am I presented with a short list of exceptions? Do I have to use external sources (e.g., an error code cross reference) to understand the nature of the error, or is the error message self-contained?
How context-specific is the error message? For example, if I'm dealing with a database object problem, is it due to a lack of allocated datafile resources for the database object to expand? Is it a user access or storage quota issue? Does it involve object definition constraints (e.g., the table may be able to expand only a specified finite number of times)? Are user record inserts or updates consistent with table and field datatype definitions? Is there a physical problem with the relevant datafiles underlying the tablespace containing the object?
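To make the idea of a self-contained, context-specific message concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the `ProblemType` categories simply mirror the questions above, and `DiagnosticError` and `extend_table` are made-up names, not part of any Oracle or vendor API.

```python
from enum import Enum


class ProblemType(Enum):
    """Hypothetical problem categories, mirroring the troubleshooting questions."""
    HARDWARE = "hardware"
    SOFTWARE = "software"
    NETWORK = "network"
    DATA = "data"
    RESOURCE = "server resource"
    USER = "user error"


class DiagnosticError(Exception):
    """Hypothetical error that states what failed, where, and what to try next."""

    def __init__(self, problem_type, what, where, suggestion):
        self.problem_type = problem_type
        self.what = what
        self.where = where
        self.suggestion = suggestion
        super().__init__(self.describe())

    def describe(self):
        # One line that needs no external cross-reference to understand.
        return (f"[{self.problem_type.value}] {self.what} "
                f"(object: {self.where}). Suggested action: {self.suggestion}")


def extend_table(table, free_mb, needed_mb):
    """Toy stand-in for an operation that can run out of datafile space."""
    if needed_mb > free_mb:
        raise DiagnosticError(
            ProblemType.RESOURCE,
            f"cannot allocate {needed_mb} MB; only {free_mb} MB free",
            table,
            "add a datafile or enlarge the tablespace",
        )
    return free_mb - needed_mb
```

A caller who catches `DiagnosticError` gets the problem type, the affected object and a suggested next step in one place, rather than a bare numeric code.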
False Confirmations
Error resolution is often difficult because of poorly designed error trapping in the underlying software; we typically uncover these flaws only because of some external evidence of system malfunction. Take, for instance, the standard Unix exit code of 0 (i.e., the relevant command or program was successful). The programmer's expectation is that any malfunctioning subordinate program call returns an error code, which in turn is passed along to the higher-level return code. (Programmers prefer marker return codes, which lead them more efficiently to the source of the bug and hence to problem resolution; otherwise, they have to white-box the software in question or broaden the scope of their investigation.)
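The expectation about return codes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the step names and the child commands are invented, and each "subordinate program" is just a tiny Python process that exits with a known code.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each command in order; return the first non-zero exit code, else 0."""
    for name, cmd in steps:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Promote the child's failure instead of swallowing it.
            print(f"step '{name}' failed with exit code {result.returncode}",
                  file=sys.stderr)
            return result.returncode
    return 0


# Hypothetical steps: the middle one fails with a marker code of 3.
steps = [
    ("extract", [sys.executable, "-c", "raise SystemExit(0)"]),
    ("load", [sys.executable, "-c", "raise SystemExit(3)"]),
    ("report", [sys.executable, "-c", "raise SystemExit(0)"]),
]
status = run_steps(steps)  # 3, not 0: the failure reaches the top level
```

A wrapper that ignored `result.returncode` would finish with 0 and report success, which is exactly the kind of false confirmation discussed here.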
Let me give a few real-world examples I classify as false confirmations.
False Confirmation Example #1
In my first professional assignment as a DBA, I was a federal contractor administering a database and related laboratory information management system (LIMS) in the Chicago region. Another vendor maintained the general application, whose features were decided by supermajority votes of the agency's regional officers. (Regions could decide to maintain site-specific functionality on top of the general system; for instance, the Chicago region had a lab book functionality not in the general system.) Because of division of labor, the actual tape backups were maintained by network personnel. (For example, at many sites I'll be given a reusable amount of SAN space to park a copy of "cold" (not in use) database and/or application software tree files, and the network group will schedule a backup of the relevant SAN storage after the backup session is complete and before my next backup.)
I was very concerned about the network group's backup procedures for obvious reasons. I was repeatedly assured that they constantly monitored the backup jobs for errors; I specifically asked about any periodic tests they made of the backups, i.e., test restore procedures. They refused to answer that question, no doubt considering it none of my business.
The LIMS vendor's tech representative got in touch with me to prepare for an upgrade of the application software, including all necessary backups as a precaution. For whatever technical reasons, there were problems with the upgrade, and the vendor representative asked me to restore the backups from tape. The network group discovered and sheepishly admitted that all the tapes generated over the past several months were unusable: they had relied solely on the defective tape software's report indicating a normal completion. (I ended up having to rebuild the database from scratch, but that story is beyond the scope of this post.)
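The test restore I had asked about can be illustrated with a small Python sketch. It is not what the network group ran: it uses a zip archive in place of tape, the function names are made up, and it assumes the source files are "cold" (unchanged) between backup and verification.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def checksums(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }


def backup(source_dir, archive_path):
    """Write every file under source_dir into a zip archive."""
    source_dir = Path(source_dir)
    with zipfile.ZipFile(archive_path, "w") as zf:
        for p in sorted(source_dir.rglob("*")):
            if p.is_file():
                zf.write(p, str(p.relative_to(source_dir)))


def verify_by_restore(source_dir, archive_path):
    """Restore to a scratch directory and compare the result with the source."""
    with tempfile.TemporaryDirectory() as scratch:
        try:
            with zipfile.ZipFile(archive_path) as zf:
                zf.extractall(scratch)
        except zipfile.BadZipFile:
            return False  # the archive cannot even be opened
        return checksums(scratch) == checksums(source_dir)
```

The point of the design is that `verify_by_restore` trusts nothing the backup step reported; it only trusts what actually comes back out of the archive.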
False Confirmation Example #2
A second example involves a data warehouse at a government agency that specializes in intellectual property issues. I was one of the two production contractor DBAs managing the warehouse. A recent federal audit required a certain naming standard in specifying database connections (a data warehouse typically stores processed feeds from one or more production databases, delivered over the network or as transferred flat files). My colleague was in charge of testing and then implementing the required changes. A few days after the changes were implemented, a client functional analyst discovered that key records from one of the source databases weren't showing up in the data warehouse. My colleague was in a state of denial, and I asked if I could troubleshoot the issue. Fortunately, a number of relevant SQL or PL/SQL scripts generated spool files in relevant working directories, some two or three levels deep (although these were overwritten during nightly processing and loads to the data warehouse).
I discovered that a couple of scripts with database links had been overlooked and were now in conflict with the new naming setting. The code, however, did not trap the relevant error code and promote it to the top level of reporting, which indicated normal completion. The top-level report simply reflected the missing error trap logic: it correctly reported what it was designed to report.
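One way to close that kind of gap is to have the top-level job inspect the nested spool files itself. The Python sketch below is hypothetical and is not the warehouse's actual code: it assumes the spool files end in `.log` and that Oracle errors appear in them in the usual five-digit `ORA-nnnnn` form.

```python
import re
from pathlib import Path

# Assumption: errors are spooled in the five-digit ORA-nnnnn form.
ORA_ERROR = re.compile(r"\bORA-\d{5}\b")


def find_errors(work_dir):
    """Return (file, line number, text) for every ORA- error under work_dir."""
    hits = []
    for path in sorted(Path(work_dir).rglob("*.log")):
        lines = path.read_text(errors="replace").splitlines()
        for number, line in enumerate(lines, 1):
            if ORA_ERROR.search(line):
                hits.append((str(path), number, line.strip()))
    return hits


def top_level_status(work_dir):
    """Return 0 only if no nested spool file recorded an error; 1 otherwise."""
    return 1 if find_errors(work_dir) else 0
```

Because `rglob` walks every subdirectory, an error spooled two or three levels deep still changes the status that the top-level report sees.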
Obscure or Misleading Error Codes/Messages Impede Problem Resolution
Often an error message raises more questions than it answers. One of McCracken's readers commented about the notorious German messages in the leading ERP suite, SAP R/1; I served as a SAP Basis administrator (an application administrator/DBA) back in 1996 (we were running R/3) and encountered more than my fair share of cryptic long German words in error messages.
Misleading Error Messages Example #1
I remember there was a Baltimore-based subsidiary of a British-based organization that operated a number of tax-free shops at airports. This company had licensed Oracle's suite competing with SAP's offerings, originally called Oracle Applications (Financials, etc.), now E-Business Suite. Most of the software users, principally accountants, were using character mode (a variation of the Lotus 1-2-3/MS-DOS style interface) versus the newer GUI interface, required for the more recent Oracle HR module. Using Oracle's new, soon-to-be mandated GUI interface required implementing a second driver patch.
(The accountants, fed false rumors that the new drivers "broke" the functionality of Oracle Financials, were in open conflict with the HR department; I had to seriously jawbone the female VP of information technology, who had decided that she wanted to implement a perverse Solomonic solution of giving the departments their own ERP systems. Being a great DBA automatically paints a bullseye on one's back. The full story is beyond the scope of this post.)
I was there to replace another contractor DBA who had been tasked as the project DBA on a test database copy for implementing the HR module. On my first day, I was brought up to meet the contractor production DBA (actually a developer who had been promoted to a DBA role), who happened to be running the driver patch. As I introduced myself, I scanned what he was working on: the driver patch had failed with an error message I'll paraphrase as follows: "You have just encountered a major problem with the patch. Do you still want to proceed with the patch (Y/N)?" Incredibly, the DBA indicated "Y" and was about to hit "enter" when I yelled for him to stop. I had never seen that kind of error running the patch, and I had implemented the patch anywhere from six to a dozen times in the past.
I went over the available information and found a bizarre compilation error associated with a database object. It didn't make sense: the driver ran a standard object script. I double-checked the environment variables; everything seemed to be set up correctly, including the Western European character set necessary for the database (versus Oracle's default US7ASCII). Oracle's Metalink (technical support website) didn't provide a clue. It suddenly hit me--was the database really built using the WE8ISO8859P1 character set? One has to specify an optional character set in the CREATE DATABASE command; otherwise, it defaults to US7ASCII. I verified via a relevant query against the system tables that the database was really using US7ASCII.
The test database was a clone of the production database, which meant that the client was running an Oracle-unsupported configuration; this is serious business. Furthermore, Oracle did not support directly changing the database character set and specifically stated that the only supportable solution was for the customer to export the database and import it into a new database created with the correct character set. (I had to educate the VP of IT.) It turned out that the company had earlier hired a hotshot Oracle DBA WITHOUT ORACLE APPS EXPERIENCE to migrate the database between servers, and he did so by rebuilding the database on the target server. I had softcopies of his scripts and outputs--including the CREATE DATABASE script without a character set specification.
Misleading Error Messages Example #2
A second example was an Oracle Apps upgrade project I did for a suburban county to the west of Milwaukee in the summer/fall of 2001. I got notified of an unusual error during an Apps update to a database table that had nothing to do with database resources. I had to contact the county Unix system administrator (which was political because it was past 3 PM, and the county IT manager did not like his civil servants being contacted after their assigned hours). I had a gut feeling that there was a problem with the RAID device hosting the database, but the Unix administrator was in a state of denial. I was obligated to stay on site through 7 PM (I had to be there at 7:30 AM, which meant I had to get up earlier for the 83-mile drive to the courthouse).
To my surprise, the Unix admin showed up on site at or just before 7 PM without additional explanation and had decided to reboot the server. When the server came up, all the RAID files went to "lost and found" (lost+found, the directory where the Unix file system check parks orphaned files it recovers). In other words, the files were unusable. The county had a contractual obligation to back up the test server, and it turned out they ran an obsolete static database backup script and didn't back up the application software at all. Through some miracle I was able to fish what had to be the missing 2GB datafile out of lost and found and was able to bring up the databases after the county DBAs restored the other datafiles from backup. There was an important test of the database scheduled for Monday morning, which meant I spent over two man-days all weekend UNCOMPENSATED (the county claimed that their flat-bid contract was all-inclusive and crap happens: I guess there was no recourse when the county didn't live up to their contractual obligations, but I'm not a lawyer), reinstalling Oracle EBS, including tedious hours of megapatches and one-off patches. It was trickier than it sounds because the patches update up to three levels of technology and I couldn't rerun the database portion of the patches. It was the first time I ever had to do this task because most professional IT organizations routinely back up an ENTIRE server, not just a few database files...
Misleading Error Messages Example #3
A third and final example involved a federal entitlement-related production database. The vendor's Russian immigrant developer-turned-DBA had run out of disk space on a production server; suffice it to say it turned out the test server had more disk storage than the production server, and the DBA migrated, in a nonstandard way, the database between servers. I was contacted by the vendor, which had hired me on an earlier gig to diagnose and implement any necessary changes to make their databases audit-compliant. After the DBA's migration of the database, the IT manager discovered that a critical morning status report was terminating abnormally. The vendor's developers were convinced it had something to do with XML processing. I could replicate the source of the problem--I could generate an ORA-600, an internal generic error code, at will by running a simple query, something I had never seen before in 18 years as a professional DBA. Oracle Metalink was of no help, and Oracle Tech Support was absolutely clueless, saying they had no prior record of the problem (there was a very long obscure code that accompanied the ORA-600).
The problem did not occur with the test system, and the objects by all appearances seemed to be the same. At one point, I had a gut feeling and decided to manually rebuild the external specification of a relevant table function (which took just a few seconds)--and the problem was solved. Somehow the Russian DBA's nonstandard migration had resulted in at least one corrupt database object. (I had specifically warned the IT manager during the first engagement that the DBA's cloning process was unacceptable.)
The Windows 7 Error Code 766f6c756d652e63 3f1
What would you think if you saw this error code booting up your Windows 7 PC? How does it comply with the concepts I've outlined in this post? I encountered this several times, but let me quote another user who encountered the same problem:
Hi guys. I'm using a fairly new laptop, had it for less then a yr. It came w/ Vista and I never had to use Chkdsk on it. I used the Windows 7 upgrade disc about 3 weeks and upgraded to Windows 7 w/ no issues. Love it so far. Today my computer froze while shutting down. I had to hold the power button to turn it off. I then wanted to chkdsk, but it will not run suppostedly because "due to recent software installation" and it gives me error 766f6c756d652e63 3f1. Anyone experience this? Pls help.
My response is, at the time of this post, at the end of a related thread; the reason I'm quoting this problem statement is that he specifically mentions the "helpful" system message "due to a recent software installation". Can you narrow that down? What the user omits here is that Windows also "helpfully" notes that you can go to a system restore point before the unspecified recent software installation.
I inferred that the software in question was probably security-related. I was using at the time a very well-known security software product which came bundled with my cable Internet service. As I suspected, chkdsk could run if and when I uninstalled the security product and rebooted my PC with a scheduled chkdsk. (By the way, the security software had been installed some time earlier, and several products had been installed in the interim.) Subsequent interactions with the security software's technical support group proved unproductive; I was allegedly the first person who had ever reported this problem. (I don't recall whether I subsequently tested any available option for suspending or disabling the security software in question, but uninstalling and reinstalling security software is not a feasible routine solution, and I had done so only because of the context of the error message; the tech support representative was more interested in trying to capture the event than in suggesting a usable alternative to uninstallation of the software.)
I describe a two-stage workaround for running the occasional chkdsk in the above-cited post: scheduling a chkdsk to run after PC restart and, just before restart, suspending one's security software product (i.e., antivirus/firewall). These are the kinds of things Apple loves to parody in its ads; let's hope that Windows product engineers take notice...