At my current client, we are working on a major rewrite of their Claims Management System, which has included moving the old data from an Oracle database to a new SQL database being designed by our team. Part of this conversion involved a particular challenge. We were tasked with extracting all of the documents stored in the Oracle DB as BLOBs (HUGEBLOBs to be exact) and loading them out to the file system. The documents are going to be stored in the new SQL database in a “File Table”. A File Table is essentially an index to these files, while the files themselves are not physically stored in the database, they are stored out in the file system. The File Table contains certain fields including a unique stream_id, a path_locator that indicates where the document is, and other pertinent information about it, including the file_stream encoding, file name, file type, file size and other attributes. There is no ability to add additional fields to the built-in SQL File Table structure for storing other key information.
There are 1.2 million documents that have to be extracted, saved to the file system, and linked to the appropriate claimant’s record in the new database. Without being able to attach any other fields to these documents while they are extracted to the file system to link the document to the appropriate client record, that is challenge #1. Challenge # 2 is getting the documents out en masse (certainly we couldn’t save them out one by one using the existing front end application). Challenge #3 comes later, and I’ll get to that.
So I did some searching and found a built-in Oracle function called UTL_FILE that extracts the documents from the Oracle BLOB table. I tweaked the function into a script that contains a cursor to select the documents in chunks at a time, then loops for each document, renames the document, and runs UTL_FILE to save it out—looping each file into further chunks when they are too big for one write (the limit is 32k bytes at a time). This script then writes them out to the file system on the network. This solved Challenge #2 which was to bulk extract all of these documents, although due to the time cost of the script, they still will need to be done in smaller batches at a time.
Renaming the documents as they came out of the system solved Challenge #1, where these documents needed to somehow be linked to the original claimant’s record. I appended the original Claimant ID number to the front of the document name, separated by “—“ so it would be easy to use SUBSTRING to get the ID number out later for loading into the table that links the File Table records (by the stream_id) to the Claimant records.
Challenge #3 came when trying to open some of the documents after they were exported. They were corrupted—it was approximately 10% of the documents. After a lot of head scratching and looking for similar patterns in these documents to explain why they were corrupted yielded no clear answer. A suggestion from my manager led me to explore the code from the old application. I looked at the code that uploaded documents into to the Oracle table, and the code that opened them and allowed the users to export and save them out to the file system. Therein lied the answer. It appeared that some of the documents were being compressed before being uploaded to the Oracle database. There really seemed no business rule for which were to be compressed, and there was no indicator in the database as to if they were compressed or not. Therefore, I altered my UTL_FILE script to uncompress all of the files before saving them out. Unfortunately, if a file was NOT compressed, it would throw an error. So, I again altered my script to catch the error, and NOT uncompress those documents. Voila, the script worked like a charm. Here it is in all its glory, and our customer is happy that we can get all their BLOBs out!