So the website transformation has not occurred yet, but I have downloaded WordPress, which is the platform I will move Steady As She Goes to. Before I can move, though, I have to figure out how to back up the TypePad files and how to set up the URL.
In the meantime I have been working on a few projects that I hope will function as a proof of concept for some business ideas that I have. These projects are directly connected with the direction of the new blog – information & location & travel (which really can be considered a combination of information & location).
So while there is no migration or web design going on, I have learned an awful lot (and still have an awful lot to learn) about Regular Expressions (and how much I suck at using them), data extraction programs, data extraction and manipulation between programs, SGML formats, MySQL and phpMyAdmin, web hosting interactions (what do I actually have and what will I need), and I am still learning Google Maps API programming. It's mind-boggling.
So for this project I am cleaning up data in order to geocode it but it isn’t very easy. Here are the steps so far:
1) Downloading Files – This sounds easy until you realize that there are approximately 2,900 of them in various folders – not all in one place. It took many emails just to find the files in the first place on the Library of Congress website, then IM chats with a tech friend to figure out why I couldn't access the FTP site (my browser was timing out – I ended up using WS_FTP). Fortunately, two of the main folders had directory listings, so I could create a batch file and download en masse – but the other two did not – numbingly good time that was.
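The batch-download half of that step could be sketched in a few lines of Python – here's a rough version that scrapes the file links out of a directory-listing page so they can be fetched one at a time instead of by hand. The listing snippet and base URL below are invented stand-ins, not the real Library of Congress paths:

```python
import re

def listing_urls(listing_html, base_url):
    """Pull .sgm file links out of a plain directory-listing page."""
    hrefs = re.findall(r'href="([^"]+\.sgm)"', listing_html, re.IGNORECASE)
    return [base_url.rstrip("/") + "/" + h for h in hrefs]

# Hypothetical listing snippet and base URL:
sample = '<a href="mesn001.sgm">mesn001.sgm</a> <a href="mesn002.sgm">mesn002.sgm</a>'
urls = listing_urls(sample, "http://memory.loc.gov/example/")
```

Each URL in that list could then be handed to wget or a download loop, which is essentially what the batch file did.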
2) Evaluating File Format – These files are "machine readable" according to the Library of Congress. They were written in SGML, which is indeed machine readable but does not have much of an organizational structure. And there are no SGML readers available out there anymore – too bad for us. I used TextPad to look at them. The files were meant to be digital facsimiles of the paper documents, which they are, but that doesn't help me convert them to a database format for serving up over the internet. Fortunately, most of the data I wanted was between specific tags – usually it was much more than what I needed, but it worked. Another bonus is that the bulk of the text to be used was stored between <p> tags and is readable as HTML. I can store this text, tags and all, in a MEDIUMTEXT MySQL table(?) – although I don't know what to call the MySQL equivalent of an Excel cell or, actually, how to get it from TextPad to MySQL.
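(For what it's worth, the MySQL analogue of an Excel cell is a single column's value in one row, and a MEDIUMTEXT column will happily hold the tagged text.) Pulling the <p>…</p> blocks out of a file is the kind of job a short Regular Expression handles well – here's a hedged Python sketch, with an invented sample document standing in for a real .sgm file:

```python
import re

def paragraphs(sgml_text):
    """Grab everything between <p> and </p>, tags included,
    so each block can go straight into a MEDIUMTEXT column."""
    # DOTALL lets . cross line breaks; .*? stops at the nearest </p>
    return re.findall(r'<p>.*?</p>', sgml_text, re.DOTALL | re.IGNORECASE)

# Invented sample text:
doc = "<p>First paragraph.</p>\nother markup\n<p>Second\nparagraph.</p>"
paras = paragraphs(doc)
```

The non-greedy `.*?` is the important part – a greedy `.*` would swallow everything from the first <p> to the last </p> in the file.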
3) Extracting Data from Files – This one is a doozy, and if I knew Regular Expressions and maybe grep or awk (and had the systems to use them) and maybe a little Perl – not sure – it would have been a hell of a lot easier than what I went through – on 418 of the files – not even all of the files. First I decided on a subset of information to represent – if I knew more programmey stuff I would have tried to pull info from all of the files, but not right now – 418 is enough. Next I pored over the .sgm files to find some way to extract data between tags. As it turns out, I could use a text search in a program called DataExtractor to get the actual data and then a similar Regular Expression search in TextPad to get the line numbers (which correspond to the file numbers). It was easy enough to pull specific data from 418 files, but very difficult to know which data corresponded to which file.
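That "which data belongs to which file" problem mostly disappears if the extraction is keyed by filename from the start, rather than matching line numbers up afterwards. A rough Python sketch – the filenames and contents here are invented stand-ins for the real .sgm files:

```python
import re

# Invented samples in place of the real .sgm files
files = {
    "mesn001.sgm": "... 2. Place of Interview <p>Portland, Oregon</p> ...",
    "mesn002.sgm": "... 1. Name of interviewer ...",
}

pattern = re.compile(r'Place of Interview\s*<p>(.*?)</p>', re.DOTALL)

# Keep the filename attached to every result as we go
hits = {}
for name, text in files.items():
    m = pattern.search(text)
    hits[name] = m.group(1).strip() if m else None
```

Files with no match get an explicit None, so the gaps that need a manual integrity check are visible immediately.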
4) Cleaning up the Data – This is the very painful part. I have to actually compare what was pulled to the complete file. For example, I will pull everything between "<p>2. Place of Interview*</p>", which may give me any one of the following results: the search string and the actual address of the interview; the search string only, because the address was on the next line with its own set of <p> tags; nothing, because there was a different number of spaces between "2." and "Place"; nothing, because it was "1." not "2."; and a handful of other problems. Multiply this by 3 data sets to pull. I got it down to 4 different pulls per set and a final integrity check just to fill in the blanks – spots that received nothing but I know had data. And this isn't even cleaning yet – this is just filling in the blanks. Cleaning will be a painstaking review of each entry, splitting it up into different fields so that I will be able to geocode the addresses.
5) Geocoding – This, when I get to it, will be interesting because many of these places probably don't exist anymore. Such-and-Such Nursing Home, Portland, OR – I can bet that place doesn't exist, so I will have to ferret out a location for it. It's just research, but a team of elves would be a good thing here.
6) Making a MySQL Database out of the Text and Excel Data – This is my second big knowledge gap, and the problem actually lies in getting formatted text into a database format.
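One way across that gap is to turn each cleaned row into an INSERT statement with the quotes escaped, then paste or pipe the statements into phpMyAdmin's SQL tab. A hedged Python sketch – the table name, column names, and row values below are all invented:

```python
def sql_escape(s):
    """Escape backslashes and quotes so the text survives the INSERT."""
    return s.replace("\\", "\\\\").replace("'", "''")

def to_insert(table, row):
    """Build one INSERT statement from a dict of column -> value."""
    cols = ", ".join(row.keys())
    vals = ", ".join("'" + sql_escape(v) + "'" for v in row.values())
    return f"INSERT INTO {table} ({cols}) VALUES ({vals});"

# Invented sample row:
row = {"file_no": "418", "place": "Portland, Oregon",
       "body": "<p>It wasn't easy.</p>"}
print(to_insert("interviews", row))
```

MySQL's LOAD DATA INFILE is the other common route for tab-delimited exports from Excel, but generated INSERTs are easier to eyeball before running.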
7) Making the Webpage – The page will most likely involve forms, Google Maps API programming, and JavaScript or Perl scripting, all of which I know something about but not enough.
The solution is to just keep moving forward, keep learning what I can, and hope that I will eventually find a tech partner to make these ideas a business. Anyone interested?
(P.S. I know that I didn’t divulge what I was actually doing but I need to create, at the minimum, a prototype before I go shooting off my mouth to the world in general)