We recently completed a report for the second year of the project. I thought I would share some of the accomplishments from last April to date.
Processing
We processed 1100 boxes of material last year with the help of 8 student workers. The processing mostly involved rearranging material, because in their existing state the materials had no order across the collection. The processing work for this year involved removing the contents of existing boxes of Hall Hoag material and organizing them in new boxes. Due to the size of the collection, a lot of logistical work had to be done in conjunction with the processing. Boxes had to be removed from cold storage and, after their contents were rearranged, placed back into cold storage. Because of this, a procedure had to be worked out with the staff at the Brown University Collections Annex. Special thanks to them!
Processing Last Summer
In addition to the processing of the physical materials, the inventory that existed for all 1655 boxes had to be updated after the folders were relocated. The inventory contained approximately 170,000 folders and throughout the year ~112,000 folder records were updated with new locations.
The folder inventory list also had to be matched with the organizations list that was created. There are more than 35,000 organizations and more than 170,000 folders, and for the work of connecting the EAC-CPF records to the EAD finding aid it will be important to know which folders belong to which organizations. This work is somewhat complicated because the inventory records the same organization under many slightly different names. For example, The Communist Party USA may come up in a variety of ways on the inventory, including: CPUSA Communist Party USA, Communist Party United States of American, or Communist Party USA, The. We worked to group these organizations together intellectually across the collection. This work is not yet complete, but 162,000 folders from the inventory list have been matched to an organization.
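To illustrate the kind of grouping described above, here is a minimal sketch of how name variants could be collapsed to a shared key before a person reviews them. This is not the process we used (our grouping was done intellectually, by hand); the functions `name_key` and `group_variants` and the specific normalization rules are assumptions made up for this example.

```python
import re
from collections import defaultdict


def name_key(name: str) -> str:
    """Reduce an organization name to a rough comparison key (illustrative rules only)."""
    key = name.lower().strip()
    key = re.sub(r",\s*the$", "", key)   # "Communist Party USA, The" -> drop trailing article
    key = re.sub(r"^the\s+", "", key)    # drop leading article
    # Treat spelled-out forms (including the "American" variant) as "usa"
    key = re.sub(r"\bunited states(?: of american?)?\b", "usa", key)
    key = re.sub(r"[^a-z0-9 ]+", " ", key)  # strip punctuation
    return " ".join(key.split())            # collapse whitespace


def group_variants(names):
    """Group inventory names that share the same comparison key."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return dict(groups)


variants = [
    "The Communist Party USA",
    "CPUSA Communist Party USA",
    "Communist Party United States of American",
    "Communist Party USA, The",
]
for key, members in group_variants(variants).items():
    print(key, "<-", members)
```

With these rules, three of the four variants land under the same key, while the one with the leading acronym stays separate. That is the sort of leftover that would still need a person to decide.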
We have also begun the work of organizing a detailed EAD finding aid that combines inventories from Part I and Part II of the Hall Hoag collection. The first step in this process is comparing the organizations list from Part II to Part I and finding where the two parts overlap. A first pass in comparing these two lists has been completed. There are 5872 organizations listed in Part I of the collection, and 4739 of those organizations are also represented in Hall Hoag Part II. That means approximately 80% of the organizations from Part I are also in Part II. Establishing this overlap will make it easier to combine data from both parts and will help us create a finding aid that allows researchers to easily find material in both parts and link it to EAC-CPF records.
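The overlap figure is simple arithmetic, and the comparison itself can be pictured as a set intersection. The sketch below is only an illustration under the assumption that both lists have already been reduced to comparable names; `overlap_stats` is a hypothetical helper, not a tool we used.

```python
def overlap_stats(part1_names, part2_names):
    """Return the names shared by both parts and the share of Part I they represent."""
    part1 = set(part1_names)
    part2 = set(part2_names)
    shared = part1 & part2
    share_of_part1 = len(shared) / len(part1) if part1 else 0.0
    return shared, share_of_part1


# Check the percentage using the counts reported above
part1_total = 5872
also_in_part2 = 4739
percent = round(also_in_part2 / part1_total * 100, 1)
print(f"{percent}% of Part I organizations also appear in Part II")
```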
Data Gathering
Using Open Refine, we were able to run a search and retrieval script for all 35,000 organizations against the VIAF database. This script took a column in Open Refine (organization names), searched each entry against VIAF, and returned positive and negative hits in XML. XML results were returned for each organization, and for the organizations that returned a positive hit a VIAF ID and a VIAF URL were isolated. ~12,000 organizations returned a positive hit. The next step will be determining which of these IDs point to the correct authority record in VIAF. Many of the results are for similar organizations but are not an exact VIAF match. Additionally, many of the organizations are so small that they will not have a VIAF record.
5000 organizations have been assigned categories based on Library of Congress subject headings. This work was done by searching the organization names in a FileMaker Pro database for key words such as “Christian,” “Communist,” “Socialist,” and “Militia,” and assigning appropriate subject headings as categories. With the Hall Hoag collection we were a bit lucky in that many of the organizations have very straightforward and direct names. This is not the case for all organizations, but it made the process of assigning these 5000 categories somewhat straightforward. Many of the vaguely named organizations will be much more difficult.
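The same keyword idea can be sketched in a few lines of code. Our work was done in FileMaker Pro, so this is only an illustration; the `KEYWORD_CATEGORIES` table and the category labels in it are placeholders, not the subject headings we actually assigned.

```python
import re

# Hypothetical keyword-to-category table; the labels are illustrative only
KEYWORD_CATEGORIES = {
    "christian": "Christianity",
    "communist": "Communism",
    "socialist": "Socialism",
    "militia": "Militia movements",
}


def categorize(org_name: str):
    """Return the sorted categories whose keyword appears as a whole word in the name."""
    words = set(re.findall(r"[a-z]+", org_name.lower()))
    return sorted(
        {category for keyword, category in KEYWORD_CATEGORIES.items() if keyword in words}
    )


for name in ["Communist Party USA", "Christian Socialist League", "Committee for Progress"]:
    print(name, "->", categorize(name))
```

A vaguely named organization like the last one comes back with no category at all, which is exactly the group that will need more research.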
We began the process of researching the organizations in the Hall-Hoag collection. We started with virtually no information on each organization, and without gathering some information the EAC-CPF records would be very bare. To gather some basic information on some organizations, we had students do online research throughout the year. There is currently no sure way to gather this information with an automated process, and we felt it was important to gather information manually on at least a portion of the collection. We prioritized the organizations by how much material existed in the Hall Hoag collection, so the organizations with the most material were researched first. The students searched for the following information: VIAF authority file, VIAF ID, location city, location state, start date, end date, URL to biography page, member names, member VIAF authority files, member VIAF ID, member position, member start date, member end date, related collection URL, related collection title, related collection location and notes. A total of 2061 organizations were researched, with varying degrees of results. We have locations for 1715 organizations, dates for 1423 organizations, bio links for 1392 organizations, 1032 related collections, and 1483 members.
Left Wing Groups (Red = Highest Concentration)
The data collected was imported into a custom-made FileMaker Pro database, and an EAC-CPF record was exported for each of these 2061 organizations. This was done as a test to see if valid EAC-CPF could be exported with real data.
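For readers who have not seen EAC-CPF, here is a rough sketch of what a bare-bones record for one organization looks like when built in code. Our export came out of FileMaker Pro, so this is not our method. The function `build_eac_cpf`, the record ID, the organization name, and the dates are made up for illustration, and the output leaves out control elements that the schema requires, so it would not validate as written.

```python
import xml.etree.ElementTree as ET

EAC_NS = "urn:isbn:1-931666-33-4"  # EAC-CPF namespace


def build_eac_cpf(record_id, name, start_year=None, end_year=None):
    """Build a skeletal EAC-CPF record for a corporate body and return it as a string."""
    ET.register_namespace("", EAC_NS)

    def tag(local):
        return f"{{{EAC_NS}}}{local}"

    root = ET.Element(tag("eac-cpf"))
    control = ET.SubElement(root, tag("control"))
    ET.SubElement(control, tag("recordId")).text = record_id

    cpf = ET.SubElement(root, tag("cpfDescription"))
    identity = ET.SubElement(cpf, tag("identity"))
    ET.SubElement(identity, tag("entityType")).text = "corporateBody"
    name_entry = ET.SubElement(identity, tag("nameEntry"))
    ET.SubElement(name_entry, tag("part")).text = name

    # Dates are optional because many organizations have none on record
    if start_year or end_year:
        description = ET.SubElement(cpf, tag("description"))
        exist_dates = ET.SubElement(description, tag("existDates"))
        date_range = ET.SubElement(exist_dates, tag("dateRange"))
        if start_year:
            ET.SubElement(date_range, tag("fromDate"), standardDate=str(start_year)).text = str(start_year)
        if end_year:
            ET.SubElement(date_range, tag("toDate"), standardDate=str(end_year)).text = str(end_year)

    return ET.tostring(root, encoding="unicode")


# Placeholder values, not a real organization from the collection
print(build_eac_cpf("example-0001", "Example Organization", 1950, 1975))
```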
Looking Ahead
Looking ahead to the final year of the grant, in which a web interface to explore the collection will be created, I tested visualization software (Google Fusion Tables, leaflet.js, Voyant, Viewshare, Open Refine) for possible use with EAC-CPF records. Using some of these tools, sample visualizations were created from the data collected on the 2061 organizations that were researched. The most useful visualizations were created with Google Fusion Tables. Of particular note is a map that includes catalog records (with links to outside biographies and collections) and can be filtered by category, dates and locations. This allows users to change the map based on their particular interests and then click the dots on the map that contain catalog records. We have included a report as an attachment that reviews some of the findings using visualizations. You can also use this link to view the visualization live online.
Possibilities
We are exploring more options for automating some of the data gathering. In particular, we are going to try to leverage the data stored in DBpedia. DBpedia is a project that extracts structured data from Wikipedia and makes it available for querying and download. We have found scripts that can be run in Open Refine (formerly Google Refine) to search DBpedia against our list of organization names. This would give us a DBpedia link for organizations that are in Wikipedia. We have not been able to work the kinks out of this script to date, but we have found examples of others using it, so there is a good chance it will work for us as well (http://www.youtube.com/watch?v=XdpzmGxA33U). Additionally, the digital technologies department knows of a way to pull individual fields of data out of DBpedia records using a SPARQL script. If we can get this to work, we should be able to automate the gathering of a great deal of data for EAC-CPF records.
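To give a sense of what a SPARQL lookup against DBpedia might look like, here is a small sketch that builds a query for one organization name. This is not the script mentioned above, which we have not seen working yet. The function names are made up, the query assumes the organization's English label in DBpedia matches our name exactly, and the code only builds the request URL for the public endpoint without sending it.

```python
from urllib.parse import urlencode

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"  # public DBpedia SPARQL endpoint


def build_query(org_name: str, lang: str = "en") -> str:
    """Build a SPARQL query that looks up a resource by its exact label."""
    # Escape backslashes and double quotes so the name is a safe string literal
    escaped = org_name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?org ?abstract WHERE {\n"
        f'  ?org rdfs:label "{escaped}"@{lang} .\n'
        "  OPTIONAL { ?org dbo:abstract ?abstract . "
        f'FILTER (lang(?abstract) = "{lang}") }}\n'
        "}\n"
        "LIMIT 5"
    )


def build_request_url(org_name: str) -> str:
    """Build the GET URL that would run the query and ask for JSON results."""
    params = {
        "query": build_query(org_name),
        "format": "application/sparql-results+json",
    }
    return DBPEDIA_ENDPOINT + "?" + urlencode(params)


print(build_query("Communist Party USA"))
```

Exact label matching would miss most of the name variants in our inventory, so in practice a lookup like this would probably need to be paired with some cleanup of the names first.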
Blog Update
Since the last grant report we have written more than 50 blog posts for the Hall-Hoag blog. We also started to gather analytics on visits to the site. From October 2013 through February 2014, the site had 1027 visits and 2178 page views. On average, the blog gets a little more than 200 visits per month. Additionally, I have been contacted numerous times by people who only came across the Hall Hoag collection through the blog. Thanks for visiting and contacting me!
In general the project has been fun and sometimes challenging. There are many moving parts, and it can be hard to prioritize the work because there is a huge amount of processing to do and, at the same time, a huge amount of research and data entry.