As a deliverable for this grant, we have to create EAC-CPF records for each organization in the collection. The first step in that process is figuring out what organizations are in the collection.
When the materials were shipped to us they were completely unorganized. As part of the process of organizing the materials into new folders and archival, we also created an inventory in Excel of each organization that came out of that shipping box (see image above).
We ended up with (after years of work) inventories of nearly 800 boxes that when combined represented about 180,000 lines in an Excel spread sheet. As you can imagine, many of the organizations are repeated throughout the collection.
We combined the organization name column from all of the inventories (Column D above) into a new spread sheet. Excel can only handle about 65,000 lines and due to the size of our data we had to start with 3 spread sheets. After we pulled all of the organizations out of the original spread sheets, we alphabetized the new spread sheet. By doing this we were able to see which organizations were duplicated throughout the collection.
Next, we had to weed out the duplicates. Rather than going through each line in Excel and deleting the duplicates, we used a few tricks. The first step was using the function in Excel “delete all duplicates.” With this function alone we were able to cut the number from 180,000 down to about 75,000. However, this only deleted the lines that were exactly the same.
When looking through the data, we saw that many duplicates still existed due to misspellings and any additional information added to the name of the organization. Excel could not recognize them as the same organization and therefore could not automatically delete them.
In the example above you can see that all of these lines are from the same organization but the data in parenthesis makes them unique. To clean this up, we used the Excel function “Text to Column” and we were able to separate the text out by deliminating the text by a “(“ symbol. In Other words, Excel was able to take everything that occurred after a ( symbol and separate it out. Once the data inside the () was taken out we were able to run the “Delete All Duplicates” function once again and this cut our list from 75,000 down to about 50,000.
However, there was still a lot of weeding to be done. In the example above you can see how many different spellings and formats the Minute Women of the USA had over the years of inventorying the collection.
There was no common factor that we could use to separate out these organizations. We chose to use Google Refine to help clean up the lists. Google Refine is a tool that can be used in various different ways to help clean up data. As you can see in the screen shot below after uploading the spread sheet into Google Refine, it was able to point out different lines of data that are similar.
Using this tool you are able to select which version of the organization you would like to use and then Google Refine goes through and updates the other versions of that name. This process cut the list down to about 40,000 lines. The next step was to go through each line of the spread sheet manually and deleting the duplicates.
In all, the process took about 60 hours to complete and we ended up with about 35,000 unique organization names. The next step will be taking this list of names and creating an authoritative version for each.