Category: EAC-CPF

Bar Graphs of Collection

I wanted to share a few bar graphs created from the data collected on 1700 organizations in the collection. They are based on state, decade founded, and category.

 

States

Decades

Decade Founded

For Me To Deal With

Category
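For anyone who wants to reproduce counts like these from their own data, here is a minimal Python sketch. It assumes each organization is a dictionary with hypothetical "state", "founded", and "category" keys; the sample rows are made up for illustration and are not taken from the collection.

```python
from collections import Counter


def decade_of(year):
    """Return the first year of the decade a year falls in (1958 -> 1950)."""
    return (year // 10) * 10


def count_by(records, field, transform=None):
    """Count records per value of `field`, skipping blank values."""
    counts = Counter()
    for record in records:
        value = record.get(field)
        if value in (None, ""):
            continue
        counts[transform(value) if transform else value] += 1
    return counts


def text_bars(counts, width=40):
    """Render counts as rows of '#'; the largest count gets `width` marks."""
    if not counts:
        return []
    biggest = max(counts.values())
    lines = []
    for key in sorted(counts):
        bar = "#" * max(1, round(counts[key] * width / biggest))
        lines.append(f"{key!s:>8} | {bar} {counts[key]}")
    return lines


# Made-up sample rows (hypothetical field names and values).
sample = [
    {"state": "RI", "founded": 1958, "category": "A"},
    {"state": "RI", "founded": 1951, "category": "B"},
    {"state": "MA", "founded": 1962, "category": "A"},
]
for line in text_bars(count_by(sample, "founded", decade_of)):
    print(line)
```

The same `count_by` call works for state or category by changing the field name and dropping the decade transform.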

Data Visualization (Mapping)

We have been gathering more and more data on the organizations in the Hall-Hoag Collection throughout the year, and it is starting to show results. We have locations for over 1700 organizations in the collection, and you can see them mapped by clicking the image below. While exploring the map, click on any of the dots to learn more about an organization. Each entry includes a hyperlink to a bio page, start and end dates, and categories.

Hall Hoag Map

Click Map To Explore Data

Hall Hoag Summer Work

Today marks the beginning of the summer work for the Hall-Hoag project. In two weeks we will begin our work at the Library Annex building, but for now we are going to be researching the organizations in the collection. We should be able to get through 1000-1200 organizations in this time. We will be finding the following information on each organization:

  • Organization Authority – we will be using VIAF to find authoritative versions of the organization names.
  • Organization Authority ID
  • Location City
  • Location State
  • Start Date
  • End Date
  • Organization History URL
  • Member Names
  • Member Name VIAF
  • Member Position
  • Member Start Date
  • Member End Date
  • Related Collection Location
  • Related Collection Title
  • Related Collection URL

Since we will have the help of 8 students this summer, I am having them look up each organization the old-fashioned way (Google) and fill in as much as they can find. All of the data they find will be imported into our database and added to the EAC-CPF records for the organizations.
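The field list above maps naturally onto a small data structure. The following Python sketch is only an illustration of that shape; the class and attribute names are hypothetical and do not reflect how our database is actually laid out.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class Member:
    name: str
    viaf_id: Optional[str] = None
    position: Optional[str] = None
    start_date: Optional[str] = None
    end_date: Optional[str] = None


@dataclass
class RelatedCollection:
    location: str
    title: Optional[str] = None
    url: Optional[str] = None


@dataclass
class OrganizationRecord:
    authority_name: str                 # authoritative form of the name
    authority_id: Optional[str] = None  # e.g. an identifier found in VIAF
    city: Optional[str] = None
    state: Optional[str] = None
    start_date: Optional[str] = None
    end_date: Optional[str] = None
    history_url: Optional[str] = None
    members: List[Member] = field(default_factory=list)
    related_collections: List[RelatedCollection] = field(default_factory=list)

    def missing_fields(self):
        """List the fields a researcher has not been able to fill in yet."""
        return [
            f.name
            for f in fields(self)
            if getattr(self, f.name) in (None, "", [])
        ]


# Placeholder values only; not a real record from the collection.
record = OrganizationRecord(authority_name="Example Organization", state="RI")
print(record.missing_fields())
```

A helper like `missing_fields` makes it easy to see how much of each record is still blank after a research pass.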


EAC-CPF Workshop

On Monday I attended the Society of American Archivists’ (SAA) Encoded Archival Context – Corporate Bodies, Persons, and Families (EAC-CPF) workshop, held in the Sheerr Room in Fay House at the Radcliffe Institute for Advanced Study at Harvard University. Katherine M. Wisser, Assistant Professor and Co-Director of the Archives/History Dual Degree Program at the Graduate School of Library and Information Science at Simmons College, taught the workshop. It covered the International Standard Archival Authority Record for Corporate Bodies, Persons and Families, hands-on EAC-CPF work, and an overview of other EAC-CPF projects. EAC-CPF is a standard that can help define the creators of archival collections and expose relationships between creators and other archival collections. For the Hall-Hoag project we will be using EAC-CPF to show relationships between all of the organizations in the collection when we publish the collection online. This workshop really helped clarify how to properly code the identities of the organizations in the Hall-Hoag collection, and it showed the potential of listing the functions of these organizations.

To get an idea of what EAC-CPF can look like in practice, check out these sites:

Trove (The National Library of Australia): http://trove.nla.gov.au/?q=

Archives Portal Europe Network: http://www.apenet.eu/

Fay House, Cambridge

 

EAC-CPF Workshop

Experimental Relations: Using Samuel Johnson to Learn EAC-CPF @ NEA 2013

I attended the Spring 2013 meeting of the New England Archivists on Friday at the College of the Holy Cross in Worcester. I had a great time learning about other projects going on in the area. In particular, I enjoyed a joint presentation by Susan Pyzynski and Melanie Wisner from the Houghton Library at Harvard and Ellen Doon and Michael Rush from the Beinecke Library at Yale. They have recently completed a project called “Connecting the Dots: Using EAC-CPF to Reunite Samuel Johnson and His Circle” in which they created ~80 EAC-CPF records for people related to their archival holdings on Samuel Johnson. It was the first time that I have seen a presentation that walked through the process of creating EAC-CPF records from start to finish.

Much of what they presented is covered in their wiki, which also includes best practices and downloadable sample EAC records. It is a great resource for anyone working with EAC-CPF:

https://wiki.harvard.edu/confluence/display/connectingdots/Connecting+the+Dots:+Using+EAC-CPF+to+Reunite+Samuel+Johnson+and+His+Circle

NEA Spring 2013 at Holy Cross

 

Creating A List of Organization Names

As a deliverable for this grant, we have to create EAC-CPF records for each organization in the collection. The first step in that process is figuring out what organizations are in the collection.

Inventory List Created When Materials Were Organized

When the materials were shipped to us they were completely unorganized. As we organized the materials into new folders and archival boxes, we also created an inventory in Excel of each organization that came out of each shipping box (see image above).

After years of work, we ended up with inventories of nearly 800 boxes that, when combined, represented about 180,000 lines in an Excel spreadsheet. As you can imagine, many of the organizations are repeated throughout the collection.

We combined the organization name column from all of the inventories (Column D above) into a new spreadsheet. Excel can only handle about 65,000 lines per sheet, so with roughly 180,000 lines we had to start with 3 spreadsheets. After we pulled all of the organizations out of the original spreadsheets, we alphabetized the combined list. Alphabetizing let us see which organizations were duplicated throughout the collection.
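The arithmetic behind the three spreadsheets, and the combine-and-alphabetize step, can be sketched in a few lines of Python. This assumes the name columns have already been read into plain lists of strings; 65,536 is the row limit of the legacy .xls format, which is where the "about 65,000" figure comes from. The function names are made up for this example.

```python
import math

XLS_MAX_ROWS = 65_536  # row limit of a legacy .xls worksheet


def sheets_needed(total_rows, max_rows=XLS_MAX_ROWS):
    """How many sheets are required to hold `total_rows` lines."""
    return math.ceil(total_rows / max_rows)


def combine_and_sort(name_columns):
    """Flatten several name columns into one alphabetized list.

    Blank cells are dropped and sorting ignores upper/lower case, so
    repeated names end up next to each other.
    """
    combined = [
        name.strip()
        for column in name_columns
        for name in column
        if name and name.strip()
    ]
    return sorted(combined, key=str.casefold)


def split_into_sheets(rows, max_rows=XLS_MAX_ROWS):
    """Break a long list into chunks that each fit on one sheet."""
    return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]


print(sheets_needed(180_000))  # 180,000 / 65,536 is about 2.75, so 3 sheets
```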

Organization Lists Combined and Alphabetized

Next, we had to weed out the duplicates. Rather than going through each line in Excel and deleting the duplicates by hand, we used a few tricks. The first step was Excel’s “delete all duplicates” function. With this function alone we were able to cut the number from 180,000 down to about 75,000. However, this only deleted the lines that were exactly the same.

When looking through the data, we saw that many duplicates still existed due to misspellings and additional information appended to the name of the organization. Excel could not recognize them as the same organization and therefore could not automatically delete them.
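To see why exact matching only gets you part of the way, here is a small Python sketch of the same idea. The names are placeholders invented for the example, not lines from our inventories.

```python
def remove_exact_duplicates(names):
    """Drop repeats that match character for character, keeping first-seen order."""
    return list(dict.fromkeys(names))


sample = [
    "Example Organization",
    "Example Organization",               # exact repeat: removed
    "Example Organisation",               # spelling variant: kept
    "Example Organization (newsletter)",  # extra note in the name: kept
]
unique = remove_exact_duplicates(sample)
print(len(sample), "->", len(unique))  # prints: 4 -> 3
```

Only the character-for-character repeat disappears; the spelling variant and the line with the extra note both survive as if they were separate organizations.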

Organization List Showing Differences

In the example above you can see that all of these lines are from the same organization, but the data in parentheses makes each line unique. To clean this up, we used the Excel function “Text to Columns” to separate the text, using the “(” symbol as the delimiter. In other words, Excel took everything that occurred after a ( symbol and moved it into a separate column. Once the data inside the () was taken out, we were able to run the “Delete All Duplicates” function once again, and this cut our list from 75,000 down to about 50,000.
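The same split-on-parenthesis trick is easy to express in Python. This is a sketch under the assumption that the note always starts at the first "(" in the line; the function names and sample names are invented for illustration.

```python
def split_at_parenthesis(name):
    """Split a line into (base name, parenthetical note) at the first '('."""
    base, paren, rest = name.partition("(")
    return base.strip(), (paren + rest).strip()


def dedupe_without_notes(names):
    """Group lines by base name, keeping the notes that were split off.

    Returns a dict mapping each base name to the list of notes found
    for it, so nothing recorded in the parentheses is thrown away.
    """
    grouped = {}
    for name in names:
        base, note = split_at_parenthesis(name)
        if not base:
            continue
        notes = grouped.setdefault(base, [])
        if note:
            notes.append(note)
    return grouped


sample = [
    "Example Organization (newsletter)",
    "Example Organization (1958 flyer)",
    "Example Organization",
]
print(list(dedupe_without_notes(sample)))  # one base name remains
```

Keeping the notes alongside the base name mirrors what "Text to Columns" does: the extra information moves to its own column rather than being deleted.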

Organization List Showing Differences in Spelling and Formatting

However, there was still a lot of weeding to be done. In the example above you can see how many different spellings and formats the Minute Women of the USA had over the years of inventorying the collection.

There was no common factor that we could use to separate out these organizations, so we chose Google Refine to help clean up the lists. Google Refine is a tool for cleaning up messy data. As you can see in the screenshot below, after we uploaded the spreadsheet, Google Refine was able to point out lines of data that are similar.

Google Refine

Using this tool, you select which version of the organization name you would like to use, and Google Refine then updates the other versions to match. This process cut the list down to about 40,000 lines. The next step was to go through each line of the spreadsheet manually and delete the remaining duplicates.
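Google Refine offers several clustering methods. As a rough illustration of the general idea, the Python sketch below uses a simple "fingerprint" key (lowercase, punctuation removed, words sorted and de-duplicated) so that names differing only in case, punctuation, or word order land in the same group. It is a simplified stand-in, not a reproduction of the tool, and the name variants are made up for the example.

```python
import string
from collections import Counter, defaultdict

_PUNCTUATION = str.maketrans("", "", string.punctuation)


def fingerprint(name):
    """Reduce a name to a key that ignores case, punctuation and word order."""
    text = name.strip().lower().translate(_PUNCTUATION)
    return " ".join(sorted(set(text.split())))


def cluster(names):
    """Return {fingerprint: [distinct variants]} for keys with 2+ variants."""
    groups = defaultdict(list)
    for name in names:
        variants = groups[fingerprint(name)]
        if name not in variants:
            variants.append(name)
    return {key: v for key, v in groups.items() if len(v) > 1}


def merge(names):
    """Replace every variant with the most frequent spelling in its group."""
    by_key = defaultdict(Counter)
    for name in names:
        by_key[fingerprint(name)][name] += 1
    preferred = {key: c.most_common(1)[0][0] for key, c in by_key.items()}
    return [preferred[fingerprint(name)] for name in names]


# Made-up variants of one name, plus an unrelated placeholder.
sample = [
    "Minute Women of the U.S.A.",
    "minute women of the USA",
    "Minute Women of the USA",
    "Example Organization",
]
for key, variants in cluster(sample).items():
    print(key, "<-", variants)
```

A key like this will not catch true misspellings, which is one reason a manual pass through the list was still needed afterwards.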

In all, the process took about 60 hours to complete and we ended up with about 35,000 unique organization names. The next step will be taking this list of names and creating an authoritative version for each.

EAC-CPF Creation

As part of this project we are going to be creating EAC-CPF records for each organization contained in the collection.

“Encoded Archival Context – Corporate bodies, Persons, and Families (EAC-CPF) primarily addresses the description of individuals, families and corporate bodies that create, preserve, use and are responsible for and/or associated with records in a variety of ways…It supports the linking of information about one agent to other agents to show/discover the relationships amongst record-creating entities, and the linking to descriptions of records and other contextual entities. EAC-CPF is a communication structure for archival contextual information for individuals, corporate bodies and families. It supports the exchange of ISAAR (CPF) compliant authority records.”[1]

EAC is defined as a document type definition (DTD) as well as in an XML Schema and a Relax NG schema. EAC elements reflect the ISAAR(CPF) standard and ISAD(G), two standards managed by the International Council on Archives.[2]

To save time creating an XML file for each of the organizations in the collection, we have created a FileMaker Pro database that stores information on each organization and has been customized to export this data as valid EAC-CPF XML. All of the information will either be entered manually or imported from other online sources (we are still exploring our options here). The FileMaker Pro database will track information about the background of the organizations, members of the organizations, publications of the organizations, and other archives that have information on the organizations. We determined what information to track based on what is most readily available to our staff within the collection. We anticipate that most of the data in the FileMaker Pro database will be entered by hand (meaning not through a script or import), which in itself is extremely time consuming. However, being able to export the data as valid EAC-CPF XML will save us an enormous amount of time.

Below are some screenshots of our FileMaker Pro database. For more information about this work contact Daniel Johnson (Daniel_johnson_1@brown.edu). The database has not been completed and will be worked on throughout the first two years of the grant, but the screenshots below will give a general idea of what will be used to create the EAC-CPF records.


[1] http://eac.staatsbibliothek-berlin.de/about/ts-eac-cpf.html

[2] http://en.wikipedia.org/wiki/Encoded_Archival_Context

Tab 1: Organizational Background/Authority

In the first tab in the database we collect information on the background of the organization and enter information regarding authorized versions of the organization’s name. Included here are start dates, end dates, locations, organizational histories, categories, and subjects. The data in this tab describes what the organization is and provides evidence to help identify it. In the other tabs in the database we establish the relationship of this organization to other entities.

Tab 2: Members

In the 2nd tab we establish relationships between the organization and the people who were its members, by looking through the materials (who wrote the articles, who is mentioned as a member) and doing additional research. We can include what a person’s role in the organization was and when they were involved. This helps create connections between different organizations within the collection (many people are members of multiple organizations) and with outside collections. By using authorized forms of a person’s name (LoC), these names can eventually be cross-referenced against archival holdings within other institutions as well as other collections held by Brown University.
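To illustrate how shared members connect organizations, here is a short Python sketch. It assumes membership data is available as (organization, person) pairs using authorized name forms; the names below are placeholders, not people or organizations from the collection.

```python
from collections import defaultdict
from itertools import combinations


def organizations_by_member(memberships):
    """Map each person to the set of organizations they belonged to."""
    by_person = defaultdict(set)
    for organization, person in memberships:
        by_person[person].add(organization)
    return by_person


def linked_organizations(memberships):
    """Return {(org_a, org_b): {shared people}} for every pair sharing a member."""
    links = defaultdict(set)
    for person, orgs in organizations_by_member(memberships).items():
        for org_a, org_b in combinations(sorted(orgs), 2):
            links[(org_a, org_b)].add(person)
    return dict(links)


# Placeholder data for illustration only.
memberships = [
    ("Organization A", "Person 1"),
    ("Organization B", "Person 1"),
    ("Organization B", "Person 2"),
]
print(linked_organizations(memberships))
```

This only works if the same person is always recorded under the same name, which is why authorized name forms matter so much here.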

Tab 3: Publications

This database is also structured to track the publications in the collection that each organization has issued. Due to the size of the collection (700,000 items), it is unlikely that we will be able to input the titles of all of the publications; however, simply entering the dates of the publications will create useful connections. In many cases the organizations in this collection are obscure and very difficult to research. By looking at when things were published, the major dates of existence for an organization can be established. This also helps establish connections between different organizations. For example, after dates have been entered, researchers can explore questions about the extreme right and left like “what was being published in 1958?” and “who was publishing it?” This makes issues that were topical at a particular time in history easier to explore across a wide variety of organizations.
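Both uses of publication dates described above can be sketched in Python. This assumes the dates are available as (organization, year) pairs; the function names and sample data are hypothetical.

```python
from collections import defaultdict


def active_spans(publications):
    """Estimate each organization's known span from its publication years.

    Returns {organization: (earliest year, latest year)}. The span is only
    a lower bound on how long the organization actually existed.
    """
    years_by_org = defaultdict(list)
    for organization, year in publications:
        if year is not None:
            years_by_org[organization].append(year)
    return {org: (min(ys), max(ys)) for org, ys in years_by_org.items()}


def publishers_in_year(publications, year):
    """Answer 'who was publishing in a given year?' as a sorted list."""
    return sorted({org for org, pub_year in publications if pub_year == year})


# Placeholder data for illustration only.
publications = [
    ("Organization A", 1956),
    ("Organization A", 1958),
    ("Organization B", 1958),
    ("Organization B", 1961),
]
print(publishers_in_year(publications, 1958))
print(active_spans(publications))
```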

Tab 4: Related Archival Collections

This tab will be used to list other archives that have collections created by or about organizations in the Hall-Hoag collection. Although this tab will not be used often (due to the size of the collection and the difficulty of finding collections online), it was developed because in some cases the connections will be obvious and the information would be very valuable for researchers.

Tab 5: EAC-CPF Creation

The final tab in the database is used to create the XML data. When you click the “Create EAC-CPF” button, the text box below it is filled with a valid EAC-CPF file created using the data entered into the database. The text in this field can be copied and pasted into Oxygen to be viewed, edited, and saved. In addition, this database has scripts in place that can create EAC-CPF records for each organization in the database and save them to a specified location on a shared drive at Brown University. This allows us to make wholesale changes to our EAC-CPF code and update each record en masse. A PDF version of some sample data is attached.

American Nazi Party EAC
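For readers without FileMaker Pro, the export idea can be approximated in Python. The sketch below builds only a skeleton of an EAC-CPF record (a record ID, the entity type, one name entry, and existence dates) and writes one file per organization. It is not the output of our database and is not schema-valid as written: a complete record needs further required elements inside control, among other things. The dictionary keys and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

EAC_NS = "urn:isbn:1-931666-33-4"  # EAC-CPF namespace
ET.register_namespace("", EAC_NS)  # serialize without a prefix


def _add(parent, tag, text=None):
    """Append a namespaced child element, optionally with text."""
    node = ET.SubElement(parent, f"{{{EAC_NS}}}{tag}")
    if text is not None:
        node.text = text
    return node


def build_eac_cpf(record):
    """Build a skeleton <eac-cpf> element from a dict (hypothetical keys)."""
    root = ET.Element(f"{{{EAC_NS}}}eac-cpf")
    control = _add(root, "control")
    _add(control, "recordId", record["record_id"])

    cpf = _add(root, "cpfDescription")
    identity = _add(cpf, "identity")
    _add(identity, "entityType", "corporateBody")
    _add(_add(identity, "nameEntry"), "part", record["name"])

    start, end = record.get("start"), record.get("end")
    if start or end:
        dates = _add(_add(_add(cpf, "description"), "existDates"), "dateRange")
        if start:
            _add(dates, "fromDate", start)
        if end:
            _add(dates, "toDate", end)
    return root


def write_all(records, out_dir):
    """Write one XML file per record; returns the list of paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for record in records:
        path = out_dir / f"{record['record_id']}.xml"
        ET.ElementTree(build_eac_cpf(record)).write(
            path, encoding="utf-8", xml_declaration=True
        )
        paths.append(path)
    return paths


# Placeholder record for illustration only.
sample = {"record_id": "example-0001", "name": "Example Organization",
          "start": "1950", "end": "1960"}
print(ET.tostring(build_eac_cpf(sample), encoding="unicode"))
```

Because every file is regenerated from the stored data, a change to the template only has to be made once in `build_eac_cpf` before re-running `write_all`, which is the same benefit the database scripts give us.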
