Building Social Media Collections

Presentation to Management Council

Brian Dietz & Jason Ronallo

Harvesting and Archiving

Preponderance of Social Media

2.2 Billion Active Social Media Users

30% global usage

http://wearesocial.net/blog/2015/08/global-statshot-august-2015/

In 2015 in the US

65% of adults use at least one social media platform

http://www.pewinternet.org/2015/08/19/the-demographics-of-social-media-users/

Percentage of Adult Internet Users, by Platform

  • 72% Facebook
  • 28% Instagram
  • 23% Twitter

Percentage of Young Adult Internet Users (18-29), by Platform

  • 82% Facebook
  • 55% Instagram
  • 32% Twitter

Researchers are Taking Note

In a meta-analysis of studies using data from Twitter, there were at least seventeen different disciplines represented in 382 studies spread over six years.

Michael Zimmer and Nicholas John Proferes, “A Topology of Twitter Research: Disciplines, Methods, and Ethics,” Aslib Journal of Information Management 66, no. 3 (2014): 250–61.

Twitter Research Data Grants

  • Foodborne Gastrointestinal Illness (US)
  • Disaster Information Analysis (Japan)
  • Cancer Early Detection Campaigns (Netherlands)
  • Modelling Urban Flooding in Jakarta (Australia)

But Why Archive Social Media Data?

Discourse Relevant to Archival Collections

How could we not want to preserve a vast record of everyday life and thoughts from tens of millions of people, however mundane?

Dan Cohen, Digital Ephemera and the Calculus of Importance.

Perceived Value of Social Media Data Among SCRC Researchers

Serious discourse occurs on social media?

45% agreed

22% strongly agreed

(67% combined)

Value in using social media data in research?

34% agreed

37% strongly agreed

(71% combined)

What Does This Content Represent?

Official Records

@NCState

Everyday Experience

#ThinkAndDo

Significant Events

#OurThreeWinners

Greater Representation in the Archival Record

  • Increase diversity of voices in historic record
  • Build more representative collections

Engagement With New Communities & Deeper Engagement With Existing Communities

My #HuntLibrary

  • Crowdsourced storytelling
  • Multiple Access Layers
  • Battles, Voting, Moderation
  • Award-winning

Archival Component

That's pretty legit! Appreciate the props #huntlibrary!

My #HuntLibrary User Study

75% listed contributing to the archive as a main motivator for participating.

“Even Better Than Winning an iPad!”

“New Voices and Fresh Perspectives”

2014-15 LSTA EZ Innovation Grant

  • Administered by the NC State Library
  • Collaboration between SCRC and DLI
  • Guidance from Copyright & Digital Scholarship Center
  • Significant contributions from student assistants

Project Goals

  • Establish groundwork for a social media collecting program at NCSU Libraries
  • Develop free, web-based documentary toolkit
  • Develop open, easily deployable collecting environment

Collecting Program

Not collecting all of Twitter and Instagram!

Historians of the English Civil War are deeply thankful that Humphrey Bartholomew had the presence of mind to save 50,000 pamphlets (once considered throwaway pieces of hack writing) from the seventeenth century and give them to a library at Oxford.

Dan Cohen, Digital Ephemera and the Calculus of Importance.

SCRC Collecting Strengths

Largely focused on NC State History

Identifying Content

  • Targeted accounts
  • Hashtags
  • Keywords
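
One way to picture these three selection criteria is as a simple filter applied to each harvested post. The sketch below is illustrative only: the `Criteria` class, the `match_reasons` function, and the post fields (`author`, `text`) are hypothetical names and are not taken from Lentil or Social Feed Manager. The sample account and hashtags come from examples elsewhere in this presentation; the keyword is made up for illustration.

```python
import re
from dataclasses import dataclass, field

HASHTAG_RE = re.compile(r"#(\w+)")
WORD_RE = re.compile(r"\w+")


def _normalize(values):
    """Lowercase each value and drop a leading '@' or '#'."""
    return {v.lstrip("@#").lower() for v in values}


@dataclass
class Criteria:
    """Hypothetical container for the three kinds of selection criteria."""
    accounts: set = field(default_factory=set)
    hashtags: set = field(default_factory=set)
    keywords: set = field(default_factory=set)

    def __post_init__(self):
        # Normalize once so matching can be case-insensitive.
        self.accounts = _normalize(self.accounts)
        self.hashtags = _normalize(self.hashtags)
        self.keywords = _normalize(self.keywords)


def match_reasons(post, criteria):
    """Return which criteria a post satisfies (an empty list means no match)."""
    reasons = []
    if post["author"].lstrip("@").lower() in criteria.accounts:
        reasons.append("account")
    tags = {t.lower() for t in HASHTAG_RE.findall(post["text"])}
    if tags & criteria.hashtags:
        reasons.append("hashtag")
    # Keyword matching is on whole words; the word inside a hashtag counts too.
    words = {w.lower() for w in WORD_RE.findall(post["text"])}
    if words & criteria.keywords:
        reasons.append("keyword")
    return reasons


# Assumed sample criteria for illustration.
criteria = Criteria(
    accounts={"@NCState"},
    hashtags={"#Packapalooza", "#ThinkAndDo"},
    keywords={"homecoming"},
)
```

Recording the reason for each match, rather than a plain yes or no, would let a collection note why a given post was selected.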

Account-based Twitter Harvests

@NCState

Colleges and Departments

DASA

Student Organizations

And About 460 Other Accounts

Hashtag-based Instagram and Twitter Harvests

NCSU16 - NCSU20

NCStateOnCampus

Packapalooza

Homecoming

Krispy Kreme Challenge

This Data Tells Part of the University's History

Documentary Toolkit

To help other institutions kickstart their own collecting initiatives

  • Environmental scan
  • Research value
  • Legal and ethical analysis
  • Documentation
  • Surveys

Broader Impacts

Contributions to the Profession

Open Source

https://github.com/NCSU-Libraries/

Lentil

Technical Prerequisites

2011-2012 ALA Public Library Funding & Technology Access Study

Technical Requirements for Social Media Archiving Tools

  • Social Feed Manager
    • Python
      • Django
      • requirements.txt
    • PostgreSQL
  • Lentil
    • Ruby
      • Rails
      • gems
    • MySQL
  • Linux: sudo apt-get install git apache2 python-dev python-virtualenv postgresql libxml2-dev libxslt1-dev libpq-dev libapache2-mod-wsgi supervisor

Financial Costs

Along with email, social media will probably provide the main source of information for researchers studying our current time. However, our institution just does not have the resources right now to collect and store the social media of other people or organizations.

NCSU Social Media Archives Toolkit survey of North Carolina Cultural Heritage Organizations

Beyond Open Source

Metrics for Success

Social Media Combine

Virtualized social media harvesting environment

https://github.com/NCSU-Libraries/Social-Media-Combine

Server Virtualization

(not desktop virtualization)

Virtual Machines/Virtual Servers

vagrant

vagrantup.com

Virtual Machine on Your Laptop

Repurposing Virtualization

Peer-to-Peer

Future Plans

  • Continued collecting
  • Access
  • Best practices
  • Outreach and campus partners
  • Challenges of social media archiving
  • Campus collaborators

Thanks!

Brian Dietz bjdietz@ncsu.edu

Jason Ronallo jnronall@ncsu.edu

Bonus Slides!

Associated content?

  • Linked web pages
  • Replies
  • Videos and other media
  • Retweeting account info
  • Engagement metrics

Availability and access

  • What is the "whole" dataset if it is constantly being revised?
  • How do we redistribute unstable data?
  • How can research results be reproduced?