Alexander H. Black

MIT Agelab

Automating data collection at MIT.

My Role

Automation Engineer overseeing automated data collection and data visualization tools

Tools Used

Python, ELK Stack, PyPDF, Google Tesseract, NLTK

Project Duration

September 2015 - November 2015

Summary

  • Contracted to solve workflow problems and provide visualization tools for data analysis and presentation
  • Implemented data collection automation from various sources, designed the primary data visualization platform, and reduced researchers' manual work by roughly three to four hours per day

Description

Agelab was doing much of its research manually, with very little, if any, automation. This cost a lot of time in data collection, left less time for analysis, and left some data subject to human judgment, significantly lengthening projects and potentially muddying the data. Additionally, their visualizations were done entirely in Excel and limited by the researchers’ familiarity with its advanced features.

Background

While I was working at Tinder, a friend approached me about his new research job at MIT Agelab, a research lab focused on statistics related to people over the age of 55. Based on what he told me about their needs, I told him a lot of that work could be automated. He then connected me to the Director of Agelab, who wanted me to not only automate many of the manual tasks, but also provide visualization tools for their data to be used in projects and publications.

The biggest project they had at the time was collecting PDF copies of new editions of major and local newspapers and saving the headlines in Excel. They would also catalog the tone of the headline and article, any political bias the article might have, how relevant it was to older and ethnic demographics, and whether the subject or the article itself was being tweeted on the day of publication by personalities those demographics followed. This task could take 3-4 hours on some days, and monitoring the social networks required numerous checks every day, pulling researchers away from their other work.

They were also manually reviewing stocks, index funds, and other market information these demographics were believed to follow, logging prices at various points of the day. This was another task that pulled people away from their regular work. A better visualization solution was also requested, as they were only able to use Excel. This limited their visualizations to Excel’s feature set and the skill of the individual researcher, and often produced charts that were telling but unattractive.

Solving the Problem

Publication Review

All aspects of the publication review were easily automatable. The harder part was choosing an implementation that would be easy to find someone to maintain after my contract ended. We elected to use Python: not only did it have a large enough user base that they could likely hire a grad student to maintain the code, it was also approachable enough that they could reasonably train a researcher to use it.

Scanning publication PDFs presented three issues we had to solve:

  1. Extracting content was inconsistent. Some PDFs were properly formatted and we could easily extract the text, but others were just raw scans, so we couldn't run a plain text analysis on them and needed a different method of extracting the content (a sketch of this fallback follows the list).

    • We decided to use Google Tesseract for OCR because it was backed by Google and, after testing, the results fit our needs.
  2. For the actual text analysis, we needed something that could do sentiment and bias analysis without manual training.

    • We used NLTK because it was open-source, easily approachable, and had existing models we could base ours on, saving a lot of time on problem-solving.
  3. How could we reasonably detect whether personalities were tweeting about the subject, or linking to articles from different sources about the same topic?

    • Ultimately, we passed on this, as it required a more sophisticated model than the contract timeline allowed for. (As of 2016, there are a few FOSS libraries with which this could be done in a week or two.) We did build a proof of concept with our own NLTK model, but it only reached a 45% success rate, and it needed to be significantly higher to be usable in research. Researchers continued to do this manually.
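
A minimal sketch of the extraction fallback from item 1, assuming the pypdf package (the tool list names PyPDF) for native text extraction and pdf2image plus pytesseract as the bridge to Tesseract; the exact packages and file handling used at Agelab aren't recorded here.

```python
# Sketch of the PDF extraction fallback: try native text extraction first,
# and only OCR documents that come back empty. Package choices (pypdf,
# pdf2image, pytesseract) are assumptions standing in for the original code.
from pypdf import PdfReader
from pdf2image import convert_from_path
import pytesseract


def extract_text(pdf_path: str) -> str:
    """Return the text content of a PDF, falling back to Tesseract OCR."""
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    if text.strip():
        return text  # properly formatted PDF; native extraction was enough

    # Raw scans have no text layer, so rasterize each page and OCR it.
    images = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(image) for image in images)
```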

Once we made decisions on how this should be implemented, we set up a pipeline that collected and passed data in the following manner:

  1. At a set time every morning, the process would start.
  2. Content was extracted from PDFs via the native text layer. If no text was returned, the PDF was passed to the OCR function.
  3. Text was then passed to NLTK for sentiment and bias analysis.
  4. Keyword detection looked for matches that gave a probability of relevance to certain demographics. Researchers would manually review these results after the pipeline ran to confirm the likelihood.
  5. Results were saved to a spreadsheet on a local server that researchers could access and work with.
  6. This pipeline ultimately removed 2-3 hours of manual work every day.
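
Put together, the morning run looked roughly like the sketch below, a minimal version assuming NLTK's bundled VADER sentiment model and a placeholder keyword list; the lab's actual models, bias scoring, and curated demographic keyword sets aren't reproduced here.

```python
# Rough shape of the morning run (steps 2-5 above). The VADER analyzer and
# the keyword list are illustrative assumptions, not the lab's actual models.
# Requires: nltk.download("vader_lexicon")
import csv
import glob

from nltk.sentiment.vader import SentimentIntensityAnalyzer

from extract import extract_text  # hypothetical module wrapping the fallback extractor sketched earlier

# Placeholder keywords; the real lists were curated by researchers.
DEMOGRAPHIC_KEYWORDS = {"retirement", "medicare", "social security", "pension"}


def relevance_score(text: str) -> float:
    """Fraction of demographic keywords appearing in the article text."""
    lowered = text.lower()
    hits = sum(1 for keyword in DEMOGRAPHIC_KEYWORDS if keyword in lowered)
    return hits / len(DEMOGRAPHIC_KEYWORDS)


def run_pipeline(pdf_dir: str, out_path: str) -> None:
    analyzer = SentimentIntensityAnalyzer()
    with open(out_path, "a", newline="") as handle:
        writer = csv.writer(handle)
        for pdf_path in sorted(glob.glob(f"{pdf_dir}/*.pdf")):
            text = extract_text(pdf_path)               # step 2: text, else OCR
            sentiment = analyzer.polarity_scores(text)  # step 3: sentiment
            writer.writerow([                           # step 5: save for researchers
                pdf_path,
                sentiment["compound"],
                relevance_score(text),                  # step 4: keyword relevance
            ])
```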

Stock Market Review

Stock market review was a much simpler task: not only were there numerous libraries that made it easy, but we didn't require any analysis. We selected a library we felt was beginner-friendly and ran it on a cron job five times a day, from market open to market close. Results were saved to a spreadsheet on a local server that researchers could access and work with.
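
A minimal sketch of the price logger, using the yfinance package as a modern stand-in for whichever beginner-friendly quote library was actually chosen; the crontab schedule in the comment and the ticker list are illustrative assumptions.

```python
# Stand-in sketch of the price logger, run by cron roughly five times between
# market open and close, e.g. (illustrative paths and times, not the originals):
#   30 9,11,13,15 * * 1-5  /usr/bin/python /opt/agelab/log_prices.py
#   0  16         * * 1-5  /usr/bin/python /opt/agelab/log_prices.py
# yfinance and the ticker list below are assumptions, not the original choices.
import csv
from datetime import datetime

import yfinance as yf

TICKERS = ["SPY", "VTI", "AAPL"]  # placeholder watch list


def log_prices(out_path: str = "prices.csv") -> None:
    """Append a timestamped last price for each tracked symbol."""
    with open(out_path, "a", newline="") as handle:
        writer = csv.writer(handle)
        for symbol in TICKERS:
            price = yf.Ticker(symbol).history(period="1d")["Close"].iloc[-1]
            writer.writerow([datetime.now().isoformat(), symbol, round(float(price), 2)])


if __name__ == "__main__":
    log_prices()
```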

Researchers were able to save up to an hour every day, and no longer had to constantly switch contexts for tasks like this.

Visualization

One of the biggest goals for a new visualization system was ease of use: any researcher should be able to learn it within a day or two. The other goal was a system that could work with any kind of data and even create custom types of visualizations.

There was discussion about building a custom system for this using D3 and other libraries; however, that was too monumental a task for a 3-month contract, and we elected to find an existing solution.

We made an attempt with the ELK Stack, as I was familiar with it and knew how powerful it was in terms of visualization types and extensibility, and how relatively easy it made creating visualizations. There was also a large community of developers creating plugins, which would make it easy to add new features if needed, as well as a community to answer questions.
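
As a rough illustration of how pipeline output would feed Kibana, the sketch below indexes a single headline record into Elasticsearch with the official Python client; the index name and document shape are assumptions, not the prototype's actual mapping.

```python
# Illustrative only: push one pipeline result into Elasticsearch so it can be
# charted in Kibana. The index name and document fields are assumptions.
from datetime import datetime

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {
    "publication": "Example Gazette",       # placeholder values throughout
    "headline": "Retirement savings dip",
    "sentiment": -0.4,
    "relevance": 0.5,
    "published": datetime(2015, 10, 1).isoformat(),
}

es.index(index="headlines", document=doc)
```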

We built a working prototype that was used as a small pilot with researchers. Feedback was positive on accuracy, ease of use, and visual design; however, making the application work fully for the research lab required far more time than a 3-month contract allowed, so the solution was deferred. Researchers continued to use Excel for visualizations.
