Exploring Survival on the Titanic with Machine Learning

12 mins

In the early morning of 15 April 1912, a British passenger liner sank in the North Atlantic Ocean after colliding with an iceberg. More than 1,500 passengers died in the sinking, making it one of the deadliest maritime disasters. Since then, the Titanic has become one of the most famous ships in history, her memory kept alive in various forms of pop culture, museums, books and films.

We can use machine learning to explore some interesting questions. How much of a role did a passenger’s socio-economic status play in their chance of survival? Did their name or age make a difference? What about siblings, parents or children? Is one of these factors more significant than the rest? Using decision trees and a random forest model, we can analyze the passenger data from the ship, answer some of these questions and create a classifier that predicts whether a passenger survived the tragedy.
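As a rough sketch of the idea behind a decision tree, the snippet below scores candidate splits by weighted Gini impurity and picks the feature that separates survivors from non-survivors most cleanly. The rows are made-up toy values, not real passenger records, and the helper names (`gini`, `split_impurity`, `best_feature`) are hypothetical; the full post may well use a library implementation instead.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(rows, feature):
    """Weighted Gini impurity after grouping rows by a categorical feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row["survived"])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())


def best_feature(rows, features):
    """Feature whose split leaves the lowest weighted impurity."""
    return min(features, key=lambda f: split_impurity(rows, f))


# Made-up toy rows for illustration only, NOT real passenger data.
toy_rows = [
    {"sex": "female", "pclass": 1, "survived": 1},
    {"sex": "female", "pclass": 3, "survived": 1},
    {"sex": "female", "pclass": 2, "survived": 1},
    {"sex": "male", "pclass": 1, "survived": 1},
    {"sex": "male", "pclass": 3, "survived": 0},
    {"sex": "male", "pclass": 2, "survived": 0},
]

if __name__ == "__main__":
    for name in ("sex", "pclass"):
        print(name, round(split_impurity(toy_rows, name), 3))
    print("best first split:", best_feature(toy_rows, ["sex", "pclass"]))
```

A decision tree repeats this choice recursively on each resulting group, and a random forest averages many such trees built on random subsets of rows and features.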


The Emotional Timeseries of Prose

15 mins

Nearly twenty years ago, Kurt Vonnegut, an American author perhaps best known for his satirical novel Slaughterhouse-Five, gave a lecture that would change the way we think about stories. Standing in front of a blackboard, chalk in hand, he proclaimed, “There’s no reason why the simple shapes of stories can’t be fed into computers; they are beautiful shapes.” He then proceeded to plot a cosine curve and, amid applause and laughter, playfully declared, “People love this story!”

Those Who Tell the Stories Rule the World

The notion Vonnegut explores is an interesting one: can we look at writing quantitatively to understand how it is emotionally structured? When we read, we feel emotionally connected to the writing. We get so ‘lost’ in the fictional world that our own emotions become mapped to the narrative. In fact, narrative transportation theory in psychology studies exactly this. The quantitative meta-analysis by Van Laer, De Ruyter, Visconti and Wetzels on the effects of narrative transportation describes readers who ‘mentally enter(ing) a world that a story evokes’. We feel what we read, and being able to understand how these emotions vary over the course of a story is, I think, an extremely interesting intellectual pursuit.

More importantly, this discussion leads to some interesting questions that we can now address by analyzing large datasets. How does this emotional structure vary over generations of writing, from late 16th century Shakespeare to modern day Pratchett? How do these trends differ between cultures: how similar or different are Indian and Japanese literature in their emotional structure? Do certain authors have an emotional signature, a unique structure to their stories, a formula to their writing? Given an emotional structure, can we predict what kind of story it is, or perhaps even predict its ending?

Many of these questions were inspired by the research of Andrew Reagan and the Computational Story Lab at the University of Vermont, who used sentiment analysis to analyze the ‘emotional arcs’ of 1,700 stories and reveal the most common ones. Their findings are fascinating: according to the research, all stories conform to one of six basic emotional arcs.


As an avid reader, this research really fascinates me. In this multi-part blog series, I will try to understand the concept of an emotional timeseries in a piece of literature, and how it is affected by various factors. In future posts, I will attempt to address some of the more interesting questions that I brought up earlier.
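One simple way to picture an emotional timeseries is to slide a fixed-size window over the words of a text and average a per-word sentiment score inside each window. The sketch below does this with a tiny hand-written lexicon; the lexicon, the window sizes and the function names are all assumptions made for illustration, not the method used by the research above or by the later posts in this series.

```python
import re

# Tiny hypothetical lexicon for illustration only; real analyses rely on
# much larger, empirically rated word lists.
LEXICON = {
    "love": 1.0, "happy": 1.0, "joy": 1.0, "hope": 0.5,
    "sad": -1.0, "death": -1.0, "fear": -1.0, "lost": -0.5,
}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def emotional_timeseries(text, window=4, step=2):
    """Average lexicon score of the words in each sliding window.

    Words missing from the lexicon count as neutral (0.0).
    """
    words = tokenize(text)
    if not words:
        return []
    series = []
    last_start = max(len(words) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = words[start:start + window]
        scores = [LEXICON.get(w, 0.0) for w in chunk]
        series.append(sum(scores) / len(chunk))
    return series


if __name__ == "__main__":
    story = "happy joy love hope sad death fear lost"
    print(emotional_timeseries(story, window=4, step=4))
```

Plotting the returned list against window position gives a crude emotional arc; larger windows smooth the curve, smaller ones make it noisier.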


Sentiment Analysis on Yelp Reviews

9 mins

I was looking for public datasets to explore the other day, and I ran into Yelp’s dataset from the Yelp Dataset Challenge. After poking around, I realized that it was a treasure trove of information about local businesses – around 2.4 GB of data, ranging from details like the location and opening hours of the businesses to user reviews about service and quality of food.


There are five different datasets:

  • yelp_academic_dataset_business contains details about businesses such as opening and closing hours, location, categories, number of reviews and ratings, as well as other attributes ranging from whether it takes reservations to whether it would be considered ‘hipster’.

  • yelp_academic_dataset_checkin contains all the check-in information at businesses.

  • yelp_academic_dataset_review contains the reviews for all the businesses, as well as the number of stars associated with the review. It also contains information on whether the review was rated as ‘funny’, ‘useful’ or ‘cool’.

  • yelp_academic_dataset_tip contains the user provided tips for the businesses, as well as the number of likes the tip received.

  • yelp_academic_dataset_user is the user information dataset. It contains information such as how many votes the user got, the number of reviews the user wrote, as well as other information like friends, average ratings and so on.

That is a lot of data. I’m getting excited just by the possibility of exploring and learning from all this information. I thought I’d start off by doing something relatively simple - a sentiment analysis on Yelp reviews by training a multinomial naive Bayes classifier.
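To make the idea concrete, here is a minimal from-scratch sketch of a multinomial naive Bayes text classifier with add-one (Laplace) smoothing. The example reviews are invented for illustration and are not taken from the Yelp dataset, and the `MultinomialNB` class here is a hypothetical stand-in written for this sketch rather than any particular library's API.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class MultinomialNB:
    """Bag-of-words naive Bayes with additive smoothing (alpha)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    count = self.word_counts[label][word]
                    score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy reviews for illustration only, NOT from the Yelp dataset.
train = [
    ("the food was great and the service was friendly", "positive"),
    ("amazing pizza, great staff", "positive"),
    ("terrible service and cold food", "negative"),
    ("the wait was awful and the food was bland", "negative"),
]

if __name__ == "__main__":
    texts, labels = zip(*train)
    model = MultinomialNB().fit(texts, labels)
    print(model.predict("great friendly staff"))
    print(model.predict("awful cold food"))
```

With real reviews, the star rating attached to each review could serve as the label, for example by treating high-star reviews as positive and low-star reviews as negative; that labeling rule is an assumption here, not something the dataset prescribes.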
