Data Analysis of Streaming Services & Movie Ratings

SI 330 Data Manipulation Final Project. Sophie Loesberg, Ashley Newman, Christopher Gamboa. Access our project here.

Introduction

Motivation

Our project seeks to gain insights into how movies compare on Netflix, Hulu, Amazon Prime, and Disney+. Using IMDb ratings, our group wanted to determine which streaming service held the most and highest rated movies. We were also curious about the rating of movies between years and decades to see which years had the highest movie ratings.

This topic is interesting to study because the way we consume content and stories has evolved immensely over the last decade and beyond. Since the 2008 economic crash, our content consumption has resulted in online streaming services and general content creation instead of the classic “Blockbuster” in-person movie rental. Viewing these trends and observing human behavior provides us context for societal changes.

We predicted that Disney owns more movies and franchises, but Netflix updates more frequently. Prime also has a very large movie list on its site. Ashley’s prediction is Prime has the most movies. Sophie thinks Netflix has the most movies. Chris thinks Prime has the most movies. In regards to IMDB rating, Ashley thinks Netflix will have the highest average IMDB rating and Sophie and Chris think it will be Disney+ highest average IMDB rating.

Collaboration

Work from home!

Data Sources

Our second data set displays the most popular movies sorted by ratings from IMDb. We used web scraping to extract the movie’s name, date of release, and IMDb user score.

We accessed the CSV file from kaggle.com and we used the IMDb website to get the most popular movies on IMDb ranked by rating.

We used formats like dataframes and series to view the data and manipulate it.For our first dataset, the CSV file with different streaming services and related information, we read in the CSV file and then returned the dataframe that that function returns.After that, we did some clean up on the function by changing some of the types for the variables so that they would match the types of the other dataframe. For the second dataset, we did web scraping to grab the necessary data and then took the list acquired from the web scraping to add them as columns to a dataframe.

We thought it would be important to use datetime values for our “Year” variables in each dataframe. Datetime values are a key concept in pandas and this course and we felt that it would be useful to do data manipulation on our dataframes by using the “Year” column. By converting them to datetime, we were able to compare different data points across years in a well formatted way. We also found it important to use rotten tomatoes as a variable to look at as well as IMDb values. The IMDb values only needed to be converted to a float data type, but the rotten tomatoes variable had to be converted to a string which was then changed from a string percent value to a float rating out of 10. This was so that we could compare this with IMDb ratings in both datasets. Lastly, the streaming platforms on the first database were necessary to look at. Each of these columns was an integer value that had a 1 for yes, the movie is on that platform, and a 0 for no, the movie is not on the platform. We used the Titles of movies as a way to merge the databases and connect our two tables so that we could compare values where the movies were in both datasets. This data types was a string value.

We used/retrieved 16,744 records for the dataset with streaming services and 100 records for the data frame of the top movies rated on the IMDb website.

As a whole, our data covers time periods from 1902 to around 2022. Different movies in the datasets have different years they were released and these movies are in this range.

Each of our datasets includes a temporal aspect. They each have “Year” columns that are in datetime format in the dataframes we made. In addition to having each of these datasets in dataframes using python and pandas, we also put both of them into tables in SQL and did some SQL manipulations on them. We did analyses on streaming services and counting the number of movies on each service as well as averages and percentages related to streaming platforms.

Data Processing

Then, we converted the dataframes to SQL tables that we could then manipulate. We renamed the Prime Video and Disney+ column names to help us in our queries. We then created a table to determine the average IMDb rating of the movies on each streaming platform. Disney+ has the highest average ratings for movies on their platform. Next, we calculated the number of movies on each platform for the data where titles are the same in each sql table. Lastly, we calculated the percentage of number of movies on each platform in a new table. Some of our SQL queries are included in the images below.

SQL queries to determine the percentage of IMDb rated movies on each platform
SQL queries to determine the average IMDb movie rating on each streaming service

Statistical Analysis

Visualizations

IMDb ratings for movies by year
Average IMDb rating for each year
Percentage of movies that appeared in both dataframes
Average Rotten Tomatoes score for movies that were rated higher than 60%
Number of movies with a Rotten Tomatoes score higher than 60% on each streaming service

Analysis

Testing

Contribution

Conclusion

UX Design Student, University of Michigan School of Information sophieloesberg.com