SI 330 Data Manipulation Final Project. Sophie Loesberg, Ashley Newman, Christopher Gamboa. Access our project here.
Movies and streaming services are a huge part of media and entertainment culture. We wanted to choose a topic that was relevant to the world and our lives. Ever since the pandemic, watching movies has become a much bigger part of our lives to pass the time in quarantine in a more meaningful way. We decided to look at movie data and streaming services to make predictions about which services had higher rated movies and to see how years compared to each other. After this project, we now have a better sense of which streaming services have the highest ratings.
Our goal was to determine which streaming service has a movie collection with the highest average IMDb rating and which years had the highest rated movies.
Our project seeks to gain insights into how movies compare on Netflix, Hulu, Amazon Prime, and Disney+. Using IMDb ratings, our group wanted to determine which streaming service held the most and highest rated movies. We were also curious about the rating of movies between years and decades to see which years had the highest movie ratings.
This topic is interesting to study because the way we consume content and stories has evolved immensely over the last decade and beyond. Since the 2008 economic crash, content consumption has shifted toward online streaming services and general content creation and away from the classic “Blockbuster” in-person movie rental. Viewing these trends and observing human behavior gives us context for societal changes.
We predicted that Disney owns more movies and franchises but that Netflix updates its catalog more frequently. Prime Video also has a very large movie list on its site. Ashley and Chris predicted that Prime Video has the most movies, while Sophie predicted Netflix. For IMDb rating, Ashley predicted that Netflix would have the highest average, while Sophie and Chris predicted Disney+.
Due to the COVID-19 pandemic, we collaborated virtually to accomplish our goals.
Our first dataset is a CSV titled “Movies on Netflix, Prime Video, Hulu and Disney+” and has a size of 1.85MB. It has 17 columns with data on the movie ID, movie title, year it was released, target age group, IMDb rating, Rotten Tomatoes rating, directors, genre, country, language, and runtime, whether it is a movie or TV series, and if the movie is on Netflix, Hulu, Prime Video, or Disney+.
Our second data set displays the most popular movies sorted by ratings from IMDb. We used web scraping to extract the movie’s name, date of release, and IMDb user score.
We used pandas dataframes and series to view and manipulate the data. For our first dataset, the CSV file with the streaming services and related information, we wrote a function that reads in the CSV file and returns the resulting dataframe. After that, we cleaned up the dataframe by changing the types of some variables so that they would match the types in the other dataframe. For the second dataset, we used web scraping to grab the necessary data and then added the scraped lists as columns of a dataframe.
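The loading and type clean-up step might look roughly like the sketch below. The column names follow the dataset description, but the inline sample rows, the `load_streaming_csv` helper name, and the exact casts are assumptions for illustration, not the project's actual code.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV file (values are made up).
SAMPLE_CSV = """ID,Title,Year,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+
1,Example Movie A,2010,8.8,87%,1,0,0,0
2,Example Movie B,1999,,,0,1,1,0
"""


def load_streaming_csv(source):
    """Read the streaming CSV and align column types with the scraped data."""
    df = pd.read_csv(source)
    # Year is read as an integer; keep it as a string so it can be parsed
    # into a datetime value in a later step.
    df["Year"] = df["Year"].astype(str)
    # IMDb ratings become floats; entries that cannot be parsed become NaN.
    df["IMDb"] = pd.to_numeric(df["IMDb"], errors="coerce")
    return df


movies = load_streaming_csv(io.StringIO(SAMPLE_CSV))
```

Passing a file path instead of the `io.StringIO` object would read the real file in the same way.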
We thought it would be important to use datetime values for the “Year” variable in each dataframe. Datetime values are a key concept in pandas and in this course, and we felt it would be useful to manipulate our dataframes through the “Year” column. By converting the years to datetime, we were able to compare data points across years in a well-formatted way. We also found it important to look at Rotten Tomatoes scores as well as IMDb ratings. The IMDb values only needed to be converted to a float data type, but the Rotten Tomatoes values had to be converted to strings and then changed from a percent string to a float rating out of 10 (for example, 87% becomes 8.7). This let us compare them with the IMDb ratings in both datasets. Lastly, the streaming platform columns in the first dataset were necessary to look at. Each of these columns holds an integer: 1 if the movie is on that platform and 0 if it is not. We used the movie titles, which are string values, to merge the datasets and connect our two tables so that we could compare values for movies that appear in both.
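As a minimal sketch of these two conversions, assuming the year is stored as a four-digit string and the Rotten Tomatoes score as a string such as "87%": the sample rows and the `RT_out_of_10` column name are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the column names follow the report, the values are made up.
df = pd.DataFrame({
    "Title": ["Example Movie A", "Example Movie B", "Example Movie C"],
    "Year": ["2010", "1999", "2019"],
    "Rotten Tomatoes": ["87%", None, "60%"],
})

# Parse four-digit year strings into datetime values (January 1 of that year).
df["Year"] = pd.to_datetime(df["Year"], format="%Y")

# "87%" -> "87" -> 87.0 -> 8.7, so the score shares IMDb's 0-10 scale.
# Missing scores stay missing (NaN) instead of raising an error.
stripped = df["Rotten Tomatoes"].str.rstrip("%")
df["RT_out_of_10"] = pd.to_numeric(stripped, errors="coerce") / 10
```

Dividing by 10 rather than 100 is what puts the percentage on the same scale as an IMDb rating.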
We used/retrieved 16,744 records for the dataset with streaming services and 100 records for the data frame of the top movies rated on the IMDb website.
As a whole, our data covers movies released from 1902 to around 2022. Every movie in the datasets has a release year within this range.
Each of our datasets includes a temporal aspect. They each have “Year” columns that are in datetime format in the dataframes we made. In addition to having each of these datasets in dataframes using Python and pandas, we also put both of them into SQL tables and did some SQL manipulations on them. We analyzed the streaming services by counting the number of movies on each service and by computing averages and percentages related to the streaming platforms.
First, we read in the movies-on-streaming-platforms dataset and cleaned it by making the values in the Year and Rotten Tomatoes columns string objects. Additionally, we converted everything in the IMDb column to a float. Then, we used Beautiful Soup to read in the IMDb website data and formatted it into a dataframe that we could use. From there, we converted the Year column in both dataframes to datetime values. We further cleaned the streaming services data by converting the Rotten Tomatoes score from a percentage to a float out of 10. We chose to clean the data this way because it makes the Rotten Tomatoes score easier to compare with the IMDb score. From here, we ran a statistical analysis using Pearson’s correlation to determine how Rotten Tomatoes and IMDb scores relate to each other. Next, we combined our two dataframes using a pivot table and determined which streaming service has the most movies per decade using groupby and sort_values().
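The correlation and per-decade steps can be sketched on a toy dataframe. The rows, the `RT_out_of_10` and `Decade` column names, and the use of `idxmax` to pick the leading platform (in place of sorting with sort_values()) are all assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical rows; 1/0 flags mark whether a title is on each platform.
df = pd.DataFrame({
    "Year": pd.to_datetime(["1994", "1999", "2011", "2015", "2019"], format="%Y"),
    "IMDb": [9.0, 8.0, 7.0, 6.0, 5.0],
    "RT_out_of_10": [9.5, 8.5, 7.5, 6.5, 5.5],
    "Netflix": [1, 0, 1, 1, 1],
    "Prime Video": [1, 1, 0, 1, 1],
})

# Series.corr computes Pearson's correlation coefficient by default.
r = df["IMDb"].corr(df["RT_out_of_10"])

# Bucket each release year into its decade, e.g. 1994 -> 1990.
df["Decade"] = df["Year"].dt.year // 10 * 10

# Summing the 1/0 flags counts the titles per platform in each decade.
per_decade = df.groupby("Decade")[["Netflix", "Prime Video"]].sum()

# Name of the platform with the most titles in each decade.
leader = per_decade.idxmax(axis=1)
```

In this toy data the two score columns move in lockstep, so `r` comes out at essentially 1; real ratings would give a weaker correlation.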
Then, we converted the dataframes to SQL tables that we could then manipulate. We renamed the Prime Video and Disney+ columns to make them easier to use in our queries. We then created a table to determine the average IMDb rating of the movies on each streaming platform. Disney+ has the highest average rating for the movies on its platform. Next, we calculated the number of movies on each platform for the rows where the titles match in both SQL tables. Lastly, we calculated the percentage of movies on each platform in a new table. Some of our SQL queries are included in the images below.
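A query for the average IMDb rating per platform might look like the following sketch, which uses Python's built-in sqlite3 module. The table name `streaming`, the renamed columns `PrimeVideo` and `DisneyPlus`, and the sample rows are hypothetical; the report does not state the names the project actually used.

```python
import sqlite3

# In-memory database with a hypothetical, simplified version of the table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE streaming (Title TEXT, IMDb REAL, Netflix INTEGER, "
    "Hulu INTEGER, PrimeVideo INTEGER, DisneyPlus INTEGER)"
)
rows = [
    ("Example Movie A", 8.0, 1, 0, 0, 0),
    ("Example Movie B", 6.0, 1, 0, 1, 0),
    ("Example Movie C", 7.0, 0, 0, 1, 1),
    ("Example Movie D", 9.0, 0, 1, 0, 1),
]
conn.executemany("INSERT INTO streaming VALUES (?, ?, ?, ?, ?, ?)", rows)

PLATFORMS = ["Netflix", "Hulu", "PrimeVideo", "DisneyPlus"]


def average_rating(platform_column):
    """Average IMDb rating of the titles flagged 1 for the given platform."""
    # Column names cannot be bound as query parameters, so only names from
    # the fixed list above are ever formatted into the SQL text.
    if platform_column not in PLATFORMS:
        raise ValueError(f"unknown platform column: {platform_column}")
    query = f"SELECT AVG(IMDb) FROM streaming WHERE {platform_column} = 1"
    return conn.execute(query).fetchone()[0]


averages = {name: average_rating(name) for name in PLATFORMS}
```

Renaming columns such as "Prime Video" and "Disney+" to names without spaces or symbols avoids having to quote them in every query.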
We conducted a statistical analysis (t-test) to determine whether the datasets we are comparing have different means. We conducted this test twice. The first time, we compared the mean IMDb ratings of the two datasets. We found a p-value of about 8.5e-11. Since this is an incredibly low value, we reject the null hypothesis (that the means are equal) and conclude that the mean IMDb ratings of the two datasets differ significantly. For the second t-test, we compared IMDb ratings to Rotten Tomatoes ratings within the first dataset. Here we found a p-value of about 1.4e-60, which likewise indicates that the mean IMDb rating and the mean Rotten Tomatoes rating in that dataset differ significantly.
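To make the test statistic concrete, here is a small sketch that computes a two-sample t statistic by hand with the standard library. The report does not say which t-test variant was used, so Welch's unequal-variance form, the `welch_t` helper, and the sample ratings are assumptions. A library routine such as SciPy's `scipy.stats.ttest_ind` would also return the p-value.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic for two independent samples."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Hypothetical ratings standing in for the two datasets.
streaming_ratings = [5.0, 6.0, 7.0]
top_100_ratings = [8.0, 9.0, 10.0]

t_stat = welch_t(streaming_ratings, top_100_ratings)
```

The further the statistic is from zero, the smaller the p-value, and the stronger the evidence against the null hypothesis that the two means are equal.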
Our first visualization depicts the IMDb ratings for each movie. Next, we have a graph that shows the average IMDb rating for each year. This allowed us to better visualize the data we were working with and to see the spread of movie scores. Next, we wanted to show which movies appear in the IMDb 100 most popular movies and are also on a streaming service. We created a pie chart that shows what percentage of these movies are on each streaming service. Our data shows that Prime Video has the highest percentage of these movies with 38.9%, followed by Netflix with 27.8%, Disney+ with 22.2%, and Hulu with 11.1%. Rotten Tomatoes is a website that reports the percentage of positive reviews for a movie. The website considers a movie “fresh” when at least 60% of its reviews are positive. So we filtered for movies with scores equal to or greater than 60 and found the average score for each streaming service. There weren’t any clear differences between the streaming services in average score, with Netflix at 83.8%, Hulu at 83.2%, Prime Video at 82.8%, and Disney+ at 82.4%. However, we also looked at how many of these movies are or have been on each streaming service and found that Prime Video had 1,897 movies at or above 60%, Netflix had 905, Hulu had 423, and Disney+ had 210.
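The “fresh” filter and per-service summary can be sketched as follows. The sample rows, the `RT_percent` column name, and the restriction to two platforms are assumptions made to keep the example short.

```python
import pandas as pd

# Hypothetical rows; RT_percent is the Rotten Tomatoes score as a number.
df = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "RT_percent": [95, 70, 40, 60, 85],
    "Netflix": [1, 1, 0, 0, 1],
    "Hulu": [0, 1, 1, 1, 0],
})

# Keep only movies at or above the 60% threshold.
FRESH_THRESHOLD = 60
fresh = df[df["RT_percent"] >= FRESH_THRESHOLD]

# For each platform, count its fresh movies and average their scores.
summary = {}
for platform in ["Netflix", "Hulu"]:
    on_platform = fresh[fresh[platform] == 1]
    summary[platform] = {
        "fresh_count": len(on_platform),
        "average_score": on_platform["RT_percent"].mean(),
    }
```

Reporting the count next to the average matters because two services can have nearly identical averages while one holds several times as many fresh movies.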
Our main goal for this project was to determine which streaming service has a movie collection with the highest average IMDb rating and which years had the highest rated movies. Through our data manipulation and calculations, we found that Disney+ has the highest average IMDb rating and that the years 1989, 1994, 2011, 2015, and 2019 have the highest movie rating in our dataframes at 9.3. Even though Disney+ had the highest average IMDb rating, we found that Amazon Prime was the real winner in the number of movies from our datasets that it has on its platform. Its share of the movies was about 50 percentage points higher than that of the runner-up, Netflix: Netflix came in at about 21%, whereas Amazon Prime was at about 73%. Through our visualizations and calculations we were able to learn a lot from this data and draw many connections and conclusions about streaming services, ratings, and time periods for movies.
We have test cases to ensure our functions for cleaning and manipulating the data are working properly. The tests cover all of the main functions from our manipulations and check that the actions we perform raise no errors. When we created and ran these cases, all of them passed. This gives us confidence that our functions are reliable and accurate in their calculations, so the data being collected can be analyzed properly.
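A test case for one cleaning step might look like the sketch below. The `percent_to_ten_point` helper and its test are hypothetical stand-ins for the project's own functions and tests, which are not shown in this report.

```python
import math


def percent_to_ten_point(score):
    """Convert a percent string such as '87%' to a float out of 10."""
    if score is None:
        # Missing scores stay missing instead of raising an error.
        return None
    return float(score.rstrip("%")) / 10


def test_percent_to_ten_point():
    # Typical value, the "fresh" boundary, and the two extremes.
    assert math.isclose(percent_to_ten_point("87%"), 8.7)
    assert math.isclose(percent_to_ten_point("60%"), 6.0)
    assert percent_to_ten_point("0%") == 0.0
    assert percent_to_ten_point("100%") == 10.0
    # Missing input is passed through unchanged.
    assert percent_to_ten_point(None) is None


# Called directly here; a test runner such as pytest would discover
# the test_ function on its own.
test_percent_to_ten_point()
```

Tests like this check edge cases such as missing values and boundary scores, which are the inputs most likely to break a cleaning function.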
We mostly collaborated synchronously over Zoom calls to accomplish tasks. More specifically, Ashley read and cleaned the data as well as manipulated the data using Python Pandas. Sophie used SQL queries to manipulate the data as well as wrote test cases for reading in the data. Chris worked on the visualizations of our data findings and data manipulation. We all collaborated on the final report and presentation.
In conclusion, Disney+ has the highest average IMDb rating, at about 6.44 out of 10, among the movies included on its streaming service. This indicates that Sophie and Chris were correct in their prediction that Disney+ would have the highest average IMDb rating. Additionally, Ashley and Chris were correct in their prediction that Prime Video has the highest percentage of movies on its platform, at 42.1%. Finally, because movie data is constantly changing and updating, our calculations are accurate only as of the date this post was created.