Netflix Data Analysis using Python

Data analysis is an effective technique that supports corporate decision-making. In this tutorial, we'll use Python to explore the Netflix dataset and uncover some interesting findings.

Netflix is one of the world's largest and most popular online services, giving customers access to a vast library of television series and films. The site has amassed a sizable following in recent years, and its popularity remains strong. In this tutorial, we'll demonstrate how to use Python to perform exploratory data analysis (EDA) on a Netflix dataset obtained from Kaggle. To visualize and analyze the data, we'll use several Python packages, such as Pandas, Seaborn, Matplotlib, and Plotly.

Netflix Movies Dataset

You'll need a few things before you can begin this tutorial. The first is an understanding of Python's built-in data structures, particularly lists and dictionaries. See Dictionaries in Python and Lists and Tuples in Python for further details.

The second is a working Python environment. You can follow along in any interface that has Python 3 installed. If you want nicer-looking output, especially for the large Netflix movies dataset you'll be using, run the code examples in a Jupyter notebook.

Netflix Movies Table (a sample of the dataset used throughout this tutorial):

s3129	TV Show	France	Jan 14,	2018	TV-Y	1 Season
s3131	Movie	Poland	Jan 14,	2019	TV-MA	132 min
s3132	Movie	USA	Jan 12,	2018	TV-14	56 min
s3133	Movie	India	Jan 12,	2019	TV-MA	128 min
s3134	TV Show	USA	Jan 11,	2020	TV-14	1 Season
s3135	TV Show	UK	Jan 11,	2019	TV-MA	1 Season
s3136	TV Show	USA	Jan 11,	2020	TV-Y7	4 Seasons
s3137	TV Show	India	Jan 11,	2020	TV-MA	1 Season
s3138	TV Show	USA	Jan 11,	2020	TV-MA	1 Season
s3139	Movie	UK	Jan 11,	2014	R	119 min
s3141	TV Show	Australia	Jan 11,	2020	TV-Y	2 Seasons
s3141	TV Show	France	Jan 11,	2020	TV-MA	1 Season
s3142	Movie	USA	Jan 9,	2020	TV-MA	97 min
s3143	TV Show	USA	Jan 8,	2020	TV-MA	1 Season
s3144	Movie	USA	Jan 7,	2017	R	92 min
s3145	Movie	Spain	Jan 7,	2020	TV-MA	112 min
s3146	TV Show	UK	Jan 4,	2020	TV-14	1 Season
s3147	Movie	Mexico	Jan 3,	2020	TV-14	91 min
s3148	TV Show	Canada	Jan 3,	2019	TV-PG	3 Seasons
s3149	Movie	India	Jan 2,	2017	TV-14	129 min
s3151	TV Show	USA	Jan 2,	2020	TV-MA	1 Season
s3151	TV Show	Belgium	Jan 2,	2019	TV-MA	1 Season
s3155	Movie	UK	Jan 1,	2014	TV-PG	115 min
s3156	TV Show	Japan	Jan 1,	2018	TV-PG	1 Season
s3158	Movie	USA	Jan 1,	2017	TV-Y	78 min

Importing Libraries and Loading the Data

Source Code Snippet

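import pandas as pdd
import numpy as npp
import seaborn as sbn  # visualization library
import matplotlib.pyplot as plot

# Path as used in a Kaggle notebook; adjust it to wherever your copy of the CSV lives
df1 = pdd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv')
df1.head()  # show the first five rows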

The output shows the first five rows of the dataset, which has 12 columns and is made up of TV shows and movies uploaded on September 8, 2020. The type column tells us whether a title is a TV show or a movie.
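As a quick sanity check after loading, it can help to look at the shape of the table and at how many titles fall under each type. The sketch below uses a tiny made-up DataFrame named toy (hypothetical, not the real CSV) so that it runs without the Kaggle file.

```python
import pandas as pd

# Hypothetical toy rows standing in for netflix_titles.csv (not real data)
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "TV Show", "Movie"],
    "title": ["A", "B", "C"],
})

n_rows, n_cols = toy.shape                 # (number of rows, number of columns)
type_counts = toy["type"].value_counts()   # number of titles per type

print(n_rows, n_cols)
print(type_counts)
```

On the real data, the same two calls on df1 would report the 12 columns mentioned above.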

Before visualizing the data, let's check it for missing values and clean it up.

Note: The Python community offers several package managers, and you may be unsure which one to choose. Both pip and conda are great options, and each has its merits.
Conda may be the better option if you want to use Python primarily for data science tasks. The conda ecosystem offers two main options:
Check out the Anaconda distribution if you want to quickly set up a reliable data science environment and don't mind downloading 500 MB of data.
Check out the Miniconda installation section of Setting Up Python for Artificial Intelligence on Windows if you'd prefer a more minimal setup.

Source Code Snippet

df1.isnull().sum()  # count the missing values in each column

Output:

show_id	0
type	0
title	0
director	2634
cast	825
country	831
date_added	10
release_year	0
rating	4
duration	3
listed_in	0
description	0
dtype: int64

Explanation: The output lists each column of the dataset together with the number of missing values it contains. The dtype: int64 at the bottom is the data type of these counts; most of the columns themselves hold text, which pandas has traditionally stored using the object data type. Pandas works with these types using the NumPy library. Pandas also implements a more complex categorical data type directly.

The object data type is a special one. According to the pandas Cookbook, the object data type is "a catch-all for data that pandas doesn't recognise as any other specific type." In practice, it often means that all of the values in the column are strings.

Although the object data type lets you store any Python object, you should be aware of its limitations. Odd values in an object column can harm pandas' performance and its interoperability with other libraries. For more information, check out the official pandas getting started guide.
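To make the data types concrete, here is a minimal sketch on a made-up DataFrame (the name toy and its values are hypothetical). It prints the dtype of each column and converts a text column with few distinct values to the categorical type. Depending on the pandas version, text columns may be reported as object or as a dedicated string type.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Netflix CSV
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "release_year": [2019, 2020, 2020],
    "rating": ["TV-MA", "TV-14", "TV-MA"],
})

print(toy.dtypes)  # one dtype per column

# A text column with few distinct values can be stored as 'category'
toy["rating"] = toy["rating"].astype("category")
rating_levels = list(toy["rating"].cat.categories)
print(rating_levels)
```

Categorical columns usually take less memory than plain text columns, which can matter for larger datasets.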

The following columns have null values that need to be cleaned:

director: 2,634 null values
cast: 825 null values
country: 831 null values
date_added: 10 null values
rating: 4 null values
duration: 3 null values

Note: You could also download the CSV file using your web browser.
However, using a download script offers a number of benefits:
You can trace the source of your data.
The download is repeatable at any time, which is extremely useful if the data is updated often.
You don't need to distribute the 17 MB CSV file to your coworkers. Usually, sharing the download script is sufficient.

Source Code Snippet

# Netflix was founded in the United States, so this tutorial assumes that missing
# values in the country column can be filled with 'USA'. Filling the values keeps
# the rows in the dataset instead of discarding them.
df1['country'] = df1['country'].fillna('USA')
# Alternative: df1['country'] = df1['country'].fillna('Not Specified')
df1['director'] = df1['director'].fillna('No Director')
df1['cast'] = df1['cast'].fillna('No Cast')
df1.isnull().sum()

Explanation: There are two ways to handle the "country" column. First, since Netflix was founded in the USA, we can substitute "USA" for all the NaN values. Alternatively, because the countries of these titles are not given in the data, we can replace the NaN values with "Not Specified." For columns like "director" and "cast," we can replace missing values with "No Director" and "No Cast," respectively. Replacing the values in these columns, rather than removing the rows, lets us keep our data.
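The replacement strategy described above can also be written as a single fillna call with one placeholder per column. This is a minimal sketch on made-up rows (toy and filled are hypothetical names), using the "Not Specified" variant for the country column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with gaps in three columns
toy = pd.DataFrame({
    "country": ["India", np.nan, "USA"],
    "director": [np.nan, "Jane Doe", np.nan],
    "cast": ["X, Y", np.nan, "Z"],
})

# One placeholder per column; rows are kept, only the gaps are filled
filled = toy.fillna({
    "country": "Not Specified",
    "director": "No Director",
    "cast": "No Cast",
})

remaining_nulls = int(filled.isnull().sum().sum())
print(filled)
print(remaining_nulls)
```

Because fillna returns a new DataFrame here, the original toy frame is left untouched, which makes it easy to compare before and after.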

Since the number of missing values in the remaining columns (date_added, duration, and rating) is so low, we'll drop those rows from the dataset.
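A plain dropna() removes every row that still has any missing value. If you would rather state explicitly which columns decide whether a row is dropped, the subset parameter can be used. The sketch below assumes made-up rows (toy and cleaned are hypothetical names) and keeps a row whose only gap is in the director column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; titles "A" and "D" lack a director but are otherwise complete
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [np.nan, "Jane Doe", "John Roe", np.nan],
    "date_added": ["January 1, 2020", np.nan, "January 2, 2020", "January 3, 2020"],
    "rating": ["TV-MA", "TV-14", np.nan, "R"],
    "duration": ["90 min", "1 Season", "2 Seasons", "100 min"],
})

# Drop a row only if one of these three columns is missing
cleaned = toy.dropna(subset=["date_added", "rating", "duration"])
print(cleaned)
```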

Source Code Snippet

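df1 = df1.dropna()  # drop the few rows that still contain missing values
df1.isnull().sum()
df1['rating'].value_counts()
df1['listed_in'].value_counts()  # produces the output shown below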

Output:

Dramas, International Movies	462
Documentaries	449
Stand-Up Comedy	444
Comedies, Dramas, International Movies	274
Dramas, Independent Movies, International Movies	242
Crime TV Shows, International TV Shows, TV Sci-Fi & Fantasy	11
International TV Shows, TV Horror, TV Sci-Fi & Fantasy	11
Crime TV Shows, Kids' TV	12
Horror Movies, International Movies, Sci-Fi & Fantasy	13
Cult Movies, Dramas, Thrillers	14

Explanation: The output lists how often each value occurs in the 'listed_in' column, which contains 461 distinct values. In the output shown, 'Dramas, International Movies' occurs most frequently, with 462 titles, followed by 'Documentaries' with 449 and 'Stand-Up Comedy' with 444. The combinations at the bottom of the list occur far less often.

As the null counts showed, only 4 values are missing in the rating column. You can either replace them or remove them. Since TV-MA is the most common rating, these NaN values could instead be replaced with TV-MA.
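If you prefer replacing the missing ratings over dropping them, one possible approach is to fill them with the most frequent value (the mode). The sketch below uses a small made-up Series; the names ratings, most_common, and ratings_filled are hypothetical.

```python
import pandas as pd

# Hypothetical ratings with two missing entries
ratings = pd.Series(["TV-MA", "TV-14", None, "TV-MA", None], name="rating")

most_common = ratings.mode()[0]               # mode() ignores missing values
ratings_filled = ratings.fillna(most_common)  # fill the gaps with the mode

print(most_common)
print(ratings_filled)
```

Keep in mind that filling with the mode slightly inflates the count of the most common rating, so it is best suited to cases with very few gaps.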

Data Analysis

Source Code Snippet

sbn.countplot(x='type', data=df1)  # number of movies vs. TV shows

Output:

Explanation: The plot displays how many movies and TV shows are included in the dataset. It shows that the dataset contains more movies than TV shows.

Source Code Snippet

plot.figure(figsize=(12, 8))
sbn.countplot(x='rating', data=df1)

Output:

Explanation: The second output is a count plot of the 'rating' column of the df1 DataFrame. It displays the number of titles for each distinct rating. The plot shows that TV-MA is the most common rating in the dataset, followed by TV-14 and TV-PG. A simple count plot like this one can be refined step by step: the bar order can be changed, grouped count plots can be added, bars can be colored (including conditionally), and value labels can be added to the bars.
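As one illustration of such refinements, the sketch below draws the bars from most to least frequent and writes the count on top of each bar. It uses pandas' own plotting on a made-up DataFrame rather than seaborn, the names toy and counts are hypothetical, and bar_label assumes Matplotlib 3.4 or newer.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ratings, not taken from the Netflix CSV
toy = pd.DataFrame({"rating": ["TV-MA", "TV-14", "TV-MA", "TV-PG", "TV-MA", "TV-14"]})

counts = toy["rating"].value_counts()   # sorted with the most frequent rating first
ax = counts.plot(kind="bar", color="steelblue")
ax.bar_label(ax.containers[0])          # write each count above its bar
ax.set_xlabel("rating")
ax.set_ylabel("count")
plt.tight_layout()
```

With seaborn, a similar ordering can be requested by passing the sorted index to the order parameter of countplot.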

Source Code Snippet

plot.figure(figsize=(12, 8))
sbn.countplot(x='rating', data=df1, hue='type')

Output:

Explanation: The third output is a count plot of the 'rating' column of the df1 DataFrame, with the hue set to the 'type' column. The chart displays the frequency of each distinct rating, broken down by the type of content (movie or TV show). This lets us see how the rating distribution differs between movies and TV shows: the PG-13 rating is more prevalent among movies than among TV shows, while TV shows are more likely to receive the TV-MA rating.
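The numbers behind a grouped count plot like this can be inspected directly with a cross-tabulation. The following is a minimal sketch on made-up rows (toy and table are hypothetical names); each cell holds the number of titles for one rating and one type.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the Netflix CSV
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show", "Movie"],
    "rating": ["TV-MA", "TV-MA", "PG-13", "TV-14", "PG-13"],
})

# Rows are ratings, columns are content types, cells are counts
table = pd.crosstab(toy["rating"], toy["type"])
print(table)
```

A table like this makes it easy to quote exact counts alongside the chart.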

Source Code Snippet

plot.figure(figsize=(12, 6))
# 20 most frequent release years among movies
df1[df1["type"] == "Movie"]["release_year"].value_counts().head(20).plot(kind="bar", color="Red")
plot.title("Frequency of Movies which were released in different years and are available on Netflix")

Output:

Explanation: This bar chart displays the frequency of movies released in different years and available on Netflix. It shows the 20 most frequent release years for movies in the dataset, with the number of movies for each year on the y-axis. The plot's title is "Frequency of Movies which were released in different years and are available on Netflix", and the bars are drawn in red. Multiple bar charts are useful for comparing data as one variable changes. Such a chart is easy to turn into a stacked bar chart, in which each subgroup is drawn on top of the other; it can be plotted by adjusting the position and width of the bars.
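A stacked bar chart of the kind mentioned above can be sketched with pandas by counting titles per release year and type and then stacking the types on top of each other. The rows below are made up, and the names toy and by_year are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy rows, not taken from the Netflix CSV
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show"],
    "release_year": [2019, 2019, 2019, 2020, 2020],
})

# One row per year, one column per type, cells are counts
by_year = toy.groupby(["release_year", "type"]).size().unstack(fill_value=0)

# stacked=True draws the TV show segment on top of the movie segment
ax = by_year.plot(kind="bar", stacked=True, figsize=(8, 4))
ax.set_ylabel("Number of titles")
plt.tight_layout()
```

Letting pandas handle the stacking avoids computing bar positions and widths by hand.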

Source Code Snippet

plot.figure(figsize=(12, 6))
# 20 most frequent release years among TV shows
df1[df1["type"] == "TV Show"]["release_year"].value_counts().head(20).plot(kind="bar", color="Blue")
plot.title("Frequency of TV shows which were released in different years and are available on Netflix")

Output:

Like the output above, this one displays the frequency of TV shows by release year among those available on Netflix.

Source Code Snippet

plot.figure(figsize=(12, 6))
df1[df1["type"] == "Movie"]["listed_in"].value_counts().head(10).plot(kind="barh", color="black")
plot.title("Top 10 Categories of Movies", size=18)

Output:

Source Code Snippet

plot.figure(figsize=(12, 6))
df1[df1["type"] == "TV Show"]["listed_in"].value_counts().head(10).plot(kind="barh", color="brown")
plot.title("Top 10 Categories of TV Shows", size=18)

Output:

These horizontal bar charts display the top 10 categories of movies and TV shows.

Consolidated Code for Netflix Data Analysis using Python

 
# -------------- Import libraries and load the data --------------
import pandas as pdd
import numpy as npp
import seaborn as sbn  # visualization library
import matplotlib.pyplot as plot

df1 = pdd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv')
df1.head()

# -------------- Explore the data --------------
df1.isnull().sum()  # check for null values
# Fill missing values instead of dropping whole rows
df1['country'] = df1['country'].fillna('USA')  # alternative: 'Not Specified'
df1['director'] = df1['director'].fillna('No Director')
df1['cast'] = df1['cast'].fillna('No Cast')
df1.isnull().sum()
df1 = df1.dropna()  # drop the few rows still missing date_added, rating, or duration
df1.isnull().sum()
df1['rating'].value_counts()
df1['listed_in'].value_counts()
# -------------- Data analysis --------------

sbn.countplot(x='type', data=df1)  # number of movies vs. TV shows

plot.figure(figsize=(12, 8))
sbn.countplot(x='rating', data=df1)

plot.figure(figsize=(12, 8))
sbn.countplot(x='rating', data=df1, hue='type')

old = df1.sort_values("release_year", ascending=True)  # oldest titles available on Netflix
old = old[old['duration'] != ""]
old[['title', "release_year"]].head(15)

# -------------- Stand-up shows on Netflix --------------
tag = "Stand-Up Comedy"
# Flag rows whose listed_in text contains the tag (1 = match, 0 = no match)
df1["relevant"] = df1['listed_in'].fillna("").apply(lambda x: 1 if tag.lower() in x.lower() else 0)
com = df1[df1["relevant"] == 1]
# Assumes the country label 'USA' used earlier in this tutorial; adjust it if your copy of the data spells it differently
com[com["country"] == "USA"][["title", "country", "release_year"]].head(11)

# -------------- Kids' TV shows on Netflix --------------
tag = "Kids' TV"
df1["relevant"] = df1['listed_in'].fillna("").apply(lambda x: 1 if tag.lower() in x.lower() else 0)
com = df1[df1["relevant"] == 1]
com[com["country"] == "USA"][["title", "country", "release_year"]].head(11)

df1_countries = pdd.DataFrame(df1.country.value_counts().reset_index().values, columns=["country", "count"])
df1_countries.head()

date = pdd.DataFrame(df1.release_year.value_counts().reset_index().values, columns=["Year", "Count"])
date.head()

plot.figure(figsize=(12, 6))
df1[df1["type"] == "Movie"]["release_year"].value_counts().head(20).plot(kind="bar", color="Red")
plot.title("Frequency of Movies which were released in different years and are available on Netflix")

plot.figure(figsize=(12, 6))
df1[df1["type"] == "TV Show"]["release_year"].value_counts().head(20).plot(kind="bar", color="Blue")
plot.title("Frequency of TV shows which were released in different years and are available on Netflix")

plot.figure(figsize=(12, 6))
df1[df1["type"] == "Movie"]["listed_in"].value_counts().head(10).plot(kind="barh", color="black")
plot.title("Top 10 Categories of Movies", size=18)

plot.figure(figsize=(12, 6))
df1[df1["type"] == "TV Show"]["listed_in"].value_counts().head(10).plot(kind="barh", color="brown")
plot.title("Top 10 Categories of TV Shows", size=18)

# -------------- Create a year_added column based on date_added --------------
netflix = df1.copy()
# Strip stray whitespace before parsing the date strings
netflix['year_added'] = pdd.to_datetime(netflix['date_added'].str.strip()).dt.year
netflix['year_added'] = netflix['year_added'].astype(int)

# -------------- Temporary data frames to use for plots --------------
netflix_total_df1 = netflix['year_added'].value_counts().rename_axis('year').reset_index(name='count')
netflix_movies_df1 = netflix[netflix['type'] == "Movie"]['year_added'].value_counts().rename_axis('year').reset_index(name='count')
netflix_tv_df1 = netflix[netflix['type'] == "TV Show"]['year_added'].value_counts().rename_axis('year').reset_index(name='count')

sbn.set_style("dark")  # set the style before the figure is created
fig, axx = plot.subplots(figsize=(13, 7))
plot.title("Frequency of content added by Netflix (2008 - 2020)")
plot.xlabel("Year")
plot.ylabel("Number Added")
axx.set_xticks(npp.arange(2008, 2022, 1))
sbn.lineplot(data=netflix_total_df1, x="year", y="count", color="black", label="Total")
sbn.lineplot(data=netflix_movies_df1, x="year", y="count", color="red", label="Movies")
sbn.lineplot(data=netflix_tv_df1, x="year", y="count", color="blue", label="TV Shows")
plot.legend()
plot.grid()
plot.show()

# -------------- Temporary data frame with one row per title and genre --------------
category = netflix.set_index('title').listed_in.str.split(', ', expand=True).stack().reset_index(level=1, drop=True)
category_df1 = category.to_frame('genre')
years = netflix.set_index('title').year_added
description = netflix.set_index('title').description
a = pdd.merge(category_df1, years, left_index=True, right_index=True)
temp = pdd.merge(a, description, left_index=True, right_index=True)

plot.figure(figsize=(11, 11))
sbn.countplot(y='genre', data=temp, order=temp.genre.value_counts().iloc[:20].index)
plot.title('Top 20 Categories added by Netflix (2008 - 2020)')
plot.xlabel('Number of Titles')
plot.ylabel('Genre')
plot.grid()
plot.show()

df1.head()
df1.director.value_counts()

# Five most common ratings
newdata = df1.groupby('rating').size().rename_axis('Rating').reset_index(name='Count')
ndd = newdata.sort_values(by='Count', ascending=True)
ndd = ndd.tail(5)

# Top 5 directors (with this split, names after a comma keep their leading space)
fil_directors = df1['director'].str.split(',', expand=True).stack()
fil_directors = pdd.DataFrame(fil_directors)
fil_directors.columns = ['director']
directors = fil_directors.groupby(['director']).size().reset_index(name='counts')
directors = directors.sort_values(by='counts', ascending=False)
directors = directors[directors['director'] != 'No Director']
directors = directors.head(5)
directors

# Top 5 actors, computed the same way
fil_actors = df1['cast'].str.split(',', expand=True).stack()
fil_actors = pdd.DataFrame(fil_actors)
fil_actors.columns = ['cast']
actors = fil_actors.groupby(['cast']).size().reset_index(name='counts')
actors = actors.sort_values(by='counts', ascending=False)
actors = actors[actors['cast'] != 'No Cast']
actors = actors.head(5)
actors
# Titles per release year and type, from 2011 onwards
df1f = df1[['type', 'release_year']]
df1f = df1f.rename(columns={'release_year': 'Release Year'})
df1f2 = df1f.groupby(['Release Year', 'type']).size().reset_index(name='Total Content')
df1f2 = df1f2[df1f2['Release Year'] >= 2011]

df1f3 = df1f2[df1f2['type'] == 'Movie']
df1f3

df1f4 = df1f2[df1f2['type'] == 'TV Show']
df1f4
# Top 5 TV shows by number of seasons, using 'title', 'duration', and 'type'
features = ['title', 'duration', 'type']
seasons = df1.loc[df1['type'] == 'TV Show', features].copy()  # copy avoids SettingWithCopyWarning
# '1 Season' -> 1, '4 Seasons' -> 4
seasons['no_of_seasons'] = seasons['duration'].str.replace(' Seasons', '', regex=False)
seasons['no_of_seasons'] = seasons['no_of_seasons'].str.replace(' Season', '', regex=False)
seasons['no_of_seasons'] = seasons['no_of_seasons'].astype(int)
seasons = seasons.drop('duration', axis=1)
seasons = seasons.sort_values('no_of_seasons', ascending=False)
seasons = seasons.head(5)
seasons

Output:

Explanation: The first plot shows how frequently Netflix added new content between 2008 and 2020. According to the plot, the number of titles added has grown year over year, with a noticeable increase in 2015. The annual addition of movies has typically outpaced that of TV shows.
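The per-year counts behind such a line plot come from parsing the date_added strings and counting titles per year. The sketch below assumes dates written like "January 1, 2020", possibly with stray leading spaces; the values and the names dates, year_added, and per_year are made up for illustration.

```python
import pandas as pd

# Hypothetical date strings in the assumed "Month day, year" format
dates = pd.Series(
    [" August 4, 2017", "September 25, 2021", "January 1, 2020", "January 11, 2020"],
    name="date_added",
)

# Strip whitespace, parse the dates, and keep only the year
year_added = pd.to_datetime(dates.str.strip(), format="%B %d, %Y").dt.year

# One row per year with the number of titles added in that year
per_year = (
    year_added.value_counts()
    .sort_index()
    .rename_axis("year")
    .reset_index(name="count")
)
print(per_year)
```

A frame shaped like per_year can be passed to a line plot with year on the x-axis and count on the y-axis.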

The second plot shows the top 20 categories that Netflix added between 2008 and 2020. "International Movies" is the most popular genre, followed by "Dramas" and "Comedies". A handful of TV categories also made the top 20, but the majority are movie categories. This suggests that Netflix has prioritized adding movies to its platform over TV shows.
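Counting individual genres requires splitting the comma-separated listed_in strings first, because a single title can belong to several categories. Here is a minimal sketch of that idea using explode on made-up rows; the names toy, genres, and genre_counts are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the Netflix CSV
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "listed_in": [
        "Dramas, International Movies",
        "Comedies, Dramas",
        "Documentaries",
    ],
})

# Split each string into a list, then give every genre its own row
genres = toy["listed_in"].str.split(", ").explode()
genre_counts = genres.value_counts()
print(genre_counts)
```

This is an alternative to the split, stack, and merge steps in the consolidated code, and it yields the same kind of per-genre counts.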

Further Analysis

Top 5 Directors: The code identifies the top 5 directors with the most titles. Rajiv Chilaka, Raúl Campos, Jan Suter, Suhas Kadav, and Marcus Raboy are on the list.

	director	counts
4019	Rajiv Chilaka	22
4066	Raúl Campos	18
261	Jan Suter	18
4650	Suhas Kadav	16
3233	Marcus Raboy	16
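One detail worth knowing about this kind of count: splitting on a bare comma leaves a leading space on every name after the first, so the same person can be counted under two slightly different strings. The sketch below shows the effect on a made-up Series and how stripping whitespace merges the entries; the names directors, raw, and clean are hypothetical.

```python
import pandas as pd

# Hypothetical director strings; two titles share the same pair of directors
directors = pd.Series(
    ["Jan Suter, Raúl Campos", "Raúl Campos, Jan Suter", "Rajiv Chilaka"],
    name="director",
)

# Splitting on "," keeps the space after each comma
raw = directors.str.split(",").explode().value_counts()

# Stripping whitespace merges "Jan Suter" and " Jan Suter"
clean = directors.str.split(",").explode().str.strip().value_counts()

print(raw)
print(clean)
```

Applying the same stripping step to the real data may change the counts shown in the table above.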

Top 5 Actors: The code also identifies the top 5 actors with the most films. Rupa Bhimani, Anupam Kher, Julie Tejwani, Takahiro Sakurai, and Om Puri are on the list.

	cast	counts
2605	Anupam Kher	39
26903	Rupa Bhimani	31
30263	Takahiro Sakurai	30
15518	Julie Tejwani	28
23591	Om Puri	27

Top 5 TV Shows by Number of Seasons: The code lists the top 5 TV shows by number of seasons. It uses the title, duration, and type columns to derive the number of seasons. As can be seen, the TV show with the most seasons has 17.

	title	type	no_of_seasons
548	Grey's Anatomy	TV Show	17
2423	Supernatural	TV Show	15
4798	NCIS	TV Show	15
4220	COMEDIANS of the world	TV Show	13
1354	Heartland	TV Show	13
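The season counts are derived from duration strings such as "1 Season" or "4 Seasons". As an alternative to removing the words, the leading number can be extracted with a regular expression. This is a minimal sketch on made-up rows; the titles and the names toy, shows, and top are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the Netflix CSV
toy = pd.DataFrame({
    "title": ["Show A", "Show B", "Film C", "Show D"],
    "type": ["TV Show", "TV Show", "Movie", "TV Show"],
    "duration": ["1 Season", "4 Seasons", "92 min", "17 Seasons"],
})

# Keep TV shows only, then pull the leading number out of the duration text
shows = toy[toy["type"] == "TV Show"].copy()
shows["no_of_seasons"] = shows["duration"].str.extract(r"(\d+)", expand=False).astype(int)

top = shows.sort_values("no_of_seasons", ascending=False).head(2)
print(top[["title", "type", "no_of_seasons"]])
```

Filtering to TV shows before the conversion keeps movie runtimes such as "92 min" out of the season counts.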

Conclusion

In conclusion, we learned a lot about Netflix content using Python and various data analysis modules. These insights can help improve the decisions Netflix makes about the content it adds to its platform and how it markets that content to its viewers. Any industry can benefit from the insights gained through data analysis, since it is such a powerful tool. Using Python and statistical tools, we can better understand any dataset and make data-driven decisions.
