Water Quality Analysis

Access to clean drinking water is a basic necessity for everyone, and it is widely recognised as a fundamental human right. Water quality depends on many factors, which makes it a popular subject for machine learning projects. If you want to understand how to analyse water quality using machine learning, this tutorial is for you: it walks through a water quality analysis with machine learning in Python.

Introduction: Water Quality Analysis

Water quality analysis is a well-studied machine learning problem. To train a model that can determine whether a given water sample is safe or unsafe to drink, we must first understand the parameters that affect water potability. This task is also known as water potability analysis.

For this task, we'll use a Kaggle dataset that records the key factors affecting the potability of water. Because each of these factors matters for water quality, we'll briefly examine every feature of the dataset before building a machine learning model to predict whether a water sample is safe or unsafe to drink.

About dataset

Content

The water_potability dataset contains water quality metrics for 3,276 water samples, described by the following columns:

pH: The pH value is an important parameter for evaluating the acid-base balance of water, and it indicates whether the water is acidic or alkaline. WHO recommends a permissible pH range of 6.5 to 8.5. The ranges reported in the source investigation were 6.52 to 6.83, which falls within the WHO range.
Hardness: Hardness is mainly caused by calcium and magnesium salts. These salts dissolve from the geologic deposits through which water travels. The length of time water is in contact with hardness-producing material helps determine how hard the raw water is. Hardness was originally defined as the capacity of water to precipitate soap, which is caused by calcium and magnesium.
Solids (total dissolved solids, TDS): Water can dissolve a wide range of inorganic and some organic minerals or salts, such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium and sulphates. These minerals give the water an unwanted taste and a diluted colour. This is an important parameter for the use of water. Water with a high TDS value is highly mineralised. The desirable TDS limit for drinking water is 500 mg/L, and the maximum limit is 1000 mg/L.
Chloramines: Chlorine and chloramine are the two main disinfectants used in public water systems. Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking water. Chlorine levels up to 4 milligrams per litre (mg/L, or 4 parts per million) are considered safe in drinking water.
Sulphate: Sulphates are naturally occurring substances found in minerals, soil and rocks. They are also present in ambient air, groundwater, plants and food. The principal commercial use of sulphate is in the chemical industry. Seawater contains about 2,700 milligrams of sulphate per litre. Most freshwater supplies have concentrations between 3 and 30 mg/L, although much higher concentrations are found in some geographic locations.
Conductivity: Pure water is a poor conductor of electric current and a good insulator. An increase in ion concentration increases the electrical conductivity of water. In general, the amount of dissolved solids in water determines its electrical conductivity. Electrical conductivity (EC) measures the ionic content of a solution, which is what enables it to carry current. According to WHO guidelines, the EC value should not exceed 400 μS/cm.
Organic_carbon: Total organic carbon (TOC) in source waters comes from decaying natural organic matter (NOM) as well as from synthetic sources. TOC is a measure of the total amount of carbon in organic compounds in pure water. According to the US EPA, TOC should be below 2 mg/L in treated drinking water and below 4 mg/L in source water that is used for treatment.
Trihalomethanes (THMs): THMs are chemicals that may be found in water treated with chlorine. The concentration of THMs in drinking water depends on the level of organic material in the water, the amount of chlorine required to treat the water, and the temperature of the water being treated. THM levels up to 80 µg/L (80 parts per billion) are considered safe in drinking water.
Turbidity: The turbidity of water depends on the quantity of solid matter present in the suspended state. It is a measure of the light-scattering properties of water, and the test is used to indicate the quality of waste discharge with respect to colloidal matter. The mean turbidity value obtained for the Wondo Genet Campus (0.98 NTU) is lower than the WHO-recommended value of 5.00 NTU.
Potability: A binary label that indicates whether the water is safe for human consumption: 1 means potable and 0 means not potable.
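
The limits quoted in the column descriptions can be turned into a quick sanity check. The following is a minimal sketch, not part of the original analysis: the `LIMITS` table, the `share_within_limits` helper and the sample rows are all hypothetical, the units are assumed to match the dataset's columns, and treating each guideline as a hard cut-off is a simplification.

```python
import pandas as pd

# Limits quoted in the column descriptions; units are assumed to match the
# dataset's columns. None means "no bound on that side".
LIMITS = {
    "ph": (6.5, 8.5),
    "Solids": (None, 1000.0),
    "Chloramines": (None, 4.0),
    "Conductivity": (None, 400.0),
    "Trihalomethanes": (None, 80.0),
    "Turbidity": (None, 5.0),
}


def share_within_limits(frame: pd.DataFrame) -> pd.Series:
    """Return, per column, the fraction of non-null values inside its limits."""
    shares = {}
    for column, (low, high) in LIMITS.items():
        values = frame[column].dropna()
        inside = pd.Series(True, index=values.index)
        if low is not None:
            inside = inside & (values >= low)
        if high is not None:
            inside = inside & (values <= high)
        shares[column] = inside.mean()
    return pd.Series(shares)


# Hypothetical stand-in rows; with the real file, pass the loaded frame instead.
sample = pd.DataFrame({
    "ph": [7.0, 3.7, None, 9.1],
    "Solids": [800.0, 20791.3, 15000.0, 950.0],
    "Chloramines": [3.5, 7.3, 6.6, 2.0],
    "Conductivity": [350.0, 564.3, 418.6, 398.4],
    "Trihalomethanes": [60.0, 86.9, None, 100.3],
    "Turbidity": [2.9, 4.5, 3.0, 4.6],
})
print(share_within_limits(sample))
```

A low share for a column does not by itself say anything about the Potability label; it only shows how the recorded values compare with the quoted guideline.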

Python Water Quality Analysis

We'll begin the work of analysing the water quality by importing the dataset and the required Python libraries:

Source Code Snippet:

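	# Plotting, data handling and numerical libraries
	import matplotlib.pyplot as plot
	import pandas as pdd
	import seaborn as sns1
	import numpy as npp

	# Load the dataset and preview the first five rows
	data = pdd.read_csv( "water_potability.csv" )
	data.head( )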

Output:

	ph	Hardness	Solids	Chloramines	Sulfate	Conductivity
0	NaN	204.590455	20791.315951	7.300212	355.515441	554.305554
1	3.715050	129.422921	15530.057555	5.535245	NaN	592.555359
2	5.099124	224.235259	19909.541732	9.275554	NaN	415.505213
3	5.315755	214.373394	22015.417441	5.059332	355.555135	353.255515
4	9.092223	151.101509	17975.955339	5.545500	310.135735	395.410513

The preview already shows null values in the dataset, so let's remove all rows that contain null values before moving forward:

Source Code Snippet:

data = data.dropna( )
data.isnull( ).sum( )

Output:

ph                 0
Hardness           0
Solids             0
Chloramines        0
Sulfate            0
Conductivity       0
Organic_carbon     0
Trihalomethanes    0
Turbidity          0
Potability         0
dtype: int64
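
Dropping rows is the simplest way to handle missing values, but it can discard a sizeable share of the data. Here is a minimal sketch for checking how many rows `dropna` removes; the small `raw` frame is a hypothetical stand-in for the loaded CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame loaded from water_potability.csv
raw = pd.DataFrame({
    "ph": [np.nan, 3.72, 8.10, 8.32, 9.09],
    "Sulfate": [368.5, np.nan, np.nan, 356.9, 310.1],
    "Potability": [0, 0, 0, 0, 0],
})

rows_before = len(raw)
cleaned = raw.dropna()  # keeps only rows without any missing value
rows_after = len(cleaned)
print(f"kept {rows_after} of {rows_before} rows ({rows_after / rows_before:.0%})")
```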

Input:
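
The input for this step is missing from the source. The table below has the shape of what pandas' `describe` prints, and its counts (for example 2785 for ph) suggest it was run on the frame as loaded, before `dropna`. A likely reconstruction, shown on a small hypothetical frame named `demo`:

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the frame loaded from water_potability.csv
demo = pd.DataFrame({
    "ph": [np.nan, 3.72, 8.10, 8.32],
    "Hardness": [204.9, 129.4, 224.2, 214.4],
    "Potability": [0, 0, 1, 1],
})

# Summary statistics per numeric column; "count" excludes missing values
summary = demo.describe()
print(summary)
```

With the tutorial's own frame, the equivalent call would be `data.describe( )`.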

Output:

	ph	Hardness	Solids	Chloramines	Sulfate	Conductivity	Organic_carbon	Trihalomethanes	Turbidity	Potability
count	2785.000000	3276.000000	3276.000000	3276.000000	2495.000000	3276.000000	3276.000000	3114.000000	3276.000000	3276.000000
mean	7.080795	196.369496	22014.092526	7.122277	333.775777	426.205111	14.284970	66.396293	3.966786	0.390110
std	1.594320	32.879761	8768.570828	1.583085	41.416840	80.824064	3.308162	16.175008	0.780382	0.487849
min	0.000000	47.432000	320.942611	0.352000	129.000000	181.483754	2.200000	0.738000	1.450000	0.000000
25%	6.093092	176.850538	15666.690297	6.127421	307.699498	365.734414	12.065801	55.844536	3.439711	0.000000
50%	7.036752	196.967627	20927.833607	7.130299	333.073546	421.884968	14.218338	66.622485	3.955028	0.000000
75%	8.062066	216.667456	27332.762127	8.114887	359.950170	481.792304	16.557652	77.337473	4.500320	1.000000
max	14.000000	323.124000	61227.196008	13.127000	481.030642	753.342620	28.300000	124.000000	6.739000	1.000000

Input:
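
The input for this step is also missing. The listing below matches the report that pandas' `DataFrame.info` produces, so the call was probably `data.info( )`. A small sketch on a hypothetical frame, writing the report into a buffer so that it can be inspected as text:

```python
import io

import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the frame loaded from water_potability.csv
demo = pd.DataFrame({
    "ph": [np.nan, 3.72, 8.10, 8.32],
    "Hardness": [204.9, 129.4, 224.2, 214.4],
    "Potability": [0, 0, 1, 1],
})

# info() normally prints straight to the console; passing buf captures the text
buffer = io.StringIO()
demo.info(buf=buffer)
report = buffer.getvalue()
print(report)
```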

Output:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               2785 non-null   float64
 1   Hardness         3276 non-null   float64
 2   Solids           3276 non-null   float64
 3   Chloramines      3276 non-null   float64
 4   Sulfate          2495 non-null   float64
 5   Conductivity     3276 non-null   float64
 6   Organic_carbon   3276 non-null   float64
 7   Trihalomethanes  3114 non-null   float64
 8   Turbidity        3276 non-null   float64
 9   Potability       3276 non-null   int64  
dtypes: float64( 9 ), int64( 1 )
memory usage: 256.1 KB

Input:
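
No input is given for this step either. The counts below (2 for Potability, and values equal to the non-null counts for the other columns) look like the number of distinct values per column, which pandas reports with `nunique`. A sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the frame loaded from water_potability.csv
demo = pd.DataFrame({
    "ph": [np.nan, 3.72, 8.10, 8.32],
    "Hardness": [204.9, 129.4, 224.2, 214.4],
    "Potability": [0, 0, 1, 1],
})

# Number of distinct non-null values in each column
unique_counts = demo.nunique()
print(unique_counts)
```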

Output:

Hardness           3276
Solids             3276
Chloramines        3276
Sulfate            2495
Conductivity       3276
Organic_carbon     3276
Trihalomethanes    3114
Turbidity          3276
Potability            2
dtype: int64

Input:
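
The missing input here is most likely the same null count used earlier, `data.isnull( ).sum( )`. The non-zero counts for Sulfate and Trihalomethanes suggest it was run on the frame as loaded, before `dropna`. A sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the frame loaded from water_potability.csv
demo = pd.DataFrame({
    "ph": [np.nan, 3.72, 8.10, 8.32],
    "Sulfate": [368.5, np.nan, np.nan, 356.9],
    "Potability": [0, 0, 1, 1],
})

# isnull() marks missing cells as True; sum() counts them per column
missing = demo.isnull().sum()
print(missing)
```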

Output:

Sum values
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -            
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64

Input:
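
The listing below is what the `dtypes` attribute of a DataFrame shows, so the missing input was probably `data.dtypes`. A sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the frame loaded from water_potability.csv
demo = pd.DataFrame({
    "ph": [np.nan, 3.72, 8.10, 8.32],
    "Potability": [0, 0, 1, 1],
})

# One dtype per column: measurements are floats, the label is an integer
column_types = demo.dtypes
print(column_types)
```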

Output:

ph                 float64
Hardness           float64
Solids             float64
Chloramines        float64
Sulfate            float64
Conductivity       float64
Organic_carbon     float64
Trihalomethanes    float64
Turbidity          float64
Potability           int64
dtype: object

The Potability column of this dataset is the one we need to predict, as it contains the values 0 and 1, which indicate whether the water is fit to drink (1) or not (0). Let's look at the distribution of 0 and 1 in the Potability column:

Source Code Snippet:

plot.figure( figsize=( 15, 10 ) )
sns1.countplot( x = data.Potability )  # pass the column by keyword, as recent seaborn versions expect
plot.title( "Distribution of Unsafe and Safe Water" )
plot.show( )

Output:

You should be aware that this dataset has an imbalance because there are more samples of 0s than 1s.
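
The imbalance can also be read as numbers rather than from the plot. A minimal sketch with pandas' `value_counts`, using a handful of hypothetical labels in place of the real column:

```python
import pandas as pd

# Hypothetical labels; with the tutorial's frame this would be data["Potability"]
labels = pd.Series([0, 0, 0, 1, 1], name="Potability")

counts = labels.value_counts()                # absolute number of samples per class
shares = labels.value_counts(normalize=True)  # the same as fractions of the total
print(counts)
print(shares)
```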

As mentioned above, no factor affecting water quality should be overlooked, so let's explore the columns one by one. Let's start with the ph column:

Source Code Snippet:

import plotly.express as px
figure = px.histogram( data, x = "ph", 
                      color = "Potability", 
                      title= "Factors Affecting Water Quality: PH" )
figure.show( )

Output:

The ph column represents the pH value of the water, which is an important factor in evaluating its acid-base balance. The pH of drinking water should be between 6.5 and 8.5. Now let's look at the second factor affecting water quality in the dataset:

Source Code Snippet:

figure = px.histogram( data, x = "Hardness", 
                      color = "Potability", 
                      title= "Factors Affecting Water Quality: Hardness" )
figure.show( )

Output:

The figure above shows the distribution of water hardness in the dataset. The hardness of water usually depends on its source, but water with a hardness of 120 to 200 mg/L is drinkable. Now let's look at the next factor affecting water quality:

Source Code Snippet:

figure = px.histogram( data, x = "Solids", 
                      color = "Potability", 
                      title= "Factors Affecting Water Quality: Solids" )
figure.show( )

Output:

The figure above shows the distribution of total dissolved solids in the dataset. Dissolved solids are the organic and inorganic minerals present in water. Water with a very high dissolved solids content is highly mineralised. Now let's look at the next factor affecting water quality:

Source Code Snippet:

figure = px.histogram( data, x = "Chloramines", 
                      color = "Potability", 
                      title= "Factors Affecting Water Quality: Chloramines" )
figure.show( )

Output:

The figure above shows the distribution of chloramines in the dataset. Chlorine and chloramine are the disinfectants used in public water systems. Now let's look at the next factor affecting water quality:

Source Code Snippet:

figure = px.histogram( data, x = "Sulfate", 
                      color = "Potability", 
                      title= "Factors Affecting Water Quality: Sulfate" )
figure.show( )

Output:

The figure above shows the distribution of sulphate in the dataset. Sulphates are substances that occur naturally in minerals, soil and rocks. Water containing less than 500 mg/L of sulphate is considered safe to drink. Next, let's look at another factor:

Source Code Snippet:

figure = px.histogram( data, x = "Conductivity", 
                      color = "Potability", 
                      title= "Factors Affecting Water Quality: Conductivity" )
figure.show( )

Output:

The figure above shows the distribution of electrical conductivity in the dataset. Water in general conducts electricity well, but in its purest form it is a poor conductor. Water with an electrical conductivity below 500 µS/cm is considered drinkable. Next, let's look at another factor:

Source Code Snippet:

figure = px.histogram( data, x = "Organic_carbon", 
                      color = "Potability", 
                      title= "Factors Affecting Water Quality: Organic Carbon" )
figure.show( )

Output:

The figure above shows the distribution of organic carbon in the dataset. Organic carbon comes from the breakdown of organic matter from both natural and synthetic sources. Water containing less than 25 mg/L of organic carbon is considered safe to drink. Now let's look at the next factor that affects the quality of drinking water:

Source Code Snippet:

figure = px.histogram( data, x = "Trihalomethanes", 
                      color = "Potability", 
                      title= "Factors Affecting Water Quality: Trihalomethanes" )
figure.show( )

Output:

The figure above shows the distribution of trihalomethanes (THMs) in the dataset. THMs are chemicals found in chlorine-treated water. Water containing less than 80 µg/L of THMs is considered safe to drink. Now let's look at the next factor in the dataset that affects the quality of drinking water:

Source Code Snippet:

figure = px.histogram( data, x = "Turbidity", 
                      color = "Potability", 
                      title= "Factors Affecting Water Quality: Turbidity" )
figure.show( )

Output:

The figure above shows the distribution of turbidity in the dataset. The turbidity of water depends on the quantity of suspended solids. Water with a turbidity below 5 NTU is considered safe to drink.

Python-based Water Quality Prediction Model

The section above explored all the features that affect water quality. The next step is to train a machine learning model for water quality analysis using Python. I'll use the PyCaret library for this task. If you have never used this library before, you can install it on your system with pip:

pip install pycaret

Before building a machine learning model, let's look at the correlation between the features. The output below lists the correlation of every column with the ph column, sorted in descending order:

Source Code Snippet:
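
The snippet for this step is missing from the source. The `Name: ph` line and the descending order of the output suggest that it computed the correlation matrix and sorted the ph column. A sketch of that reconstruction on a small hypothetical frame:

```python
import pandas as pd

# Tiny hypothetical stand-in for the cleaned frame
demo = pd.DataFrame({
    "ph": [6.5, 7.0, 7.5, 8.0],
    "Hardness": [200.0, 180.0, 210.0, 190.0],
    "Potability": [0, 1, 0, 1],
})

# Pairwise Pearson correlations between all numeric columns
correlation = demo.corr()

# Correlation of every column with ph, strongest positive first
ph_correlation = correlation["ph"].sort_values(ascending=False)
print(ph_correlation)
```

Selecting `correlation["Potability"]` in the same way would show how each feature relates to the target instead.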

Output:

ph                 1.000000
Hardness           0.108948
Organic_carbon     0.028375
Trihalomethanes    0.018278
Potability         0.014530
Conductivity       0.014128
Sulfate            0.010524
Chloramines       -0.024768
Turbidity         -0.035849
Solids            -0.087615
Name: ph, dtype: float64

Now let's use PyCaret to see which machine learning algorithm performs best on this dataset:

Source Code Snippet:

from pycaret.classification import *
# silent = True is a PyCaret 2.x argument; PyCaret 3.x no longer accepts it, so drop it there
clf = setup( data, target = "Potability", silent = True, session_id = 786 )
compare_models( )

Output:

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC
rf	Random Forest Classifier	0.6830	0.7005	0.4197	0.6744	0.5133	0.2976	0.3182
qda	Quadratic Discriminant Analysis	0.6823	0.7192	0.3985	0.6883	0.5013	0.2917	0.3174
et	Extra Trees Classifier	0.6816	0.6941	0.3861	0.6858	0.4916	0.2863	0.3123
lightgbm	Light Gradient Boosting Machine	0.6652	0.6916	0.4762	0.6078	0.5324	0.2781	0.2840
gbc	Gradient Boosting Classifier	0.6602	0.6738	0.3718	0.6306	0.4667	0.2419	0.2603
nb	Naive Bayes	0.6184	0.6078	0.2478	0.5545	0.3412	0.1261	0.1462
dt	Decision Tree Classifier	0.6034	0.5895	0.5186	0.5049	0.5097	0.1775	0.1784
lr	Logistic Regression	0.5984	0.5199	0.0071	0.1900	0.0134	0.0028	0.0127
ridge	Ridge Classifier	0.5984	0.0000	0.0089	0.1583	0.0168	0.0035	0.0056
lda	Linear Discriminant Analysis	0.5977	0.4903	0.0089	0.1500	0.0167	0.0021	0.0024
ada	Ada Boost Classifier	0.5956	0.5671	0.2919	0.4896	0.3644	0.0972	0.1034
knn	K Neighbors Classifier	0.5743	0.5423	0.3644	0.4642	0.4070	0.0826	0.0846
svm	SVM - Linear Kernel	0.5194	0.0000	0.3982	0.1604	0.2287	-0.0014	-0.0104
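
To make the column headers of this comparison concrete, here is a sketch of how Accuracy, Recall, Precision, F1, Cohen's kappa and the Matthews correlation coefficient (MCC) follow from a binary confusion matrix, using their standard definitions. The `binary_metrics` helper and the counts passed to it are hypothetical and are not taken from the tutorial's run.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common binary classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (no empty class or prediction).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)        # share of truly potable samples that were found
    precision = tp / (tp + fp)     # share of "potable" predictions that were right
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: agreement beyond what chance alone would produce
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)

    # Matthews correlation coefficient
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }


# Hypothetical counts for an imbalanced problem
scores = binary_metrics(tp=40, fp=20, fn=60, tn=120)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

On an imbalanced dataset, accuracy alone can look acceptable while recall stays low, which is why the kappa and MCC columns are worth reading alongside it.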

According to the result above, the random forest classifier is the best choice for training a machine learning model for water quality analysis. So let's train the model and examine its predictions:

Source Code Snippet:

model = create_model( "rf" )
predict = predict_model( model, data=data )
predict.head( )

Output:

[Output: the first five rows of the prediction table, with the columns ph, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic_carbon, Trihalomethanes, Turbidity, Potability and Label. The numeric values are not recoverable from the source; the last two columns appear to read 1 in each of the five rows.]

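In the rows shown above, the predicted label appears to match the actual Potability value. Keep in mind that these predictions are made on the same data that was passed to setup, which includes rows the model was trained on, and that the cross-validated accuracy of the random forest in the comparison table is about 0.68. I hope you enjoyed my Python-based machine learning experiment on analysing water quality.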
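
Because the predictions above are made on `data`, which includes rows used for training, they can look better than the model really is. Below is a minimal sketch of scoring on held-out rows instead. It uses scikit-learn's `RandomForestClassifier` directly rather than the PyCaret workflow, and it runs on synthetic data: the feature values, the rule that generates the label and the seed are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: three numeric features and a noisy binary label
rng = np.random.default_rng(786)
features = pd.DataFrame(
    rng.normal(size=(200, 3)), columns=["ph", "Hardness", "Solids"]
)
target = (features["ph"] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Hold out 30% of the rows; stratify keeps the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, stratify=target, random_state=786
)

forest = RandomForestClassifier(n_estimators=100, random_state=786)
forest.fit(X_train, y_train)

# Accuracy on rows the model has seen versus rows it has not
train_accuracy = accuracy_score(y_train, forest.predict(X_train))
test_accuracy = accuracy_score(y_test, forest.predict(X_test))
print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

The gap between the two numbers gives a rough idea of how much the training-set score overstates performance on new samples.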

Summary

So this is how you can analyse water quality and train a machine learning model to distinguish between water that is safe to drink and water that is not. I hope you enjoyed reading this tutorial on Python-based machine learning for water quality analysis. Please feel free to send your remarks by mail.
