🔰 Data Visualization and its Importance in Machine Learning

Data visualization plays a crucial role in machine learning (ML) for several reasons. It is not just a tool for presenting results but also an integral part of the entire ML workflow, from data exploration to model evaluation. Here are the key reasons why data visualization is important in machine learning:

1️⃣ Understanding the Data

👉 Exploratory Data Analysis (EDA): Visualization helps in understanding the structure, patterns, and relationships within the data. It allows data scientists to identify trends, correlations, and outliers that might not be apparent from raw data.
👉 Data Distribution: Visualizing data distributions (e.g., histograms, box plots) helps in understanding the spread, skewness, and central tendencies of features, which is critical for selecting appropriate preprocessing techniques and models.
👉 Feature Relationships: Scatter plots, pair plots, and correlation matrices help in identifying relationships between features, which can guide feature selection and engineering.

2️⃣ Data Cleaning and Preprocessing

👉 Identifying Missing Data: Visualizations like heatmaps or bar charts can highlight missing or incomplete data, enabling better handling of such issues.
👉 Outlier Detection: Box plots, scatter plots, and other visual tools help in identifying outliers that could negatively impact model performance.
👉 Data Transformation: Visualizing data before and after transformations (e.g., normalization, log transformation) helps confirm that the preprocessing steps are effective.

3️⃣ Feature Engineering

👉 Feature Importance: Visualizing feature importance (e.g., using bar charts or tree-based model outputs) helps in selecting the most relevant features for the model.
👉 Interaction Effects: Visualizing interactions between features can reveal complex relationships that might need to be explicitly modeled.

4️⃣ Model Selection and Tuning

👉 Performance Comparison: Visualizing metrics like accuracy, precision, recall, and F1-score across different models helps in selecting the best-performing model.
👉 Hyperparameter Tuning: Visualizing the impact of hyperparameters (e.g., learning curves, validation curves) helps in optimizing model performance.
👉 Bias-Variance Tradeoff: Visualizing training and validation errors helps in diagnosing overfitting or underfitting.

5️⃣ Model Interpretation

👉 Explainability: Visual tools like SHAP (SHapley Additive exPlanations) values, LIME (Local Interpretable Model-agnostic Explanations), and partial dependence plots help in interpreting complex models like neural networks and ensemble methods.
👉 Decision Boundaries: Visualizing decision boundaries (e.g., in classification tasks) helps in understanding how the model separates different classes.

6️⃣ Communicating Results

👉 Stakeholder Understanding: Visualizations make it easier to communicate insights and results to non-technical stakeholders. Charts, graphs, and dashboards are more intuitive than raw numbers or equations.
👉 Storytelling: Effective visualizations help in telling a compelling story about the data, the model, and its predictions, making it easier to drive decision-making.

7️⃣ Debugging and Error Analysis

👉 Error Patterns: Visualizing errors (e.g., confusion matrices, residual plots) helps in identifying patterns in model mistakes, which can guide improvements.
👉 Model Behavior: Visualizing predictions versus actual values helps in understanding where the model is performing well and where it is struggling.

🔴 Tools for Data Visualization in Machine Learning

👉 Libraries: Matplotlib, Seaborn, Plotly, ggplot, and Bokeh are popular libraries for creating static and interactive visualizations.
👉 Dashboards: Tools like Tableau, Power BI, and Dash help in creating interactive dashboards for monitoring and presenting results.
👉 Specialized Tools: SHAP, LIME, and Yellowbrick are specialized tools for visualizing model interpretability and performance.

📕 Barplot:

A barplot (or bar chart) is a graphical representation of data using rectangular bars. The length or height of each bar is proportional to the value it represents. Barplots are widely used in data visualization to compare and display categorical data.

sns.barplot(x='day', y='total_bill', data=tips, palette='tab10');

📚 Boxplot

A box plot is a statistical chart that visually summarizes data distribution. It shows the minimum, first quartile, median, third quartile, and maximum values. The box represents the interquartile range (IQR), and whiskers extend to show the range of data, excluding outliers. Outliers are often plotted as individual points.

sns.boxplot(x='day', y='total_bill', hue='sex', data=tips, linewidth =2.5, palette='Dark2');

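To make the quantities in a box plot concrete, here is a minimal sketch in plain Python that computes the quartiles, the IQR, the whisker ends, and the outliers for one group. The helper name `box_stats`, the toy bill values, and the choice of the 1.5 × IQR rule with "inclusive" quartile interpolation are assumptions for illustration; plotting libraries may interpolate quartiles slightly differently.

```python
import statistics

def box_stats(values, whis=1.5):
    """Summarise one group the way a box plot does (hypothetical helper)."""
    data = sorted(values)
    # Quartiles with linear interpolation between data points
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    # Whiskers stop at the most extreme points that are still inside the fences
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Made-up bill amounts with one unusually large value
bills = [10, 12, 13, 15, 16, 18, 20, 21, 50]
stats = box_stats(bills)
print(stats)
```

In this toy sample the value 50 lies beyond the upper fence, so it would be drawn as an individual point while the upper whisker stops at 21.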

🟥 Kdeplot

A KDE plot, or Kernel Density Estimation plot, is a smooth curve that estimates the probability density function of a continuous variable. It provides a visual representation of the data's distribution, highlighting its shape, central tendency, and spread. By examining the peaks, valleys, and overall shape of the KDE curve, insights can be gained into the data's characteristics.

sns.kdeplot(data=df , x='Age', hue='Sex', multiple='stack', palette='tab10');

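The smooth curve can be understood as a sum of small bumps, one per observation. The sketch below is a hand-rolled Gaussian kernel density estimate in plain Python; the function name `gaussian_kde`, the fixed bandwidth of 4.0, and the toy ages are assumptions for illustration, and seaborn chooses its bandwidth automatically rather than using this value.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian bumps (illustrative only)."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Each sample contributes a bell curve centred on itself
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Made-up ages: a cluster around 22-35 and a smaller one around 60
ages = [22, 25, 26, 30, 31, 35, 58, 60]
density = gaussian_kde(ages, bandwidth=4.0)

# Evaluate on a grid from 0 to 100 and check that the curve integrates to about 1
step = 0.5
grid = [i * step for i in range(201)]
area = sum(density(x) for x in grid) * step
print(f"area under curve = {area:.3f}")
print(f"density at 28 = {density(28):.4f}, density at 45 = {density(45):.4f}")
```

A larger bandwidth would merge the two clusters into one broad hump, while a smaller one would produce a spikier curve.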

🟧 Violinplot

A violin plot is a statistical chart that combines a box plot with a kernel density plot. It provides a richer visualization of data distribution, showing both the overall shape and density of the data. The wider parts of the violin represent regions with higher data density, while narrower parts indicate lower density. Violin plots are useful for comparing distributions, especially when there are multiple groups or categories.

sns.violinplot(x="day", y="total_bill", data=tips);

🟨 Stripplot

A stripplot is a simple visualization that shows the distribution of data points along a single axis. It's useful for visualizing the spread and density of data, especially when combined with other plots like box plots or violin plots. Stripplots can be customized with jitter to reduce overlap between points and enhance readability.

sns.stripplot(x="time", y="total_bill", hue="sex", data=tips);

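The jitter mentioned above is just a small random horizontal offset added to each point. A minimal sketch in plain Python, assuming a uniform offset and a hypothetical helper `jittered_x` (seaborn's own implementation may differ in detail):

```python
import random

def jittered_x(categories, jitter=0.1, seed=0):
    """Map each category to an integer slot and add a small random offset."""
    rng = random.Random(seed)  # seeded so the layout is reproducible
    # Slots follow the order in which categories first appear
    slots = {c: i for i, c in enumerate(dict.fromkeys(categories))}
    return [slots[c] + rng.uniform(-jitter, jitter) for c in categories]

# Made-up category labels for five points
times = ["Lunch", "Dinner", "Dinner", "Lunch", "Dinner"]
xs = jittered_x(times)
print([round(x, 3) for x in xs])
```

Points stay within ±0.1 of their category position, so groups remain visually separate while overlapping values spread apart.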

🟩 Scatterplot

A scatter plot is a graph used to visualize the relationship between two numerical variables. Each data point is represented by a dot on the plot, with its position determined by its values on the x and y axes. Scatter plots are useful for identifying trends, correlations, and outliers in data. They can also be used to fit regression lines to model the relationship between the variables.

sns.scatterplot(x = 'total_bill', y = 'tip', hue = 'sex', data = tips);

🟦 Swarmplot

A swarmplot is a visualization that displays individual data points along an axis. It's similar to a stripplot, but it adjusts the positions of points to minimize overlap, making it easier to visualize the distribution of data, especially when there are many data points. Swarmplots are often used in conjunction with box plots or violin plots to provide a more detailed view of the data.

sns.swarmplot(x="day", y="total_bill", hue="sex", data=tips);

🟪 Boxenplot

A boxenplot is an enhanced box plot that provides a more detailed view of data distribution. It displays multiple quantiles, revealing more information about the shape of the distribution, especially in the tails. Boxenplots are particularly useful for large datasets where traditional box plots might not be sufficient.

sns.boxenplot( x='time', y="total_bill", hue='sex', data=tips);

⬛️ Lineplot

A line plot is a visualization used to show trends over time or across categories. It connects data points with lines, revealing patterns, increases, decreases, and overall changes. Line plots are excellent for tracking variables over time or comparing multiple variables simultaneously. By analyzing the slope and direction of the lines, insights can be gained into the underlying trends and relationships in the data.

sns.lineplot(x="size",y="total_bill",data=tips,hue='sex',markers=True);

🟫 Jointplot

A Jointplot is a visualization that combines a scatter plot with histograms along the x and y axes. It provides a comprehensive view of the relationship between two numerical variables, along with their individual distributions. This allows for a deeper understanding of the bivariate relationship and the marginal distributions of each variable. Jointplots are particularly useful for identifying trends, correlations, and outliers in the data.

sns.jointplot(x="chol",y="trtbps",data=heart,kind="kde",hue='sex');

🔵 Lmplot

Lmplot, a function in Seaborn, is used to visualize linear relationships between numerical variables. It combines scatter plots with regression lines, providing insights into the correlation and potential trends between the variables. Lmplot can also handle categorical variables, allowing for visualizing relationships within different groups.

g= sns.lmplot(x="age", y="chol", hue="cp", data=heart)

🟢 Relplot

Relplot is a versatile function in Seaborn that allows for creating various relational plots, including scatter plots, line plots, and more. It offers flexibility in customizing the appearance, adding multiple dimensions (hue, size, style), and organizing plots into subplots. Relplot is a powerful tool for exploring relationships between variables and understanding data trends.

g = sns.relplot(x="age", y="chol", data=heart,hue='sex')

🟣 Heatmap

A heatmap is a 2D visualization that represents data values as colors. How colors map to values depends on the chosen colormap; with many common colormaps, warmer colors indicate higher values and cooler colors indicate lower values. Heatmaps are useful for identifying patterns, trends, and anomalies within data, especially when dealing with large datasets. They are commonly used in fields like finance, biology, and machine learning.

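corr = tips.corr(numeric_only=True)  # numeric_only skips the categorical columns (needed on pandas >= 2.0)
mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the upper triangle, which mirrors the lower one
sns.heatmap(corr, mask=mask, annot=True, cmap='Dark2');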

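The upper-triangle mask is easier to follow on a tiny example. The sketch below uses plain Python instead of NumPy: it computes a Pearson correlation matrix for three made-up columns and builds the same kind of boolean mask, where `True` means "hide this cell". The column values and the helper `pearson` are assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up values standing in for three numeric columns
columns = {
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 3.0, 4.0, 7.0],
    "size": [2, 2, 3, 4],
}
names = list(columns)
n = len(names)

corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
# True on and above the diagonal, mirroring np.triu applied to a matrix of ones
mask = [[j >= i for j in range(n)] for i in range(n)]

# Only the unmasked (strictly lower-triangle) cells would be drawn
visible = {
    (names[i], names[j]): round(corr[i][j], 2)
    for i in range(n) for j in range(n) if not mask[i][j]
}
print(visible)
```

Because a correlation matrix is symmetric, the three visible cells carry all the information; the diagonal (always 1) and the mirrored upper half are redundant.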

🔵 Catplot

Catplot in Seaborn is a versatile function for visualizing categorical data. It creates various plot types like strip plots, swarm plots, box plots, violin plots, and bar plots, helping to understand data distribution across categories.

sns.catplot(x='smoker', col='sex', kind='count', data=tips ,palette="Dark2")

🟧 Correlation with Response Variable class

X = heart.drop(['HeartDisease'], axis=1)
y = heart['HeartDisease']

# Correlation of every feature with the target, drawn as a bar chart
X.corrwith(y).plot.bar(figsize=(16, 4), rot=90, grid=False)
plt.title('Correlation with heart', fontsize=25, color='Blue', font='Times New Roman')
plt.show()

🟥 Correlation Analysis

import matplotlib
matplotlib.rcParams.update({'font.size': 12})

corr = heart.corr(numeric_only=True)  # numeric_only skips text columns (needed on pandas >= 2.0)
mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the redundant upper triangle
plt.figure(dpi=100)
plt.title('Correlation Analysis', fontsize=15, color='Blue', font='Lucida Calligraphy')
sns.heatmap(corr, mask=mask, annot=True, linewidths=0, linecolor='white', cmap='viridis', fmt="0.2f")
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.show()

Pie Chart

matplotlib.rcParams.update({'font.size': 15})
ax = heart['Sex'].value_counts().plot.pie(explode=[0.1, 0.1], autopct='%1.2f%%', shadow=True)
ax.set_title(label="Sex", fontsize=40, color='DarkOrange', font='Lucida Calligraphy')
plt.legend(labels=['M', 'F'])
plt.axis('off');

Bar Plot

sns.set_style("white")
sns.set_context("poster",font_scale = 1.2)
palette = ["#1d7874","#679289","#f4c095","#ee2e31","#ffb563","#918450","#f85e00","#a41623","#9a031e","#d6d6d6","#ffee32","#ffd100","#333533","#202020"]
# sns.palplot(sns.color_palette(palette))
# plt.show()

plt.subplots(figsize=(20,8))
p = sns.barplot(x=titanic["Pclass"], y=titanic["Age"], palette=palette, saturation=1, edgecolor="#1c1c1c", linewidth=2)
p.axes.set_title("\nMean Age by Passenger Class\n", fontsize=25)
plt.ylabel("Mean Age", fontsize=20)
plt.xlabel("\nPclass", fontsize=20)
# plt.yscale("log")
plt.xticks(rotation = 90)
for container in p.containers:
    p.bar_label(container,label_type = "center",padding = 6,size = 25,color = "black",rotation = 90,
    bbox={"boxstyle": "round", "pad": 0.6, "facecolor": "orange", "edgecolor": "black", "alpha": 1})

sns.despine(left=True, bottom=True)
plt.show()

Outlier Distribution

numfeature = ["Age", "Fare"]
enumfeat = list(enumerate(numfeature))

plt.figure(figsize=(20, 7))
plt.suptitle("Distribution and Outliers of Numerical Data", fontsize=25, color='Blue')
# Box plots go in the first two panels
for i in enumfeat:
    plt.subplot(1, 4, i[0] + 1)
    sns.boxplot(data=titanic[i[1]], palette="Dark2")
    plt.xlabel(str(i[1]))
# Histograms go in the last two panels
for i in enumfeat:
    plt.subplot(1, 4, i[0] + 3)
    sns.histplot(data=titanic[i[1]], palette="tab10", bins=15)
    plt.xlabel(str(i[1]))
plt.tight_layout()
plt.show()

Probability Distribution

plt.figure(figsize=(15, 7))
plt.suptitle("Probability Distribution of numerical columns according to number of Survived", fontsize=25, color="Red")
# One KDE panel per numerical column, split by survival
for i in enumfeat:
    plt.subplot(1, 2, i[0] + 1)
    sns.kdeplot(data=titanic, x=i[1], hue="Survived")
plt.tight_layout()
plt.show()

Countplot

countfeature = ["Survived", "Pclass", "Sex", "SibSp", "Parch", "Embarked"]
countlist = list(enumerate(countfeature))

plt.figure(figsize=(15, 10))
plt.suptitle("Countplot of Categorical Features", fontsize=25, color='Red')
# One count plot per categorical feature in a 2 x 3 grid
for i in countlist:
    plt.subplot(2, 3, i[0] + 1)
    sns.countplot(data=titanic, x=i[1], hue="Survived", palette="rainbow")
    plt.ylabel("")
    plt.legend(['Not Survived', 'Survived'], loc='upper center', prop={'size': 10})
plt.tight_layout()
plt.show()

Distribution Plot

# set configuration for charts
plt.rcParams["figure.figsize"] = [18, 7]
plt.rcParams["font.size"] = 15
plt.rcParams["legend.fontsize"] = "medium"
plt.rcParams["figure.titlesize"] = "medium"

def plot_distribution(data, x, color, bins):
    mean = data[x].mean()
    std = data[x].std()
    info = dict(data=data, x=x, color=color)

    # Histogram with a KDE overlay (sns.distplot is deprecated in favour of histplot)
    plt.subplot(1, 3, 1)
    sns.histplot(data=data, x=x, bins=bins, color=color, kde=True)
    plt.axvline(mean, color="red", label=f"mean = {mean:.2f}, $\\sigma$ = {std:.2f}")
    plt.xlabel(f"bins of {x}")
    plt.ylabel("frequency")
    plt.legend()
    plt.title(f"histogram of {x} column")

    plt.subplot(1, 3, 2)
    sns.boxplot(**info)  # unpack the dict into keyword arguments
    plt.xlabel(f"{x}")
    plt.title(f"box plot of {x} column")

    plt.subplot(1, 3, 3)
    sns.swarmplot(**info)
    plt.xlabel(f"{x}")
    plt.title(f"distribution of points in {x} column")

    plt.suptitle(f"Distribution of {x} column", fontsize=20, color="red")
    plt.show()

age_bins = np.arange(29, 77 + 5, 5)
base_color = sns.color_palette()[4]
plot_distribution(data=heart, x="Age", color=base_color, bins=age_bins)

Scatter Plot

num = wine.select_dtypes(include="number")
fig, ax = plt.subplots(14, 1, figsize=(7, 30))
for indx, (column, axes) in enumerate(zip(num, ax.flatten())):
    sns.scatterplot(ax=axes, y=wine[column].index, x=wine[column], hue=wine['total sulfur dioxide'],
                    palette='magma', alpha=0.8)
else:
    # Hide the axes left over when there are fewer numeric columns than subplots
    [axes.set_visible(False) for axes in ax.flatten()[indx + 1:]]
plt.tight_layout()
plt.show()

Count Plot

cat = ['Sex', 'Embarked']
sns.set_theme(rc={'figure.dpi': 100, 'axes.labelsize': 12, 'axes.facecolor': '#f0eee9', 'grid.color': '#fffdfa', 'figure.facecolor': '#e8e6e1'}, font_scale=1.2)
fig, ax = plt.subplots(5, 2, figsize=(12, 22))
for indx, (column, axes) in enumerate(zip(cat, ax.flatten())):
    sns.countplot(ax=axes, x=titanic[column], hue=titanic['Pclass'],
                  palette='magma', alpha=0.8)
else:
    # Hide the axes that were not used
    [axes.set_visible(False) for axes in ax.flatten()[indx + 1:]]
plt.tight_layout()
plt.show()

Histplot with Quantile Lines

num = heart.select_dtypes(include="number")
fig, ax = plt.subplots(3, 2, figsize=(14, 15))
for indx, (column, axes) in enumerate(zip(num, ax.flatten())):
    sns.histplot(ax=axes, x=heart[column], hue=heart['HeartDisease'],
                 palette='magma', alpha=0.8, multiple='stack')

    # Rebuild the legend by hand: sns.histplot has some issues with legends
    legend = axes.get_legend()
    handles = legend.legend_handles  # named legendHandles before matplotlib 3.7
    legend.remove()
    axes.legend(handles, ['0', '1'], title='HeartDisease', loc='upper right')

    # Mark the minimum, the quartiles, and the maximum with thin red lines
    Quantiles = np.quantile(heart[column], [0, 0.25, 0.50, 0.75, 1])
    for q in Quantiles:
        axes.axvline(x=q, linewidth=0.5, color='r')

plt.tight_layout()
plt.show()

Barcharts

raw_df = raw_df [['name', 'year', 'selling_price', 'km_driven', 'fuel', 'seller_type', 'transmission', 'owner']]

# Function to print the width of each bar on a horizontal bar chart
def barw(ax):
    for p in ax.patches:
        val = p.get_width()                 # bar length (the count, since bars are horizontal)
        x = p.get_x() + p.get_width()       # x-position: end of the bar
        y = p.get_y() + p.get_height() / 2  # y-position: middle of the bar
        ax.annotate(round(val, 2), (x, y))

plt.figure(figsize=(10, 5))
ax0 = sns.countplot(data=raw_df, y='owner', order=raw_df['owner'].value_counts().index)
barw(ax0)
plt.show()

🟩 Pie Chart

education = df['parental level of education'].value_counts()
sns.set_palette('bright')
plt.figure(figsize=(10, 7))
labels = education.index
sizes = education.values
plt.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
plt.show()

🟫 Correlation Plot

plt.figure(figsize=(12, 8))
data_4 = data.corr()["Fire Alarm"].sort_values(ascending=False)
data_4 = data_4.iloc[1:]  # skip the first entry, the target's correlation with itself
sns.barplot(x=data_4.values, y=data_4.index, palette='mako')
plt.title('Correlation coefficient between different features and Fire Alarm')
plt.show()

⬛️ Pairplot

import matplotlib
matplotlib.rcParams.update({'font.size': 15})
cols_out = ["RestingBP", "Cholesterol", "MaxHR", "Age", 'ChestPainType']
# pairplot builds its own figure, so no plt.figure() call is needed
sns.pairplot(heart[cols_out], hue="ChestPainType", diag_kind="hist", palette="tab10")
plt.show()

🟨 Countplot with Percentage

fig, ax = plt.subplots(figsize=(18, 8))
sns.countplot(x=wine["quality"], ax=ax)
plt.title("Wine Quality Count", fontsize=20, color='#1a4441', font='Comic Sans Ms', pad=20)
plt.xlabel("Quality", fontsize=15, color='#1a4441', font='Comic Sans Ms')
plt.ylabel("Count", fontsize=15, color='#1a4441', font='Comic Sans Ms')

# Annotate each bar with its share of all rows
total = len(wine)
for p in ax.patches:
    percentage = f'{100 * p.get_height() / total:.1f}%\n'
    x = p.get_x() + p.get_width() / 2
    y = p.get_height()
    ax.annotate(percentage, (x, y), ha='center', va='center')

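The percentage labels are simply each category's count divided by the total number of rows. A small sketch in plain Python with made-up quality scores shows the arithmetic behind the annotations, independent of any plotting library:

```python
from collections import Counter

# Made-up quality scores standing in for a quality column
qualities = [5, 6, 5, 7, 6, 5, 8, 6, 5, 6]

counts = Counter(qualities)
total = len(qualities)

# Same formatting as the bar annotations: one decimal place plus a percent sign
labels = {q: f"{100 * c / total:.1f}%" for q, c in sorted(counts.items())}
print(labels)
```

The labels always sum to roughly 100% (up to rounding), which is a quick sanity check for annotated count plots.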

🟥 Distinguish color for different Values

print("Columns sorted by skewness value:\n")
skew_df = wine.skew().sort_values()
print(skew_df)

# Split the columns into three groups using a skewness limit of 2
near_symmetric = skew_df[(skew_df < 2) & (skew_df > -2)]
positive = skew_df[skew_df > 2]
negative = skew_df[skew_df < -2]

fig, ax = plt.subplots(figsize=(25, 7))
ax.bar(x=near_symmetric.index, height=near_symmetric, color="g", label="Semi-normal distribution")
ax.bar(x=positive.index, height=positive, color="r", label="Positively skewed features")
ax.bar(x=negative.index, height=negative, color="b", label="Negatively skewed features")
ax.legend()
fig.suptitle("Skewness of numerical columns", fontsize=20)
ax.tick_params(labelrotation=90);

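The colour rule can be checked on small samples. The sketch below computes the adjusted Fisher-Pearson sample skewness in plain Python (which should correspond to the bias-corrected estimator pandas reports, though that is an assumption here) and maps it to the same three colours. The helper names and the toy samples are hypothetical, and values exactly at the limit are treated as roughly symmetric in this sketch.

```python
import math

def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness of a sample (needs at least 3 values)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))  # sample std
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / s) ** 3 for v in values)

def skew_colour(skew, limit=2.0):
    """Red for strong right skew, blue for strong left skew, green otherwise."""
    if skew > limit:
        return "r"
    if skew < -limit:
        return "b"
    return "g"

symmetric = [1, 2, 3, 4, 5, 6, 7]
right_tailed = [1] * 9 + [100]            # one large value creates a long right tail
left_tailed = [-v for v in right_tailed]  # mirror image, long left tail

for name, sample in [("symmetric", symmetric), ("right", right_tailed), ("left", left_tailed)]:
    skew = sample_skewness(sample)
    print(name, round(skew, 2), skew_colour(skew))
```

Mirroring a sample flips the sign of its skewness, which is why the left-tailed sample lands in the blue group.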

🟫 WordCloud

from wordcloud import WordCloud, STOPWORDS

# Join every entry of the Cuisines column into one long string
text = " ".join(Company for Company in df["Cuisines"])

# font = "Quicksand-Bold.ttf"

word_cloud = WordCloud(width=2300, height=800, colormap='jet', background_color="white").generate(text)
plt.figure(figsize=(50, 8))
plt.imshow(word_cloud, interpolation="gaussian")
plt.axis("off")
plt.show()

📕 Scatter Plot with different features

plt.figure(figsize=(10, 5))

# plotting the values for people who have heart disease
plt.scatter(heart.Age[heart.HeartDisease == 1], heart.Cholesterol[heart.HeartDisease == 1], c="tomato")

# plotting the values for people who don't have heart disease
plt.scatter(heart.Age[heart.HeartDisease == 0], heart.Cholesterol[heart.HeartDisease == 0], c="lightgreen")

plt.title("Heart Disease w.r.t Age and Cholesterol")
plt.xlabel("Age")
plt.legend(["Disease", "No Disease"])
plt.ylabel("Cholesterol");

Groupby Plot with different colour

df2 = df.groupby('Type Of Restaurant')['Cost Per Head'].mean().sort_values(ascending=False)
plt.figure(figsize=(15, 6))
color = ['b' if i < 500 else 'r' for i in df2]  # blue below 500 per head, red otherwise
df2.plot.bar(color=color);

📕 Boxplot for Multiple variables

import math

cont_features = ['fixed acidity', 'volatile acidity', 'citric acid', 'free sulfur dioxide', 'pH', 'alcohol']

y = 3                                  # number of columns in the grid
x = math.ceil(len(cont_features) / y)  # number of rows needed

plt.subplots(x, y, figsize=(15, 10))
for i in range(1, len(cont_features) + 1):
    plt.subplot(x, y, i)
    sns.boxplot(data=wine, y=cont_features[i-1], x='quality', palette=['#e60000', '#FAFAD2', '#660000', '#DEB078', '#FF8C00', 'black'])
plt.tight_layout()
plt.show()

Pairplot with trend Line

# kind="reg" adds a regression line to each scatter panel; corner=True drops the mirrored upper half
sns.pairplot(wine.drop(columns=['quality']), kind="reg", diag_kind='kde', plot_kws={'line_kws': {'color': 'red'}}, corner=True)
plt.tight_layout()
plt.show()

⚓️ Histplot with multiple variables

features = ['fixed acidity', 'citric acid', 'volatile acidity']
fig, axs = plt.subplots(1, 3, figsize=(16, 6))

# One histogram per feature, each on its own axes
for f, ax in zip(features, axs.ravel()):
    sns.histplot(wine, x=f, ax=ax)
plt.show()

📪 Missing values percentage per column with threshold

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

def missing_values(data, thresh=20, color='black', edgecolor='black', height=3, width=15):
    plt.figure(figsize=(width, height))
    percentage = data.isnull().mean() * 100  # share of missing values per column, in percent
    percentage.sort_values(ascending=False).plot.bar(color=color, edgecolor=edgecolor)
    plt.axhline(y=thresh, color='r', linestyle='-')  # threshold line

    plt.title('Missing values percentage per column', fontsize=20, weight='bold')

    text_x = data.shape[1] / 1.7  # place both annotations right of centre
    plt.text(text_x, thresh + 12.5, f'Columns with more than {thresh}% missing values', fontsize=12, color='crimson',
             ha='left', va='top')
    plt.text(text_x, thresh - 5, f'Columns with less than {thresh}% missing values', fontsize=12, color='green',
             ha='left', va='top')
    plt.xlabel('Columns', size=15, weight='bold')
    plt.ylabel('Missing values percentage')
    plt.yticks(weight='bold')

    plt.show()

missing_values(titanic, thresh=10, color=sns.color_palette('Reds', 15))

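The numbers behind this chart are just the fraction of missing entries per column, compared against a threshold. A minimal sketch in plain Python, using a few made-up passenger records with `None` marking a missing value (the helper `missing_percentages` and the 30% threshold are assumptions for illustration):

```python
def missing_percentages(rows, columns):
    """Percentage of rows in which each column is missing (None)."""
    total = len(rows)
    return {c: 100 * sum(r.get(c) is None for r in rows) / total for c in columns}

# Made-up records; None marks a missing value
passengers = [
    {"Age": 22, "Cabin": None, "Embarked": "S"},
    {"Age": None, "Cabin": "C85", "Embarked": "C"},
    {"Age": 26, "Cabin": None, "Embarked": "S"},
    {"Age": 35, "Cabin": None, "Embarked": None},
]

pct = missing_percentages(passengers, ["Age", "Cabin", "Embarked"])
threshold = 30
above = sorted(c for c, p in pct.items() if p > threshold)

print(pct)
print("columns above threshold:", above)
```

Columns above the threshold are typical candidates for dropping or for more careful imputation, while those below can often be filled with simple strategies.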

📓 Keep only correlation higher than a threshold

# Explore the correlation between all numerical features
corr_mat_train = wine.drop(columns=['quality']).corr()

# Keep only correlations stronger than the threshold (weaker cells become NaN and are left blank)
threshold = 0.3
corr_threshold_train = corr_mat_train[(corr_mat_train > threshold) | (corr_mat_train < -threshold)]

plt.figure(figsize=(8, 6))
sns.heatmap(corr_threshold_train, annot=True, cmap='seismic', fmt=".2f", linewidths=0.5, cbar_kws={'shrink': .5}, annot_kws={'size': 8}).set_title('Correlations Among Features (in Train)');


📊 Cascading Barplot

plt.rcParams['figure.figsize'] = (18, 5)

# Cross-tabulate rating against table booking, then normalise each row to proportions
Y = pd.crosstab(df['rate'], df['book_table'])
Y.div(Y.sum(axis=1).astype(float), axis=0).plot(kind='bar', stacked=True, color=['red', 'yellow'])

plt.title('table booking vs Normal rate', fontweight=30, fontsize=20)
plt.legend(loc="upper right")
plt.show()

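The stacked bars all reach the same height because each row of the cross-tabulation is divided by its own total. The sketch below reproduces that row normalisation in plain Python on a handful of made-up (rating, table booking) pairs; the values are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter

# Made-up (rate, book_table) pairs
pairs = [
    ("3.5", "Yes"), ("3.5", "No"), ("3.5", "No"),
    ("4.0", "Yes"), ("4.0", "Yes"), ("4.0", "No"), ("4.0", "Yes"),
]

counts = Counter(pairs)                  # the cross-tabulation as a flat counter
rates = sorted({r for r, _ in pairs})
options = sorted({b for _, b in pairs})

shares = {}
for rate in rates:
    row_total = sum(counts[(rate, o)] for o in options)
    # Divide every cell by its row total so each row sums to 1
    shares[rate] = {o: counts[(rate, o)] / row_total for o in options}

print(shares)
```

Normalising by row makes ratings with very different numbers of restaurants directly comparable, at the cost of hiding how many restaurants each bar represents.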

🧿 Pie chart with inner hollow

# Pie chart
labels = df['listed_in(type)'].value_counts().index
sizes = df['listed_in(type)'].value_counts().values

# "explode" every slice by the same offset
explode = [0.1] * len(sizes)
fig1, ax1 = plt.subplots(figsize=(8, 8))

ax1.pie(sizes, labels=labels, shadow=True, startangle=90, explode=explode, rotatelabels=True)
# Draw a white circle at the centre to turn the pie into a donut
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

# Equal aspect ratio ensures that pie is drawn as a circle
ax1.axis('equal')
plt.tight_layout()
plt.show()

📊 Distribution Plot

# check distribution of Na_to_K (based on Drug_Type)
plt.style.use('seaborn-v0_8-notebook')  # 'seaborn-notebook' was renamed in matplotlib 3.6
for i, label in enumerate(df.Drug_Type.unique().tolist()):
    # df2 appears to be an encoded copy of df in which Drug_Type is coded 1, 2, ...
    sns.kdeplot(df2.loc[df2['Drug_Type'] == i + 1, 'Na_to_K'], label=label, fill=True)  # fill replaces the deprecated shade
plt.title('1. KDE of Na_to_K (based on Drug_Type)', fontdict=font, pad=15)
plt.xticks(np.arange(0, 46, 2), rotation=90)
plt.xlim([0, 46])
plt.legend()
plt.show()

🟦 Count plot and pie plot

# draw countplot and pie plot of categorical data
for col in categorical:
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    # count of col (countplot)
    sns.countplot(data=df2, x=col, ax=axes[0])
    for container in axes[0].containers:
        axes[0].bar_label(container)

    # count of col (pie chart)
    slices = df2[col].value_counts().values
    activities = [f"{i} ({var})" for i, var in zip(df2[col].value_counts().index, df[col].value_counts().index)]

    axes[1].pie(slices, labels=activities, shadow=True, autopct='%1.1f%%')
    plt.suptitle(f'Count of Unique Value in {col}', y=1.09, **font)
    plt.show()

🟦 Count of Drug Type by Categorical Feature

for col in ['Sex', 'BP', 'Cholesterol']:
    ax = sns.countplot(data=df, x='Drug_Type', hue=col)
    for container in ax.containers:
        ax.bar_label(container)  # write the count on top of each bar
    plt.title(f'Count of Drug (based on {col})', fontdict=font, pad=15)
    plt.show()

⬛️ Mean Calculation

# Mean of Na_to_K based on each feature
for col in ['Sex', 'BP', 'Cholesterol']:
    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
    gp = df.groupby([col])['Na_to_K'].mean().to_frame().reset_index()
    sns.barplot(data=gp, x=col, y='Na_to_K', ax=ax[0])
    for container in ax[0].containers:
        ax[0].bar_label(container)
    ax[0].set_title(f'Mean of Na_to_K (based on {col})', y=1.09, **font)
    sns.boxplot(data=df, x=col, y='Na_to_K', ax=ax[1])
    ax[1].set_title(f'Boxplot of {col}', y=1.09, **font)
    plt.show()

🔰 Scatter plot for multiple features

# use scatter plot for numeric features (Age and Na_to_K)
fig, ax = plt.subplots(2, 2, figsize=(14, 8))
for i, col in enumerate(['Sex', 'BP', 'Cholesterol', 'Drug_Type']):
    sns.scatterplot(data=df, x='Age', y='Na_to_K', hue=col, ax=ax[i//2, i%2], palette='turbo')
    ax[i//2, i%2].set_title(f'Na_to_K vs Age (based on {col})', y=1.09, **font)
    ax[i//2, i%2].legend(loc='upper center', bbox_to_anchor=(1.2, 0.6),
                         fancybox=True, shadow=True)

fig.tight_layout()
plt.show()

🟦 Swarmplot

fig, ax = plt.subplots(3, 2, figsize=(14, 12))

# One row per categorical feature: Na_to_K on the left, Age on the right
for row, feature in enumerate(['Cholesterol', 'BP', 'Sex']):
    sns.swarmplot(data=df, x=feature, y='Na_to_K', hue='Drug_Type', ax=ax[row, 0])
    sns.swarmplot(data=df, x=feature, y='Age', hue='Drug_Type', ax=ax[row, 1])

ax[0, 0].set_title('Swarmplot of Drug Type vs Na_to_K', y=1.05, **font)
ax[0, 1].set_title('Swarmplot of Drug Type vs Age', y=1.05, **font)
plt.tight_layout()
plt.show()

🟨 MeanPlot

# Mean of Income, CCAvg and Mortgage based on each feature
for i, col in enumerate(['Income', 'CCAvg', 'Mortgage']):
    print('=' * 30, f"Mean of {col} in each categorical feature", '=' * 30)
    for j, cat in enumerate(discrete_cols2):
        fig, ax = plt.subplots(1, 2, figsize=(10, 4))
        gp = df.groupby([cat])[col].mean().to_frame().reset_index()
        sns.barplot(data=gp, x=cat, y=col, ax=ax[0])
        for container in ax[0].containers:
            ax[0].bar_label(container)
        ax[0].set_title(f'Mean of {col} (based on {cat})', y=1.09, **FONT)
        sns.boxplot(data=df, x=cat, y=col, ax=ax[1])
        ax[1].set_title(f'Boxplot of {cat} (Fig {i+11}-{j+1})', y=1.09, **FONT)
        plt.show()

⬛️ 3d Plot

continuous_cols = ['Age', 'Experience', 'CCAvg', 'Mortgage']

# px is plotly.express and pio is plotly.io
for i, col in enumerate(continuous_cols):
    fig = px.scatter_3d(
        data_frame=df,
        x=df.Income,
        y=df[col],
        z=df['Personal Loan'],
        color=df['Personal Loan'].astype(str),
        color_discrete_map={'1': 'orange', '0': 'red'},
        template='ggplot2',
        hover_name='Age',
        opacity=0.6,
        height=700,
        title=f'3D scatter of features based on Personal Loan (Fig {i+1})')
    fig.update_layout(
        title_font=dict(color='orange', family='newtimeroman', size=25),
        title_x=0.45,
        paper_bgcolor='#145A32',
        font=dict(color='#DAF7A6', family='newtimeroman', size=16),
    )
    pio.show(fig)

🟩 Pie Chart

# Pie chart of the ten most frequent restaurant types
df["Type Of Restaurant"].value_counts()[:10].plot.pie(figsize=(10, 10), autopct='%1.0f%%')
plt.title("Pie Chart")
plt.xticks(rotation=90)
plt.show()

🎲 Pie Chart for selected Value

# Pie chart restricted to the 20 most frequent cities
df['city_1'].value_counts().nlargest(n=20, keep='first').plot.pie(figsize=(10, 10), autopct='%1.0f%%')
plt.title("Pie Chart")
plt.xticks(rotation=90)
plt.show()

📕 Kdeplot

plt.figure(figsize=(10, 5))
sns.set_context("paper")

kdeplt = sns.kdeplot(
    data=heart_dft_chol_n0,
    x="Cholesterol",
    hue="Sex",
    palette=sex_color,
    alpha=0.7,
    lw=2,
)
kdeplt.set_title("Cholesterol values distribution\n Male VS Female", fontsize=12)
kdeplt.set_xlabel("Cholesterol", fontsize=12)
# Dashed vertical lines mark the mean cholesterol of each group
plt.axvline(x=Chol_mean_f, color="#c90076", ls="--", lw=1.3)
plt.axvline(x=Chol_mean_m, color="#2986cc", ls="--", lw=1.3)
plt.text(108, 0.00612, "Mean Cholesterol / Male", fontsize=10, color="#2986cc")
plt.text(260, 0.006, "Mean Cholesterol / Female", fontsize=10, color="#c90076")
plt.show()

🟢 Regplot

heart_df_fg = sns.FacetGrid(
    data=heart_dft_chol_n0,
    col="Sex",
    hue="Sex",
    row="HeartDisease",
    height=4,
    aspect=1.3,
    palette=sex_color,
    col_order=["Male", "Female"],
)
# Draw a scatter plot with a regression line in every facet
heart_df_fg.map_dataframe(sns.regplot, "Age", "MaxHR")
plt.show()

🟥 Histplot

mean_SalePrice = usa_housing_df[["SalePrice"]].mean().squeeze()
median_SalePrice = usa_housing_df[["SalePrice"]].median().squeeze()

plt.figure(figsize=(10, 5))
sns.set_context("paper")

histplt = sns.histplot(
    data=usa_housing_df,
    x="SalePrice",
    color="#4f758f",
    bins=60,
    alpha=0.5,
    lw=2,
)
histplt.set_title("SalePrice Distribution", fontsize=12)
histplt.set_xlabel("SalePrice", fontsize=12)

# Dashed vertical lines mark the mean and the median
plt.axvline(x=mean_SalePrice, color="#14967f", ls="--", lw=1.5)
plt.axvline(x=median_SalePrice, color="#9b0f33", ls="--", lw=1.5)
plt.text(mean_SalePrice + 5000, 175, "Mean SalePrice", fontsize=9, color="#14967f")
plt.text(median_SalePrice - 115000, 175, "Median SalePrice", fontsize=9, color="#9b0f33")
histplt.xaxis.set_major_formatter(ticker.EngFormatter())  # ticker is matplotlib.ticker
plt.ylim(0, 200)
plt.show()

📓 Boxplot

df2 = titanic[['Survived', 'Pclass', 'Sex', 'Embarked', 'SibSp', 'Parch', "Age"]]

fig, axes = plt.subplots(1, 2)
fig.set_figheight(10)
fig.set_figwidth(20)
# One panel per text column; whis=[0, 100] stretches the whiskers to the minimum and maximum
for i, col in enumerate(df2.select_dtypes('object')):
    sns.boxplot(x="Age", y=col, data=df2, whis=[0, 100], width=.6, ax=axes[i])

Boxplot & Histogram

df2 = titanic[['Survived', 'Pclass', 'Sex', 'Embarked', 'SibSp', 'Parch', "Age"]]

# create the subplots
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)})

# title
ax_box.title.set_text('Age Histogram and Boxplot')

# assigning a graph to each ax
sns.boxplot(x=df2["Age"], orient="h", ax=ax_box)
sns.histplot(data=df2, x="Age", ax=ax_hist)

# Remove x axis name for the boxplot
ax_box.set(xlabel='')
plt.show()

🟢 Histplot for multiple features

NUMERICAL = wine[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']]
fig, axes = plt.subplots(3, 4)  # 12 panels for the 11 features
fig.set_figheight(12)
fig.set_figwidth(16)
for i, col in enumerate(NUMERICAL):
    ax = axes[i // 4, i % 4]
    sns.histplot(wine[col], ax=ax, kde=True)
    ax.axvline(wine[col].mean(), color='k', linestyle='dashed', linewidth=1)  # dashed line at the mean
axes[-1, -1].set_visible(False)  # hide the unused last panel

🔰 Scatterplot

fig, axes = plt.subplots(1, 3) fig.set_figheight(7) fig.set_figwidth(20) sns.scatterplot(data=titanic, x="Age", y="Fare", hue="Survived", size="Survived", ax=axes[0]) sns.scatterplot(data=titanic, x="Age", y="Fare", hue="Pclass", size="Pclass", ax=axes[1]) sns.scatterplot(data=titanic, x="Age", y="Fare", hue="SibSp", size="SibSp", ax=axes[2]);


Groupby plot

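# grey bars, with the third and eleventh bars highlighted in orange
color = ['grey'] * 12
color[2], color[10] = 'orange', 'orange'
# select the column before averaging so non-numeric columns are never aggregated
df.groupby('month')['active_power'].mean().plot(kind='bar', title='Average Active Power by Month', color=color, rot=0)
plt.ylabel('Active Power [kW]');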


🛑 Line and Scatter Plot

plt.title('Actual Power vs Theoretical Power')
plt.plot(df.theor_power, df.active_power, 'o', markersize=1)
plt.grid(True)
plt.xlabel('Theoretical Power (kW)')
plt.ylabel('Actual Power (kW)')
# reference line y = x: points on it produce exactly the theoretical power
plt.plot([0, 3650], [0, 3650], '-', c='k')
plt.show()


📚 Groupby Line plot

# mean load for each hour of the day, one column per year (assumes df_demand has a DatetimeIndex)
year_demands = df_demand['load'].groupby([df_demand.index.hour, df_demand.index.year]).mean().unstack()
fig, axs = plt.subplots(1, 1, figsize=(12, 5))
year_demands.plot(ax=axs)
axs.set_xlabel('Hour of the day')
axs.set_ylabel('Energy Demanded MWh')
axs.set_title('Mean yearly energy demand by hour of the day');
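The chart title describes one line per year, each giving the mean load at every hour of the day. A plain-Python sketch of that aggregation, using made-up hourly readings (the data and variable names are assumptions for illustration only):

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Made-up hourly readings: two full days in each of two years
readings = []
for start in (datetime(2019, 1, 1), datetime(2020, 1, 1)):
    for h in range(48):
        ts = start + timedelta(hours=h)
        load = 100 + ts.hour + (10 if ts.year == 2020 else 0)
        readings.append((ts, load))

# Accumulate [sum, count] for every (year, hour-of-day) pair
sums = defaultdict(lambda: [0.0, 0])
for ts, load in readings:
    bucket = sums[(ts.year, ts.hour)]
    bucket[0] += load
    bucket[1] += 1

mean_by_year_hour = {key: total / n for key, (total, n) in sums.items()}

print(mean_by_year_hour[(2019, 5)], mean_by_year_hour[(2020, 5)])
```

Each year then contributes 24 values, one per hour, which is what the x-axis label "Hour of the day" implies.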


💾 Stacked Histplot

fig, ax = plt.subplots(1, 3, figsize=(14, 4))
palette = sns.color_palette(["yellow", "green"])
# one stacked age histogram per passenger class
for i, pclass in enumerate([1, 2, 3]):
    sns.histplot(data=train_data.loc[train_data["Pclass"] == pclass], x="Age", hue="Survived", binwidth=5, ax=ax[i], palette=palette, multiple="stack").set_title(f"{pclass}-Pclass")
plt.show()


🟥 Plotting the distributions of the numerical variables

color_plot = ['#de972c','#74c91e','#1681de','#e069f5','#f54545','#f0ea46','#7950cc']

fig, ax = plt.subplots(4, 2, figsize=(20, 20))
columns = ['HeartDisease', 'Oldpeak', 'Age', 'FastingBS', 'RestingBP', 'Cholesterol', 'MaxHR']
# fill=True is the current name for the deprecated shade=True
for col, axis in zip(columns, ax.flat):
    sns.kdeplot(df[col], color=np.random.choice(color_plot), ax=axis, fill=True)
fig.delaxes(ax[3][1])  # the eighth panel is unused

🟫 Heatmap Correlation plot

hm = df.drop('id', axis=1)
corr = hm.corr()
# hide the upper triangle and diagonal; np.bool was removed from NumPy, so use the built-in bool
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

plt.suptitle('Correlation', size=20, weight='bold')

# heatmap is assumed to be a list of colors defined earlier
ax = sns.heatmap(corr, linewidths=0.9, linecolor='white', cbar=True, mask=mask, cmap=heatmap)

ax.annotate('Low Correlation', fontsize=10, fontweight='bold', xy=(1.3, 3.5), xycoords='data', xytext=(0.6, 0.95), textcoords='axes fraction', arrowprops=dict(facecolor=heatmap[0], shrink=0.025, connectionstyle='arc3, rad=0.50'), horizontalalignment='left', verticalalignment='top')

ax.annotate('High Correlation', fontsize=10, fontweight='bold', xy=(3.3, 7.5), xycoords='data', xytext=(0.8, 0.4), textcoords='axes fraction', arrowprops=dict(facecolor=heatmap[0], shrink=0.025, connectionstyle='arc3, rad=-0.6'), horizontalalignment='left', verticalalignment='top')
plt.show()
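A correlation matrix is symmetric, so the mask hides the upper triangle and the diagonal and leaves each pair of features visible once. A dependency-free sketch of which cells the mask covers, for an assumed 4 x 4 matrix:

```python
n = 4  # assumed number of features, for illustration only

# True means "hide this cell": the diagonal and everything above it
mask = [[col >= row for col in range(n)] for row in range(n)]

# Cells that remain visible: strictly below the diagonal
visible = [(row, col) for row in range(n) for col in range(n) if not mask[row][col]]

for row in mask:
    print(" ".join("x" if hidden else "." for hidden in row))
print("visible cells:", len(visible))
```

With n features, n * (n - 1) / 2 cells stay visible, one for each distinct pair.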


📊 Boxplot with threshold

fig = plt.figure(figsize=(8, 6))
ax = fig.add_axes([0, 0, 1, 1])
sns.boxplot(ax=ax, data=df, x='TARGET', y='LDH')  # optional: flierprops=dict(marker='o', markersize=6), fliersize=2

# horizontal threshold lines
ax.axhline(y=550, color='b')
ax.axhline(y=650, color='orange')
ax.axhline(y=1200, color='g')


📊 Barplot with Percentage

plt.suptitle('Target Variable', size = 20, weight='bold')

song_popularity = df['song_popularity'].map({0:'UnPopular', 1:'Popular'})

a = sns.countplot(data=df, x=song_popularity, palette=theme)
plt.tick_params(axis="x", colors=theme[0], labelsize=15)

# label each bar with its share of all rows
for p in a.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    a.annotate(f'{height/df.shape[0]*100} %', (x + width/2, y + height*1.02), ha='center')

plt.show()
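The label placed on each bar is the bar's count divided by the total number of rows, times 100. A minimal sketch of that calculation on made-up labels, with one decimal place added for readability (the sample data is an assumption):

```python
from collections import Counter

# Made-up target labels, standing in for the mapped song_popularity column
labels = ["Popular", "UnPopular", "UnPopular", "Popular",
          "UnPopular", "UnPopular", "UnPopular", "Popular"]

counts = Counter(labels)
total = sum(counts.values())

# One formatted percentage string per category
percent_labels = {name: f"{count / total * 100:.1f} %" for name, count in counts.items()}

print(percent_labels)
```

The `:.1f` format keeps the annotation short when the share is not a round number.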


🔵 KDE plot

cont = ['song_duration_ms', 'acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'audio_valence']
cat = ['key', 'audio_mode', 'time_signature']

a = 4  # number of rows
b = 3  # number of columns
c = 1  # initialize plot counter

plt.figure(figsize=(18, 18))
plt.suptitle('Distribution of Features', size=20, weight='bold')

for i in cont:
    plt.subplot(a, b, c)
    # fill=True is the current name for the deprecated shade=True
    A = sns.kdeplot(data=df, x=i, hue=song_popularity, palette=theme[:-2], linewidth=1.3, fill=True, alpha=0.35)
    plt.title(i)
    plt.xlabel(" ")
    c = c + 1


🟩 Bivariate KDE plot

# plotting
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 9))
fig.suptitle(' Highest and Lowest Correlation ', size=20, weight='bold')
axs = [ax1, ax2]

# kdeplot
sns.kdeplot(data=df, y='energy', x='acousticness', ax=ax1, color=heatmap[0])
ax1.set_title('Energy vs Acousticness', size=14, weight='bold', pad=20)

# kdeplot
sns.kdeplot(data=df, y='energy', x='loudness', ax=ax2, color=heatmap[4])
ax2.set_title('Energy vs Loudness', size=14, weight='bold', pad=20);


⬛️ Count plot with specific selection

# Parameters for plots
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['axes.edgecolor'] = 'black'
plt.rcParams['axes.linewidth'] = 1.5
plt.rcParams['figure.frameon'] = True
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.right'] = False
plt.rcParams["font.family"] = "monospace";

# Colors for charts
colors = ["#e9d9c8","#cca383","#070c23","#f82d06","#e8c195","#cd7551","#a49995","#a3a49c","#6c7470"]
sns.palplot(sns.color_palette(colors))

# Plot
A = sns.countplot(x=train_df['case_num'], color=colors[1], edgecolor='white', linewidth=1.5, saturation=1.5)

# Patches: find the tallest bar and recolor it
patch_h = [patch.get_height() for patch in A.patches]
idx_tallest = np.argmax(patch_h)
A.patches[idx_tallest].set_facecolor(colors[3])

# Labels
plt.ylabel('Count', weight='semibold', fontname='Georgia')
plt.xlabel('Cases', weight='semibold', fontname='Georgia')
plt.suptitle('Number of Cases', fontname='Georgia', weight='bold', size=18, color=colors[2])
A.bar_label(A.containers[0], label_type='edge')

plt.show()


🔵 Pie Chart

fig, ax = plt.subplots(ncols=3, figsize=(18,6))

colors = [['#ADEFD1FF', '#00203FFF'], ['#97BC62FF', '#2C5F2D'], ['#F5C7B8FF', '#FFA177FF']]
explode = [0, 0.2]
columns = ['Parking', 'Warehouse', 'Elevator']
# one pie per column; explode and colors assume each column has two categories
for i in range(3):
    data = df[columns[i]].value_counts()
    ax[i].pie(data, labels=data.values, explode=explode, colors=colors[i], shadow=True)
    ax[i].legend(labels=data.index, fontsize='large')
    ax[i].set_title('{} distribution'.format(columns[i]))


🟢 Histplot

def plot_hist(feature):
    fig, ax = plt.subplots(2, 1, figsize=(17, 12))
    values = titanic[feature]
    mode = values.mode()[0]  # pandas mode ignores missing values and needs no extra import

    sns.histplot(data=values, kde=True, ax=ax[0], color="Brown")

    ax[0].axvline(x=values.mean(), color='r', linestyle='--', linewidth=2, label='Mean: {}'.format(round(values.mean(), 3)))
    ax[0].axvline(x=values.median(), color='orange', linewidth=2, label='Median: {}'.format(round(values.median(), 3)))
    ax[0].axvline(x=mode, color='yellow', linewidth=2, label='Mode: {}'.format(mode))
    ax[0].legend()

    sns.boxplot(x=values, ax=ax[1], color="Brown")
    plt.show()

plot_hist('Age')


🟧 Barplot with Lineplot

# mean fare per port of embarkation, highest first
mean_fare = titanic.groupby('Embarked')['Fare'].mean().sort_values(ascending=False).head(15)
plt.figure(figsize=(12, 5))
plt.title('Mean fare by port of embarkation')
plt.ylabel('Fare')
# draw the same series twice: once as a line with markers, once as bars
mean_fare.plot(kind='line', marker='*', color='red', ms=10)
mean_fare.plot(kind='bar', color=sns.color_palette("inferno_r", 7))
plt.show()


🟢 Scatterplot with Marker

import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x=df.iloc[:,0], y=df.iloc[:,1], hue=y)
# each annotation: label, point to mark, text position, arrow style
plt.annotate("KD65", (df.iloc[64,0], df.iloc[64,1]), (8*1e6, 1), arrowprops=dict(arrowstyle="->"), fontsize="xx-large", c='red')
plt.annotate("KD99", (df.iloc[98,0], df.iloc[98,1]), (8*1e6, 2*1e6), arrowprops=dict(arrowstyle="->"), fontsize="xx-large", c='red')
plt.annotate("control3", (df.iloc[107,0], df.iloc[107,1]), (8*1e6, 3*1e6), arrowprops=dict(arrowstyle="->"), fontsize="xx-large", c='red')
plt.annotate("control13", (df.iloc[117,0], df.iloc[117,1]), (8*1e6, 4*1e6), arrowprops=dict(arrowstyle="->"), fontsize="xx-large", c='red')
