Understand Data Analyst
Data Analyst Recap
1) Exploratory Data Analysis (EDA)
Data Exploration
Checking Data Types & Missing Values
import pandas as pd
books = pd.read_csv("books.csv")
print(books.head()) # View first 5 rows
books.info() # Summary of dataset
Descriptive Statistics
# For numerical columns
books.describe() # Summary statistics
# For categorical columns
books["genre"].value_counts()
Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(data=books, x="rating", binwidth=0.1)
plt.show()
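Alongside histograms of numeric columns, a count plot is a quick way to eyeball the distribution of a categorical column. A minimal sketch, assuming the same books DataFrame and its genre column from the snippets above:
# Count of books per genre
sns.countplot(data=books, x="genre")
plt.xticks(rotation=45)
plt.show()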
Data Cleaning & Imputation
Handling Missing Values
# Check for missing values
print(books.isnull().sum())
# Drop rows with any missing value
books.dropna(inplace=True)
# Or drop rows only where "rating" is missing
books.dropna(subset=["rating"], inplace=True)
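A middle ground between these two is to drop missing values only in columns where few rows are affected. The sketch below uses a 5% threshold, which is an assumption chosen for illustration, not a fixed rule:
# Drop rows with missing values only for columns missing under ~5% of rows
threshold = len(books) * 0.05
cols_to_drop = books.columns[books.isnull().sum() <= threshold]
books.dropna(subset=cols_to_drop, inplace=True)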
# Impute missing values (assigning back avoids chained inplace, which newer pandas discourages)
books["rating"] = books["rating"].fillna(books["rating"].mean())
books["genre"] = books["genre"].fillna(books["genre"].mode()[0])
Handling Categorical Variables
# Check unique authors
print(books["author"].nunique())
# Check for authors starting with "Brown"
books["author"].str.contains("^Brown", case=False)
# Check for specific genres
books["genre"].str.contains("Thriller|Mystery", case=False)
Creating a categorical column
books["is_highly_rated"] = books["rating"] > 4.5
Handling Outliers
# Using IQR method
Q1 = books["rating"].quantile(0.25)
Q3 = books["rating"].quantile(0.75)
IQR = Q3 - Q1
upper = Q3 + 1.5 * IQR
lower = Q1 - 1.5 * IQR
outliers = books[(books["rating"] < lower) | (books["rating"] > upper)]
print(outliers)
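Once flagged, outliers are often removed by keeping only rows inside the fences. A short sketch continuing from the lower and upper bounds computed above:
# Keep only books whose rating falls within the IQR fences
books = books[(books["rating"] >= lower) & (books["rating"] <= upper)]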
Relationships in Data
Handling DateTime data
books["year"] = pd.to_datetime(books["year"], format="%Y")
books["year"].dt.year # Extract year
Correlation Analysis
# Correlation matrix of the numeric columns (numeric_only avoids errors from string columns in newer pandas)
books.corr(numeric_only=True)
sns.heatmap(books.corr(numeric_only=True), annot=True)
plt.show()
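Correlation coefficients only summarize linear relationships, so it helps to also look at the raw pairwise scatter. A quick sketch using seaborn's pairplot on the same DataFrame:
# Pairwise scatter plots and histograms of the numeric columns
sns.pairplot(data=books)
plt.show()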
KDE Plot
# Limit smoothing to the observed data range so marriage duration isn't smoothed below zero
sns.kdeplot(data=divorce, x="marriage_duration", hue="num_kids", cut=0)
plt.show()
# Show a cumulative distribution function instead of a density
sns.kdeplot(data=divorce, x="marriage_duration", hue="num_kids", cut=0, cumulative=True)
plt.show()
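A scatter plot of the same two variables gives the unsmoothed view of that relationship. A minimal sketch, assuming the same divorce DataFrame and columns as above:
# Raw relationship between marriage duration and number of kids
sns.scatterplot(data=divorce, x="marriage_duration", y="num_kids")
plt.show()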
EDA in action
Class imbalance
# Print the relative frequency of Job_Category
print(salaries["Job_Category"].value_counts(normalize=True))
Cross-tabulation
# Cross-tabulate Job_Category and Company_Size, showing the mean salary per combination
print(pd.crosstab(salaries["Job_Category"], salaries["Company_Size"],
                  values=salaries["Salary_USD"], aggfunc="mean"))
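Without the values and aggfunc arguments, pd.crosstab simply counts the combinations, which is useful for spotting sparse cells before trusting the means above:
# Raw counts of Job_Category by Company_Size, with row/column totals
print(pd.crosstab(salaries["Job_Category"], salaries["Company_Size"], margins=True))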
Generating new features
# Get the month of the response
salaries["month"] = salaries["date_of_response"].dt.month
# Extract the weekday of the response
salaries["weekday"] = salaries["date_of_response"].dt.weekday
# Create a heatmap of correlations between the numeric columns
sns.heatmap(salaries.corr(numeric_only=True), annot=True)
plt.show()
# Find the 25th percentile
twenty_fifth = salaries["Salary_USD"].quantile(0.25)
# Save the median
salaries_median = salaries["Salary_USD"].median()
# Gather the 75th percentile
seventy_fifth = salaries["Salary_USD"].quantile(0.75)
print(twenty_fifth, salaries_median, seventy_fifth)
# Create salary labels
salary_labels = ["entry", "mid", "senior", "exec"]
# Create the salary ranges list
salary_ranges = [0, twenty_fifth, salaries_median, seventy_fifth, salaries["Salary_USD"].max()]
# Create salary_level
# pd.cut() creates bins for salary ranges
salaries["salary_level"] = pd.cut(salaries["Salary_USD"],
bins=salary_ranges,
labels=salary_labels)
# Plot the count of salary levels at companies of different sizes
sns.countplot(data=salaries, x="Company_Size", hue="salary_level")
plt.show()
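To put numbers on what the count plot shows, the same pairing can be summarized as proportions. A short sketch continuing from the salary_level column created above:
# Share of each salary level within each company size
print(pd.crosstab(salaries["Company_Size"], salaries["salary_level"], normalize="index"))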