
Understand Data Analyst

Data Analyst Recap

1) Exploratory Data Analysis (EDA)

Data Exploration

Checking Data Types & Missing Values

import pandas as pd
books = pd.read_csv("books.csv")
print(books.head())  # View first 5 rows

books.info()  # Summary of dataset

Descriptive Statistics

# For numerical columns
books.describe() # Summary statistics
# For categorical columns
books["genre"].value_counts()

Data Visualization

import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(data=books, x="rating", binwidth=0.1)
plt.show()

Data Cleaning & Imputation

Handling Missing Values

# Check for missing values
print(books.isnull().sum())

# Drop rows with any missing value
books = books.dropna()

# Or drop only rows where "rating" is missing
books = books.dropna(subset=["rating"])
# Impute missing values (assigning back avoids the deprecated inplace pattern)
books["rating"] = books["rating"].fillna(books["rating"].mean())
books["genre"] = books["genre"].fillna(books["genre"].mode()[0])
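A global mean can distort values when groups differ a lot; a common refinement is to impute within each group. A minimal sketch on a hypothetical mini-dataset (the `books` frame and its columns here are made up for illustration):

```python
import pandas as pd

# Hypothetical mini-dataset: one missing rating per genre
books = pd.DataFrame({
    "genre": ["Fiction", "Fiction", "Thriller", "Thriller"],
    "rating": [4.0, None, 3.0, None],
})

# Fill each missing rating with the mean rating of its own genre
books["rating"] = books.groupby("genre")["rating"].transform(
    lambda s: s.fillna(s.mean())
)
```

`transform` returns a Series aligned to the original index, so the group-wise means slot straight back into the missing positions.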

Handling Categorical Variables

# Check unique authors
print(books["author"].nunique()) 

# Check for authors starting with "Brown"
books["author"].str.contains("^Brown", case=False)  

# Check for specific genres
books["genre"].str.contains("Thriller|Mystery", case=False) 

Creating a Categorical Column

books["is_highly_rated"] = books["rating"] > 4.5

Handling Outliers

# Using IQR method
Q1 = books["rating"].quantile(0.25)
Q3 = books["rating"].quantile(0.75)
IQR = Q3 - Q1

upper = Q3 + 1.5 * IQR
lower = Q1 - 1.5 * IQR
outliers = books[(books["rating"] < lower) | (books["rating"] > upper)]
print(outliers)
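Once the IQR fences are computed, the same condition inverted filters the outliers out. A self-contained sketch with a small made-up ratings column (one deliberately extreme value):

```python
import pandas as pd

# Hypothetical ratings with one obvious outlier
books = pd.DataFrame({"rating": [4.0, 4.2, 4.1, 4.3, 4.2, 0.5]})

Q1 = books["rating"].quantile(0.25)
Q3 = books["rating"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Keep only rows inside the IQR fences
books_no_outliers = books[books["rating"].between(lower, upper)]
```

`between` is inclusive on both ends, matching the complement of the strict `<`/`>` outlier condition above.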

Relationships in Data

Handling DateTime data

books["year"] = pd.to_datetime(books["year"], format="%Y")
books["year"].dt.year  # Extract year

Correlation Analysis

# Restrict to numeric columns (recent pandas raises on strings otherwise)
books.corr(numeric_only=True)
sns.heatmap(books.corr(numeric_only=True), annot=True)
plt.show()

KDE Plot

# Update the KDE plot so that marriage duration can't be smoothed too far
sns.kdeplot(data=divorce, x="marriage_duration", hue="num_kids", cut=0)
plt.show()

# Update the KDE plot to show a cumulative distribution function
sns.kdeplot(data=divorce, x="marriage_duration", hue="num_kids", cut=0, cumulative=True)
plt.show()

EDA in action

Class imbalance

# Print the relative frequency of Job_Category
print(salaries["Job_Category"].value_counts(normalize=True))

Cross-tabulation

# Cross-tabulate Job_Category and Company_Size
print(pd.crosstab(salaries["Job_Category"], salaries["Company_Size"],
            values=salaries["Salary_USD"], aggfunc="mean"))

Generating new features

# Get the month of the response
salaries["month"] = salaries["date_of_response"].dt.month

# Extract the weekday of the response
salaries["weekday"] = salaries["date_of_response"].dt.weekday

# Create a heatmap of numeric-column correlations
sns.heatmap(salaries.corr(numeric_only=True), annot=True)
plt.show()

# Find the 25th percentile
twenty_fifth = salaries["Salary_USD"].quantile(0.25)

# Save the median
salaries_median = salaries["Salary_USD"].median()

# Gather the 75th percentile
seventy_fifth = salaries["Salary_USD"].quantile(0.75)
print(twenty_fifth, salaries_median, seventy_fifth)

# Create salary labels
salary_labels = ["entry", "mid", "senior", "exec"]

# Create the salary ranges list
salary_ranges = [0, twenty_fifth, salaries_median, seventy_fifth, salaries["Salary_USD"].max()]

# Create salary_level
# pd.cut() creates bins for salary ranges
salaries["salary_level"] = pd.cut(salaries["Salary_USD"],
                                  bins=salary_ranges,
                                  labels=salary_labels)

# Plot the count of salary levels at companies of different sizes
sns.countplot(data=salaries, x="Company_Size", hue="salary_level")
plt.show()
This post is licensed under CC BY 4.0 by the author.