
Olist Intelligence — Advanced SQL Analytics (Personal Project)

End-to-end SQL analytics pipeline built in MySQL 8, covering data engineering, relational modeling, business KPI analysis, and cohort retention analytics.

Dat Tran — Data Analyst / SQL Developer

Tags: MySQL 8 · SQL · Data Modeling · Analytics Engineering · Cohort Analysis
View on GitHub

Project Overview

This project simulates a real-world analytics workflow using the Brazilian Olist E-Commerce dataset. The objective was to build a production-style SQL pipeline that transforms raw CSV files into a structured relational database and delivers meaningful business insights through advanced SQL analysis.

Key Goals

  • Build relational schema from raw data
  • Validate data quality before analysis
  • Design performant analytical views
  • Create core business KPI queries
  • Implement advanced cohort retention analysis

Architecture

The pipeline follows a structured flow from raw data ingestion to business insight delivery:

📁 CSV Files → 🔧 MySQL Workbench Import → ⚙️ Phase 1: Data Engineering → 📊 Phase 2: Analytical Layer → 💡 Business Insights

Important: Tables were initially created using MySQL Workbench Import Wizard to generate schema structures. After structure creation, tables were emptied and data was reloaded using LOAD DATA LOCAL INFILE to simulate production bulk ingestion.

Phase 1

Data Engineering

Bootstrap & Bulk Load

The data engineering phase established the foundation for all subsequent analysis:

  • Created database olist
  • Enabled LOCAL INFILE for high-performance ingestion
  • Truncated tables generated by Import Wizard
  • Loaded large CSV datasets via SQL bulk load
  • Temporarily disabled FK and unique checks to optimize load performance

Performance Optimization Settings
SET FOREIGN_KEY_CHECKS=0;
SET UNIQUE_CHECKS=0;
SET AUTOCOMMIT=0;

LOAD DATA LOCAL INFILE 'olist_orders_dataset.csv'
INTO TABLE orders
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

COMMIT;
SET AUTOCOMMIT=1;
SET UNIQUE_CHECKS=1;
SET FOREIGN_KEY_CHECKS=1;

Tables Loaded

customers
orders
order_items
products
order_payments
order_reviews
sellers
geolocation
product_category_name_translation

Data Quality & Sanity Checks

Extensive validation was performed before modeling relationships:

  • Duplicate detection across identifiers
  • Orphan detection across relational joins
  • Numeric range validation (price, freight, payments)
  • Timestamp normalization
  • Safe TEXT → DATETIME conversion
  • Timeline validation (delivery vs purchase)
  • Orders-without-items integrity check

Orphan Detection Query
-- Orphan order_items (no matching order)
SELECT COUNT(*) AS orphan_order_items
FROM order_items oi 
LEFT JOIN orders o ON o.order_id = oi.order_id
WHERE o.order_id IS NULL;

-- Orphan orders (no matching customer)
SELECT COUNT(*) AS orphan_orders_customers
FROM orders o 
LEFT JOIN customers c ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL;
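
Duplicate detection used the same aggregate style; a representative check on customers (any table's identifier column can be substituted):

```sql
-- Duplicate customer_id values (should return no rows)
SELECT customer_id, COUNT(*) AS cnt
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;
```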

Safe datetime conversion used the add → populate → validate → swap pattern to ensure data integrity during type transformations.

Safe DATETIME Conversion Pattern
-- Step 1: Add new DATETIME columns
ALTER TABLE orders
  ADD COLUMN order_purchase_dt DATETIME NULL;

-- Step 2: Populate with parsed values
UPDATE orders
SET order_purchase_dt = STR_TO_DATE(
  order_purchase_timestamp, '%Y-%m-%d %H:%i:%s'
);

-- Step 3: Validate conversion (should be 0)
SELECT SUM(
  order_purchase_timestamp IS NOT NULL 
  AND order_purchase_dt IS NULL
) AS bad_parse;

-- Step 4: Swap columns
ALTER TABLE orders
  DROP COLUMN order_purchase_timestamp,
  RENAME COLUMN order_purchase_dt TO order_purchase_timestamp;
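
Timeline validation then compared the converted timestamps directly; a sketch using the delivery and purchase columns above:

```sql
-- Orders delivered before they were purchased indicate broken timelines (expect 0)
SELECT COUNT(*) AS bad_timelines
FROM orders
WHERE order_delivered_customer_date < order_purchase_timestamp;
```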

Relationship Modeling

Converted TEXT keys into indexed VARCHAR fields and established relational integrity:

Primary Keys

  • customers(customer_id)
  • orders(order_id)
  • products(product_id)
  • sellers(seller_id)

Composite Keys

  • order_items(order_id, order_item_id)
  • order_payments(order_id, payment_sequential)
  • order_reviews(order_id, review_id)

Foreign Key Relationships

  • orders → customers
  • order_items → orders / products / sellers
  • order_payments → orders
  • order_reviews → orders

Design Decision: A foreign key to the category translation table was intentionally omitted, since category values without matching translation rows would break the constraint. This reflects pragmatic production database design.
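
The keys listed above map to DDL along these lines (constraint names are illustrative):

```sql
-- Primary and composite keys
ALTER TABLE customers   ADD PRIMARY KEY (customer_id);
ALTER TABLE orders      ADD PRIMARY KEY (order_id);
ALTER TABLE order_items ADD PRIMARY KEY (order_id, order_item_id);

-- Example foreign keys
ALTER TABLE orders
  ADD CONSTRAINT fk_orders_customers
  FOREIGN KEY (customer_id) REFERENCES customers (customer_id);

ALTER TABLE order_items
  ADD CONSTRAINT fk_items_orders
  FOREIGN KEY (order_id) REFERENCES orders (order_id);
```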

Phase 2

Analytical SQL

Enriched Views

Created semantic views to simplify analysis and reduce repeated joins:

vw_orders_enriched
CREATE OR REPLACE VIEW vw_orders_enriched AS
SELECT
    o.order_id,
    o.customer_id,
    o.order_status,
    o.order_purchase_timestamp,
    o.order_delivered_customer_date,
    o.order_estimated_delivery_date,
    c.customer_unique_id,
    c.customer_city,
    c.customer_state
FROM orders o
LEFT JOIN customers c
  ON c.customer_id = o.customer_id;

vw_order_items_enriched
CREATE OR REPLACE VIEW vw_order_items_enriched AS
SELECT
    oi.order_id,
    oi.order_item_id,
    oi.product_id,
    oi.seller_id,
    oi.price,
    oi.freight_value,
    p.product_category_name,
    s.seller_city,
    s.seller_state
FROM order_items oi
LEFT JOIN products p ON p.product_id = oi.product_id
LEFT JOIN sellers s ON s.seller_id = oi.seller_id;

Views create a semantic layer that abstracts complex joins and provides a clean interface for business analysts and downstream reporting.
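
Downstream queries can then read from the semantic layer instead of repeating joins, for example:

```sql
-- Delivered orders per customer state, via the semantic layer
SELECT customer_state, COUNT(*) AS delivered_orders
FROM vw_orders_enriched
WHERE order_status = 'delivered'
GROUP BY customer_state
ORDER BY delivered_orders DESC;
```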

Performance Indexing

Strategic indexes added to optimize analytical workloads:

Index Strategy
orders(customer_id)              -- customer lookup
orders(order_purchase_timestamp) -- time-series queries
order_items(order_id)            -- order-item joins
order_items(product_id)          -- product analysis
order_payments(order_id)         -- payment aggregation
order_reviews(order_id)          -- review correlation

Indexing rationale: Optimized for time-series queries, heavy joins, and analytics workload patterns typical in e-commerce analysis.
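
In MySQL 8 the strategy above becomes CREATE INDEX statements (index names are illustrative):

```sql
CREATE INDEX idx_orders_customer ON orders (customer_id);
CREATE INDEX idx_orders_purchase ON orders (order_purchase_timestamp);
CREATE INDEX idx_items_order     ON order_items (order_id);
CREATE INDEX idx_items_product   ON order_items (product_id);
CREATE INDEX idx_payments_order  ON order_payments (order_id);
CREATE INDEX idx_reviews_order   ON order_reviews (order_id);
```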

Core Business Metrics

Reusable analytical SQL modules were developed covering:

  • Revenue & GMV calculations
  • Orders and customer behavior
  • Monthly trends and AOV
  • Category and product performance
  • Seller performance
  • Payment behavior
  • Delivery logistics KPIs
  • Customer geography insights

Monthly AOV (Average Order Value)
SELECT
    DATE_FORMAT(o.order_purchase_timestamp, '%Y-%m') AS ym,
    AVG(t.order_gmv) AS avg_order_value
FROM orders o
JOIN (
    SELECT order_id, SUM(price + freight_value) AS order_gmv
    FROM order_items
    GROUP BY order_id
) t ON t.order_id = o.order_id
GROUP BY ym
ORDER BY ym;

Late Delivery Rate by State
SELECT 
    c.customer_state,
    COUNT(*) AS delivered_orders,
    SUM(o.order_delivered_customer_date > 
        o.order_estimated_delivery_date) AS late_deliveries,
    ROUND(100 * SUM(o.order_delivered_customer_date > 
        o.order_estimated_delivery_date) / COUNT(*), 2) AS late_rate_pct
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_delivered_customer_date IS NOT NULL
  AND o.order_estimated_delivery_date IS NOT NULL
GROUP BY c.customer_state
HAVING COUNT(*) >= 200
ORDER BY late_rate_pct DESC;

Advanced

Cohort Retention Analysis

Customers are grouped by their first purchase month, then monthly activity is tracked to measure long-term engagement and loyalty.

Cohort Metrics

  • cohort_ym: Year-month of the customer's first purchase
  • month_index: Months since first purchase (0, 1, 2, ...)
  • active_customers: Customers who purchased in this month
  • cohort_customers: Total customers in the cohort
  • retention_pct: Percentage of the cohort still active

Business Value

Revenue shows growth, but retention shows loyalty and marketplace health. Cohort analysis reveals whether customers return after their first purchase — a critical metric for sustainable e-commerce growth.

Cohort Retention Query (CTE-based)
WITH first_purchase AS (
    SELECT
        c.customer_unique_id,
        MIN(DATE(o.order_purchase_timestamp)) AS first_order_date
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY c.customer_unique_id
),

cohorts AS (
    SELECT
        customer_unique_id,
        DATE_FORMAT(first_order_date, '%Y-%m') AS cohort_ym
    FROM first_purchase
),

cohort_size AS (
    SELECT cohort_ym, COUNT(DISTINCT customer_unique_id) AS cohort_customers
    FROM cohorts
    GROUP BY cohort_ym
),

cohort_activity AS (
    SELECT
        ch.cohort_ym,
        PERIOD_DIFF(
            REPLACE(DATE_FORMAT(o.order_purchase_timestamp,'%Y-%m'),'-',''),
            REPLACE(ch.cohort_ym,'-','')
        ) AS month_index,
        COUNT(DISTINCT c.customer_unique_id) AS active_customers
    FROM cohorts ch
    JOIN customers c ON c.customer_unique_id = ch.customer_unique_id
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY ch.cohort_ym, month_index
)

SELECT
    ca.cohort_ym,
    ca.month_index,
    ca.active_customers,
    cs.cohort_customers,
    ROUND(100 * ca.active_customers / cs.cohort_customers, 2) AS retention_pct
FROM cohort_activity ca
JOIN cohort_size cs USING (cohort_ym)
WHERE ca.month_index >= 0
ORDER BY ca.cohort_ym, ca.month_index;

Technical Highlights

  • High-performance CSV ingestion
  • 🔄 Professional ETL workflow
  • 🔗 Real relational PK/FK modeling
  • 📈 Index-aware analytical SQL
  • 👥 Cohort temporal logic
  • 🏭 Production-style pipeline

Key Learnings

🔍 Data Quality First

Data quality validation is critical before relationship modeling. Orphan records and duplicates can break foreign key constraints.

📊 Cohort Depth

Cohort analysis reveals deeper customer behavior than revenue alone. Retention metrics expose marketplace health.

🧩 Semantic Abstraction

Views create a reusable analytical abstraction layer, simplifying downstream queries and reducing error-prone joins.

⚡ Index Impact

Indexing decisions significantly affect query performance. Strategic indexes on join and filter columns are essential.

How to Reproduce

Run the SQL scripts in the following order:

  • 01_phase1_bootstrap_load.sql
  • 02_phase1.2_dataquality_sanitycheck.sql
  • 03_phase1.3_create_relationships.sql
  • 04_Phase2.1_create_views.sql
  • 05_Phase2.2_core_business_metrics.sql
  • 06_Phase2.3_cohort_analysis.sql
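
Assuming a local MySQL 8 server, the mysql CLI, and that each script selects the olist schema itself, the sequence can be run like this (--local-infile=1 enables the bulk LOAD DATA steps):

```shell
# Run each phase script in order against the local server
for f in 01_phase1_bootstrap_load.sql \
         02_phase1.2_dataquality_sanitycheck.sql \
         03_phase1.3_create_relationships.sql \
         04_Phase2.1_create_views.sql \
         05_Phase2.2_core_business_metrics.sql \
         06_Phase2.3_cohort_analysis.sql; do
  mysql --local-infile=1 -u root -p < "$f"
done
```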
