
Olist Intelligence — Advanced SQL Analytics (Personal Project)

End-to-end SQL analytics pipeline built in MySQL 8, covering data engineering, relational modeling, business KPI analysis, and cohort retention analytics.

Dat Tran — Data Analyst / SQL Developer

Tags: MySQL 8 · SQL · Data Modeling · Analytics Engineering · Cohort Analysis
View on GitHub

Project Overview

This project simulates a real-world analytics workflow using the Brazilian Olist E-Commerce dataset. The objective was to build a production-style SQL pipeline that transforms raw CSV files into a structured relational database and delivers meaningful business insights through advanced SQL analysis.

Key Goals

  • Build relational schema from raw data
  • Validate data quality before analysis
  • Design performant analytical views
  • Create core business KPI queries
  • Implement advanced cohort retention analysis

Architecture

The pipeline follows a structured flow from raw data ingestion to business insight delivery:

📁 CSV Files → 🔧 MySQL Workbench Import → ⚙️ Phase 1: Data Engineering → 📊 Phase 2: Analytical Layer → 💡 Business Insights

Important: Tables were initially created using MySQL Workbench Import Wizard to generate schema structures. After structure creation, tables were emptied and data was reloaded using LOAD DATA LOCAL INFILE to simulate production bulk ingestion.

Phase 1

Data Engineering

Bootstrap & Bulk Load

The data engineering phase established the foundation for all subsequent analysis:

  • Created database olist
  • Enabled LOCAL INFILE for high-performance ingestion
  • Truncated tables generated by Import Wizard
  • Loaded large CSV datasets via SQL bulk load
  • Temporarily disabled FK and unique checks to optimize load performance

Performance Optimization Settings
SET FOREIGN_KEY_CHECKS=0;
SET UNIQUE_CHECKS=0;
SET AUTOCOMMIT=0;

LOAD DATA LOCAL INFILE 'olist_orders_dataset.csv'
INTO TABLE orders
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

COMMIT;
SET AUTOCOMMIT=1;
SET UNIQUE_CHECKS=1;
SET FOREIGN_KEY_CHECKS=1;

Tables Loaded

customers
orders
order_items
products
order_payments
order_reviews
sellers
geolocation
product_category_name_translation

Data Quality & Sanity Checks

Extensive validation was performed before modeling relationships:

  • Duplicate detection across identifiers
  • Orphan detection across relational joins
  • Numeric range validation (price, freight, payments)
  • Timestamp normalization
  • Safe TEXT → DATETIME conversion
  • Timeline validation (delivery vs purchase)
  • Orders-without-items integrity check

Orphan Detection Query
-- Orphan order_items (no matching order)
SELECT COUNT(*) AS orphan_order_items
FROM order_items oi 
LEFT JOIN orders o ON o.order_id = oi.order_id
WHERE o.order_id IS NULL;

-- Orphan orders (no matching customer)
SELECT COUNT(*) AS orphan_orders_customers
FROM orders o 
LEFT JOIN customers c ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL;
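
Duplicate detection used the same aggregate style; a representative check on customers (any table's identifier column can be substituted):

```sql
-- Duplicate customer_id values (should return no rows)
SELECT customer_id, COUNT(*) AS cnt
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;
```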

Safe datetime conversion used the add → populate → validate → swap pattern to ensure data integrity during type transformations.

Safe DATETIME Conversion Pattern
-- Step 1: Add new DATETIME columns
ALTER TABLE orders
  ADD COLUMN order_purchase_dt DATETIME NULL;

-- Step 2: Populate with parsed values
UPDATE orders
SET order_purchase_dt = STR_TO_DATE(
  order_purchase_timestamp, '%Y-%m-%d %H:%i:%s'
);

-- Step 3: Validate conversion (should be 0)
SELECT SUM(
  order_purchase_timestamp IS NOT NULL 
  AND order_purchase_dt IS NULL
) AS bad_parse;

-- Step 4: Swap columns
ALTER TABLE orders
  DROP COLUMN order_purchase_timestamp,
  RENAME COLUMN order_purchase_dt TO order_purchase_timestamp;
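
Timeline validation then compared the converted timestamps directly; a sketch using the delivery and purchase columns above:

```sql
-- Orders delivered before they were purchased indicate broken timelines (expect 0)
SELECT COUNT(*) AS bad_timelines
FROM orders
WHERE order_delivered_customer_date < order_purchase_timestamp;
```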

Relationship Modeling

Converted TEXT keys into indexed VARCHAR fields and established relational integrity:

Primary Keys

  • customers(customer_id)
  • orders(order_id)
  • products(product_id)
  • sellers(seller_id)

Composite Keys

  • order_items(order_id, order_item_id)
  • order_payments(order_id, payment_sequential)
  • order_reviews(order_id, review_id)

Foreign Key Relationships

  • orders → customers
  • order_items → orders / products / sellers
  • order_payments → orders
  • order_reviews → orders

Design Decision: A foreign key to the category translation table was intentionally omitted, since category values without matching translation rows would break the constraint. This reflects pragmatic production database design.
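
The keys listed above map to DDL along these lines (constraint names are illustrative):

```sql
-- Primary and composite keys
ALTER TABLE customers   ADD PRIMARY KEY (customer_id);
ALTER TABLE orders      ADD PRIMARY KEY (order_id);
ALTER TABLE order_items ADD PRIMARY KEY (order_id, order_item_id);

-- Example foreign keys
ALTER TABLE orders
  ADD CONSTRAINT fk_orders_customers
  FOREIGN KEY (customer_id) REFERENCES customers (customer_id);

ALTER TABLE order_items
  ADD CONSTRAINT fk_items_orders
  FOREIGN KEY (order_id) REFERENCES orders (order_id);
```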

Phase 2

Analytical SQL

Enriched Views

Created semantic views to simplify analysis and reduce repeated joins:

vw_orders_enriched
CREATE OR REPLACE VIEW vw_orders_enriched AS
SELECT
    o.order_id,
    o.customer_id,
    o.order_status,
    o.order_purchase_timestamp,
    o.order_delivered_customer_date,
    o.order_estimated_delivery_date,
    c.customer_unique_id,
    c.customer_city,
    c.customer_state
FROM orders o
LEFT JOIN customers c
  ON c.customer_id = o.customer_id;

vw_order_items_enriched
CREATE OR REPLACE VIEW vw_order_items_enriched AS
SELECT
    oi.order_id,
    oi.order_item_id,
    oi.product_id,
    oi.seller_id,
    oi.price,
    oi.freight_value,
    p.product_category_name,
    s.seller_city,
    s.seller_state
FROM order_items oi
LEFT JOIN products p ON p.product_id = oi.product_id
LEFT JOIN sellers s ON s.seller_id = oi.seller_id;

Views create a semantic layer that abstracts complex joins and provides a clean interface for business analysts and downstream reporting.
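
Downstream queries can then read from the semantic layer instead of repeating joins, for example:

```sql
-- Delivered orders per customer state, via the semantic layer
SELECT customer_state, COUNT(*) AS delivered_orders
FROM vw_orders_enriched
WHERE order_status = 'delivered'
GROUP BY customer_state
ORDER BY delivered_orders DESC;
```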

Performance Indexing

Strategic indexes added to optimize analytical workloads:

Index Strategy
orders(customer_id)              -- customer lookup
orders(order_purchase_timestamp) -- time-series queries
order_items(order_id)            -- order-item joins
order_items(product_id)          -- product analysis
order_payments(order_id)         -- payment aggregation
order_reviews(order_id)          -- review correlation

Indexing rationale: Optimized for time-series queries, heavy joins, and analytics workload patterns typical in e-commerce analysis.
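
In MySQL 8 the strategy above becomes CREATE INDEX statements (index names are illustrative):

```sql
CREATE INDEX idx_orders_customer ON orders (customer_id);
CREATE INDEX idx_orders_purchase ON orders (order_purchase_timestamp);
CREATE INDEX idx_items_order     ON order_items (order_id);
CREATE INDEX idx_items_product   ON order_items (product_id);
CREATE INDEX idx_payments_order  ON order_payments (order_id);
CREATE INDEX idx_reviews_order   ON order_reviews (order_id);
```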

Core Business Metrics

Reusable analytical SQL modules were developed covering:

  • Revenue & GMV calculations
  • Orders and customer behavior
  • Monthly trends and AOV
  • Category and product performance
  • Seller performance
  • Payment behavior
  • Delivery logistics KPIs
  • Customer geography insights

Monthly AOV (Average Order Value)
SELECT
    DATE_FORMAT(o.order_purchase_timestamp, '%Y-%m') AS ym,
    AVG(t.order_gmv) AS avg_order_value
FROM orders o
JOIN (
    SELECT order_id, SUM(price + freight_value) AS order_gmv
    FROM order_items
    GROUP BY order_id
) t ON t.order_id = o.order_id
GROUP BY ym
ORDER BY ym;

Late Delivery Rate by State
SELECT 
    c.customer_state,
    COUNT(*) AS delivered_orders,
    SUM(o.order_delivered_customer_date > 
        o.order_estimated_delivery_date) AS late_deliveries,
    ROUND(100 * SUM(o.order_delivered_customer_date > 
        o.order_estimated_delivery_date) / COUNT(*), 2) AS late_rate_pct
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_delivered_customer_date IS NOT NULL
  AND o.order_estimated_delivery_date IS NOT NULL
GROUP BY c.customer_state
HAVING COUNT(*) >= 200
ORDER BY late_rate_pct DESC;

Advanced

Cohort Retention Analysis

Customers are grouped by their first purchase month, then monthly activity is tracked to measure long-term engagement and loyalty.

Cohort Metrics

  • cohort_ym: Year-month of the customer's first purchase
  • month_index: Months since first purchase (0, 1, 2, ...)
  • active_customers: Customers who purchased in this month
  • cohort_customers: Total customers in the cohort
  • retention_pct: Percentage of the cohort still active

Business Value

Revenue shows growth, but retention shows loyalty and marketplace health. Cohort analysis reveals whether customers return after their first purchase — a critical metric for sustainable e-commerce growth.

Cohort Retention Query (CTE-based)
WITH first_purchase AS (
    SELECT
        c.customer_unique_id,
        MIN(DATE(o.order_purchase_timestamp)) AS first_order_date
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY c.customer_unique_id
),

cohorts AS (
    SELECT
        customer_unique_id,
        DATE_FORMAT(first_order_date, '%Y-%m') AS cohort_ym
    FROM first_purchase
),

cohort_size AS (
    SELECT cohort_ym, COUNT(DISTINCT customer_unique_id) AS cohort_customers
    FROM cohorts
    GROUP BY cohort_ym
),

cohort_activity AS (
    SELECT
        ch.cohort_ym,
        PERIOD_DIFF(
            REPLACE(DATE_FORMAT(o.order_purchase_timestamp,'%Y-%m'),'-',''),
            REPLACE(ch.cohort_ym,'-','')
        ) AS month_index,
        COUNT(DISTINCT c.customer_unique_id) AS active_customers
    FROM cohorts ch
    JOIN customers c ON c.customer_unique_id = ch.customer_unique_id
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY ch.cohort_ym, month_index
)

SELECT
    ca.cohort_ym,
    ca.month_index,
    ca.active_customers,
    cs.cohort_customers,
    ROUND(100 * ca.active_customers / cs.cohort_customers, 2) AS retention_pct
FROM cohort_activity ca
JOIN cohort_size cs USING (cohort_ym)
WHERE ca.month_index >= 0
ORDER BY ca.cohort_ym, ca.month_index;

Technical Highlights

  • High-performance CSV ingestion
  • 🔄 Professional ETL workflow
  • 🔗 Real relational PK/FK modeling
  • 📈 Index-aware analytical SQL
  • 👥 Cohort temporal logic
  • 🏭 Production-style pipeline

Key Learnings

🔍 Data Quality First

Data quality validation is critical before relationship modeling. Orphan records and duplicates can break foreign key constraints.

📊 Cohort Depth

Cohort analysis reveals deeper customer behavior than revenue alone. Retention metrics expose marketplace health.

🧩 Semantic Abstraction

Views create a reusable analytical abstraction layer, simplifying downstream queries and reducing error-prone joins.

⚡ Index Impact

Indexing decisions significantly affect query performance. Strategic indexes on join and filter columns are essential.

How to Reproduce

Run the SQL scripts in the following order:

  • 01_phase1_bootstrap_load.sql
  • 02_phase1.2_dataquality_sanitycheck.sql
  • 03_phase1.3_create_relationships.sql
  • 04_Phase2.1_create_views.sql
  • 05_Phase2.2_core_business_metrics.sql
  • 06_Phase2.3_cohort_analysis.sql
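
Assuming a local MySQL 8 server, the mysql CLI, and that each script selects the olist schema itself, the sequence can be run like this (--local-infile=1 enables the bulk LOAD DATA steps):

```shell
# Run each phase script in order against the local server
for f in 01_phase1_bootstrap_load.sql \
         02_phase1.2_dataquality_sanitycheck.sql \
         03_phase1.3_create_relationships.sql \
         04_Phase2.1_create_views.sql \
         05_Phase2.2_core_business_metrics.sql \
         06_Phase2.3_cohort_analysis.sql; do
  mysql --local-infile=1 -u root -p < "$f"
done
```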
