End-to-end SQL analytics pipeline built in MySQL 8, covering data engineering, relational modeling, business KPI analysis, and cohort retention analytics.
This project simulates a real-world analytics workflow using the Brazilian Olist E-Commerce dataset. The objective was to build a production-style SQL pipeline that transforms raw CSV files into a structured relational database and delivers meaningful business insights through advanced SQL analysis.
The pipeline follows a structured flow from raw data ingestion to business insight delivery:
Important: Tables were initially created using MySQL Workbench Import Wizard to generate schema structures. After structure creation, tables were emptied and data was reloaded using LOAD DATA LOCAL INFILE to simulate production bulk ingestion.
The data engineering phase established the foundation for all subsequent analysis:
olistLOCAL INFILE for high-performance ingestionSET FOREIGN_KEY_CHECKS=0;
SET UNIQUE_CHECKS=0;
SET AUTOCOMMIT=0;
LOAD DATA LOCAL INFILE 'olist_orders_dataset.csv'
INTO TABLE orders
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
COMMIT;
SET AUTOCOMMIT=1;
SET UNIQUE_CHECKS=1;
SET FOREIGN_KEY_CHECKS=1;
Extensive validation was performed before modeling relationships:
-- Orphan order_items (no matching order)
SELECT COUNT(*) AS orphan_order_items
FROM order_items oi
LEFT JOIN orders o ON o.order_id = oi.order_id
WHERE o.order_id IS NULL;
-- Orphan orders (no matching customer)
SELECT COUNT(*) AS orphan_orders_customers
FROM orders o
LEFT JOIN customers c ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL;
Safe datetime conversion used the add → populate → validate → swap pattern to ensure data integrity during type transformations.
-- Step 1: Add new DATETIME columns
ALTER TABLE orders
ADD COLUMN order_purchase_dt DATETIME NULL;
-- Step 2: Populate with parsed values
UPDATE orders
SET order_purchase_dt = STR_TO_DATE(
order_purchase_timestamp, '%Y-%m-%d %H:%i:%s'
);
-- Step 3: Validate conversion (should be 0)
SELECT SUM(
order_purchase_timestamp IS NOT NULL
AND order_purchase_dt IS NULL
) AS bad_parse;
-- Step 4: Swap columns
ALTER TABLE orders
DROP COLUMN order_purchase_timestamp,
RENAME COLUMN order_purchase_dt TO order_purchase_timestamp;
Converted TEXT keys into indexed VARCHAR fields and established relational integrity:
customers(customer_id)
orders(order_id)
products(product_id)
sellers(seller_id)
order_items(order_id, order_item_id)
order_payments(order_id, payment_sequential)
order_reviews(order_id, review_id)
orders → customers
order_items → orders / products / sellers
order_payments → orders
order_reviews → orders
Design Decision: Translation table foreign key intentionally skipped due to real-world data inconsistency risks. This reflects pragmatic production database design.
Created semantic views to simplify analysis and reduce repeated joins:
CREATE OR REPLACE VIEW vw_orders_enriched AS
SELECT
o.order_id,
o.customer_id,
o.order_status,
o.order_purchase_timestamp,
o.order_delivered_customer_date,
o.order_estimated_delivery_date,
c.customer_unique_id,
c.customer_city,
c.customer_state
FROM orders o
LEFT JOIN customers c
ON c.customer_id = o.customer_id;
CREATE OR REPLACE VIEW vw_order_items_enriched AS
SELECT
oi.order_id,
oi.order_item_id,
oi.product_id,
oi.seller_id,
oi.price,
oi.freight_value,
p.product_category_name,
s.seller_city,
s.seller_state
FROM order_items oi
LEFT JOIN products p ON p.product_id = oi.product_id
LEFT JOIN sellers s ON s.seller_id = oi.seller_id;
Views create a semantic layer that abstracts complex joins and provides a clean interface for business analysts and downstream reporting.
Strategic indexes added to optimize analytical workloads:
orders(customer_id) -- customer lookup
orders(order_purchase_timestamp) -- time-series queries
order_items(order_id) -- order-item joins
order_items(product_id) -- product analysis
order_payments(order_id) -- payment aggregation
order_reviews(order_id) -- review correlation
Indexing rationale: Optimized for time-series queries, heavy joins, and analytics workload patterns typical in e-commerce analysis.
Reusable analytical SQL modules were developed covering:
SELECT
DATE_FORMAT(o.order_purchase_timestamp, '%Y-%m') AS ym,
AVG(t.order_gmv) AS avg_order_value
FROM orders o
JOIN (
SELECT order_id, SUM(price + freight_value) AS order_gmv
FROM order_items
GROUP BY order_id
) t ON t.order_id = o.order_id
GROUP BY ym
ORDER BY ym;
SELECT
c.customer_state,
COUNT(*) AS delivered_orders,
SUM(o.order_delivered_customer_date >
o.order_estimated_delivery_date) AS late_deliveries,
ROUND(100 * SUM(o.order_delivered_customer_date >
o.order_estimated_delivery_date) / COUNT(*), 2) AS late_rate_pct
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_delivered_customer_date IS NOT NULL
AND o.order_estimated_delivery_date IS NOT NULL
GROUP BY c.customer_state
HAVING COUNT(*) >= 200
ORDER BY late_rate_pct DESC;
Customers are grouped by their first purchase month, then monthly activity is tracked to measure long-term engagement and loyalty.
| Metric | Description |
|---|---|
cohort_ym |
Year-month of customer's first purchase |
month_index |
Months since first purchase (0, 1, 2, ...) |
active_customers |
Customers who purchased in this month |
cohort_customers |
Total customers in the cohort |
retention_pct |
Percentage of cohort still active |
Revenue shows growth, but retention shows loyalty and marketplace health. Cohort analysis reveals whether customers return after their first purchase — a critical metric for sustainable e-commerce growth.
WITH first_purchase AS (
SELECT
c.customer_unique_id,
MIN(DATE(o.order_purchase_timestamp)) AS first_order_date
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
GROUP BY c.customer_unique_id
),
cohorts AS (
SELECT
customer_unique_id,
DATE_FORMAT(first_order_date, '%Y-%m') AS cohort_ym
FROM first_purchase
),
cohort_size AS (
SELECT cohort_ym, COUNT(DISTINCT customer_unique_id) AS cohort_customers
FROM cohorts
GROUP BY cohort_ym
),
cohort_activity AS (
SELECT
ch.cohort_ym,
PERIOD_DIFF(
REPLACE(DATE_FORMAT(o.order_purchase_timestamp,'%Y-%m'),'-',''),
REPLACE(ch.cohort_ym,'-','')
) AS month_index,
COUNT(DISTINCT c.customer_unique_id) AS active_customers
FROM cohorts ch
JOIN customers c ON c.customer_unique_id = ch.customer_unique_id
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY ch.cohort_ym, month_index
)
SELECT
ca.cohort_ym,
ca.month_index,
ca.active_customers,
cs.cohort_customers,
ROUND(100 * ca.active_customers / cs.cohort_customers, 2) AS retention_pct
FROM cohort_activity ca
JOIN cohort_size cs USING (cohort_ym)
WHERE ca.month_index >= 0
ORDER BY ca.cohort_ym, ca.month_index;
Data quality validation is critical before relationship modeling. Orphan records and duplicates can break foreign key constraints.
Cohort analysis reveals deeper customer behavior than revenue alone. Retention metrics expose marketplace health.
Views create a reusable analytical abstraction layer, simplifying downstream queries and reducing error-prone joins.
Indexing decisions significantly affect query performance. Strategic indexes on join and filter columns are essential.
Run the SQL scripts in the following order: