The complete user experience, monitoring setup, and lessons learned from building a real data platform
This is Part 2 of our Silism Commerce 360 series. If you haven’t read Part 1 about the architecture and getting started, start there →
The Streamlit Dashboard: Where Data Becomes Insights
The Streamlit interface is divided into six modules. Let’s walk through each one.
1. Executive Dashboard 📊
This is what you’d show to your CEO. High-level metrics that answer: “How’s the business doing?”

You’ll see:
- Total Revenue: All-time revenue across all orders
- Total Customers: Unique customers who’ve made purchases
- Total Orders: Number of orders placed
- Average Order Value: Revenue divided by orders
Below that, a revenue trend chart showing the last 12 months. You can hover over points to see exact values.
What’s happening behind the scenes:
```sql
-- This is what the dashboard queries
SELECT
    COUNT(DISTINCT customer_fk)    AS total_customers,
    SUM(total_orders)              AS total_orders,
    ROUND(SUM(total_revenue), 2)   AS total_revenue,
    ROUND(AVG(avg_order_value), 2) AS avg_order_value
FROM main.fct_monthly_revenue
```
The dashboard hits the MCP Server, which queries DuckDB and returns the results in milliseconds.
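The 12-month trend chart is just another aggregation over the same mart, grouped by month. A sketch of what that query could look like (I’m assuming the mart has a month column; the name `revenue_month` is my guess, not confirmed from the project):
```sql
-- Sketch: monthly revenue trend (revenue_month is an assumed column name)
SELECT
    revenue_month,
    ROUND(SUM(total_revenue), 2) AS monthly_revenue
FROM main.fct_monthly_revenue
GROUP BY revenue_month
ORDER BY revenue_month DESC
LIMIT 12;
```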
2. AI Copilot 🤖
This is where it gets interesting. Instead of writing SQL, you just ask questions in plain English.

Try asking:
- “Show me the top 10 customers by revenue”
- “What’s our revenue trend over the last 6 months?”
- “Which products are performing best?”
- “How many customers are at risk of churning?”
The cool part? The AI understands your business context. It knows about customer segments, revenue metrics, and product performance because the MCP Server provides that context.
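Under the hood, the Copilot turns your question into SQL and executes it through the MCP Server. For “show me the top 10 customers by revenue,” the generated query might look something like this sketch (based on the mart shown above, not the Copilot’s literal output):
```sql
-- Illustrative: the kind of SQL the Copilot might generate for "top 10 customers by revenue"
SELECT
    customer_fk,
    ROUND(SUM(total_revenue), 2) AS lifetime_revenue
FROM main.fct_monthly_revenue
GROUP BY customer_fk
ORDER BY lifetime_revenue DESC
LIMIT 10;
```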
3. Advanced Analytics 📈
This module has four sub-tabs with deeper analysis.
Cohort Analysis

This shows customer retention by cohort (month they first purchased). Each row is a cohort, each column is months since first purchase.
- Green cells: High retention (good!)
- Yellow cells: Medium retention (watch these)
- Red cells: Low retention (problem!)
You can see patterns like:
- “Customers from January 2024 have 65% retention after 3 months”
- “Retention drops off sharply after month 6”
- “Recent cohorts are performing better than older ones”
This tells you if your product improvements and marketing are working.
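If you want to reproduce the underlying numbers in the Data Explorer, the core of a cohort query looks roughly like this (a sketch: `fct_orders` and its columns are hypothetical stand-ins for the project’s actual order mart):
```sql
-- Cohort retention sketch against a hypothetical fct_orders(customer_id, order_date)
WITH first_purchase AS (
    SELECT customer_id,
           DATE_TRUNC('month', MIN(order_date)) AS cohort_month
    FROM fct_orders
    GROUP BY customer_id
)
SELECT f.cohort_month,
       date_diff('month', f.cohort_month, DATE_TRUNC('month', o.order_date)) AS months_since_first,
       COUNT(DISTINCT o.customer_id) AS active_customers
FROM fct_orders o
JOIN first_purchase f USING (customer_id)
GROUP BY 1, 2
ORDER BY 1, 2;
```
Divide each cell by its cohort’s month-0 count and you have the retention percentages shown in the heatmap.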
Customer Segmentation

Customers are automatically segmented using RFM analysis (Recency, Frequency, Monetary):
- Champions: Best customers (recent, frequent, high-value)
- Loyal: Regular customers, good value
- At Risk: Haven’t purchased recently, but were valuable
- Lost: Haven’t purchased in a long time
The chart shows how many customers are in each segment, and the table shows metrics like average lifetime value and days since last order.
Business insight: If you see a lot of “At Risk” customers, you need re-engagement campaigns. If “Champions” are growing, your retention is working.
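A minimal version of the RFM logic in SQL could look like this (the `fct_orders` table and the cut-off values are illustrative assumptions; production implementations usually score each dimension into quantiles instead of hard-coding thresholds):
```sql
-- RFM segmentation sketch; fct_orders and the thresholds are assumptions
WITH rfm AS (
    SELECT customer_id,
           date_diff('day', MAX(order_date), CURRENT_DATE) AS recency_days,
           COUNT(*)         AS frequency,
           SUM(order_total) AS monetary
    FROM fct_orders
    GROUP BY customer_id
)
SELECT customer_id,
       CASE
           WHEN recency_days <= 30 AND frequency >= 5 AND monetary >= 500 THEN 'Champions'
           WHEN recency_days <= 90 AND frequency >= 2                     THEN 'Loyal'
           WHEN recency_days  > 90 AND monetary  >= 500                   THEN 'At Risk'
           ELSE 'Lost'
       END AS segment
FROM rfm;
```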
Product Performance

This shows:
- Top products by revenue
- Performance tiers (Star, Cash Cow, Question Mark, Dog – BCG matrix style)
- Category performance
- Products that haven’t sold recently
Use this to:
- Identify best-sellers to promote
- Find underperforming products to discount or discontinue
- Spot trends in categories
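One simple way to assign those tiers is to bucket products into revenue quartiles (the real BCG matrix is two-dimensional, growth vs. share, so this single-axis version is a deliberate simplification; `fct_product_sales` is a hypothetical mart name):
```sql
-- Quartile-based tier sketch; fct_product_sales is hypothetical
WITH ranked AS (
    SELECT product_id,
           SUM(revenue) AS total_revenue,
           NTILE(4) OVER (ORDER BY SUM(revenue) DESC) AS quartile
    FROM fct_product_sales
    GROUP BY product_id
)
SELECT product_id,
       total_revenue,
       CASE quartile
           WHEN 1 THEN 'Star'
           WHEN 2 THEN 'Cash Cow'
           WHEN 3 THEN 'Question Mark'
           ELSE 'Dog'
       END AS performance_tier
FROM ranked
ORDER BY total_revenue DESC;
```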
Journey Analytics

This tracks customers through their journey:
- Awareness: Page views, site visits
- Consideration: Product views, category browsing
- Purchase: Add to cart, checkout
- Retention: Repeat purchases
The funnel shows where customers drop off. If you see a big drop between “Add to Cart” and “Checkout,” you have a checkout problem.
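The funnel numbers themselves come from counting distinct customers at each stage. A sketch, assuming a hypothetical `fct_events(customer_id, event_type)` table built from the event stream:
```sql
-- Funnel stage counts; fct_events and the event_type values are assumptions
SELECT
    COUNT(DISTINCT CASE WHEN event_type = 'page_view'    THEN customer_id END) AS awareness,
    COUNT(DISTINCT CASE WHEN event_type = 'product_view' THEN customer_id END) AS consideration,
    COUNT(DISTINCT CASE WHEN event_type = 'add_to_cart'  THEN customer_id END) AS add_to_cart,
    COUNT(DISTINCT CASE WHEN event_type = 'checkout'     THEN customer_id END) AS checkout
FROM fct_events;
```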
4. Data Explorer 🔍
For when you want to write your own SQL (or just explore the data).

Features:
- Table browser: See all available tables
- Sample data: Preview any table
- Schema viewer: See column names and types
- SQL editor: Write custom queries
- Query examples: Pre-written queries to learn from
- CSV export: Download results
This is your power-user interface. The AI Copilot is great for quick questions, but sometimes you need to write complex queries yourself.
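If you’re not sure where to start, DuckDB’s built-in introspection commands are a good first query:
```sql
SHOW TABLES;                                      -- list every table in the database
DESCRIBE main.fct_monthly_revenue;                -- column names and types
SELECT * FROM main.fct_monthly_revenue LIMIT 10;  -- preview a few rows
```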
5. Customer Intelligence 🎯
Deep dive into customer behavior and segmentation.

Shows:
- Customer distribution by lifecycle stage (Active, At Risk, Dormant, Lost)
- Average revenue by segment
- Detailed segment metrics
- Customer value trends
Use this to:
- Identify which customers need attention
- Calculate customer lifetime value
- Plan targeted marketing campaigns
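Lifecycle staging like this usually boils down to recency bucketing. A sketch (the thresholds and the `fct_orders` table are assumptions; tune the day counts to your typical purchase cycle):
```sql
-- Lifecycle stage sketch; thresholds and fct_orders are assumptions
SELECT customer_id,
       CASE
           WHEN date_diff('day', MAX(order_date), CURRENT_DATE) <= 30  THEN 'Active'
           WHEN date_diff('day', MAX(order_date), CURRENT_DATE) <= 90  THEN 'At Risk'
           WHEN date_diff('day', MAX(order_date), CURRENT_DATE) <= 180 THEN 'Dormant'
           ELSE 'Lost'
       END AS lifecycle_stage
FROM fct_orders
GROUP BY customer_id;
```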
6. Product Performance 📦
Dedicated product analytics dashboard.

Shows:
- Top products by revenue
- Performance tier distribution
- Category performance comparison
- Detailed product metrics table
Use this to:
- Optimize inventory
- Plan promotions
- Identify product gaps
Monitoring: Keeping Everything Running Smoothly
Here’s the truth about data platforms: they break. Databases go down, pipelines fail, queries time out. The difference between a hobby project and a production system is monitoring.
Why Monitoring Matters
Imagine this scenario:
- Your data pipeline fails at 2 AM
- No one notices until 9 AM when business users complain
- You spend 2 hours figuring out what broke
- You fix it, but the morning reports are delayed
- Everyone’s mad
With proper monitoring:
- The pipeline fails at 2 AM
- You get an alert immediately (or it auto-recovers)
- You check the dashboard, see exactly what failed
- You fix it in 10 minutes
- Morning reports run on time
- You’re a hero
The Monitoring Stack
We use four main tools:
- Prometheus: Collects metrics (CPU, memory, request rates, query times)
- Grafana: Visualizes metrics in dashboards
- Loki: Aggregates logs from all services
- Jaeger: Traces requests across services
Plus Uptime Kuma for simple service health checks.
Accessing Grafana
Open http://localhost:3001 and log in with admin/admin123.
The Dashboards
Grafana comes pre-configured with several dashboards. Let’s walk through each one.
1. Data Platform Overview
This is your “mission control” dashboard. At a glance, you can see service health, resource usage, and other platform-wide metrics.
2. Airflow Monitoring
This dashboard focuses on your data pipelines.
What to look for:
- DAG duration increasing: Your data is growing, or queries are slowing down
- High failure rate: Data quality issues or infrastructure problems
- Tasks stuck in queue: Not enough workers, or scheduler is slow
- Scheduler heartbeat missing: Scheduler crashed (very bad!)
Example alert: “DAG ecom_data_pipeline failed 3 times in the last hour”
3. MCP Server Monitoring
This tracks your AI/analytics service.
What to look for:
- High latency: AI model is slow, or database is overloaded
- High error rate: AI failing to generate SQL, or database errors
- Slow queries: Need to optimize SQL or add indexes
- Connection pool exhausted: Need more database connections
Example alert: “MCP Server response time p95 > 5 seconds for 5 minutes”
4. Streamlit Application
This tracks your user-facing dashboard.
What to look for:
- High page load times: Queries are slow, or too much data rendering
- Low cache hit rate: Cache not working, or too many unique queries
- Increasing memory: Memory leak in Streamlit app
- Application errors: Bugs in the code
Example alert: “Streamlit page load time p95 > 10 seconds”
Using Prometheus Directly
Sometimes you want to query metrics directly. Open http://localhost:9090.
Try these queries:
```promql
# CPU usage by service
avg by(container_name) (irate(container_cpu_usage_seconds_total[5m])) * 100

# Memory usage by service (as a percentage of the container's memory limit)
container_memory_usage_bytes / container_spec_memory_limit_bytes * 100

# Request rate to MCP Server
rate(http_requests_total{service="mcp-server"}[5m])

# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate
rate(http_requests_total{code=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100
```
Prometheus has a powerful query language (PromQL) for slicing and dicing metrics.
Checking Logs with Loki
When something breaks, you need logs. Open Grafana, go to “Explore,” and select “Loki” as the data source.
Try these queries:
```logql
# All logs from MCP Server
{container_name="silism-mcp-server"}

# Errors from any service
{container_name=~".+"} |= "ERROR"

# Airflow task failures
{container_name="silism-airflow-scheduler"} |= "Task failed"

# Slow queries (>1 second)
{container_name="silism-mcp-server"} |= "execution_time" | json | execution_time > 1000
```
Logs are structured as JSON, so you can filter by specific fields.
Distributed Tracing with Jaeger
Want to see exactly how a request flows through your system? Open http://localhost:16686.
Select a service (like mcp-server) and click “Find Traces.” You’ll see a list of requests. Click on one to see the full trace:
```
User Request → Streamlit → MCP Server → DuckDB
   100ms         50ms         30ms        20ms
```
This shows you:
- Total request time (100ms)
- Time spent in each service
- Where the bottlenecks are
If a request is slow, tracing tells you exactly which part is slow.
Service Health with Uptime Kuma
For a simple “is it up or down” view, open http://localhost:3002.
This shows:
- ✅ Service is up
- ❌ Service is down
- ⚠️ Service is slow
It also tracks uptime percentage over time.
Setting Up Alerts
Monitoring is only useful if you get notified when things break. Grafana can send alerts to:
- Slack
- PagerDuty
- Webhooks
- And many more
To set up an alert:
- Open a dashboard in Grafana
- Click on a panel (like “CPU Usage”)
- Click “Edit”
- Go to “Alert” tab
- Click “Create Alert”
- Set conditions (e.g., “CPU > 80% for 5 minutes”)
- Set notification channel (e.g., Slack)
- Save
Now you’ll get notified when CPU goes above 80% for more than 5 minutes.
Recommended alerts:
- Any service down for >1 minute (Critical)
- DAG failure (Warning)
- CPU >80% for >5 minutes (Warning)
- Free disk space <20% (Warning)
- Error rate >5% for >5 minutes (Warning)
- Response time p95 >5 seconds for >5 minutes (Warning)
What I Learned Building This
Building this platform taught me a lot about what it takes to run data infrastructure in production:
1. Monitoring isn’t optional. You can’t fix what you can’t see. The monitoring stack took almost as long to build as the data pipeline itself, but it’s worth it.
2. AI is powerful, but needs structure. Just throwing questions at an LLM doesn’t work. The MCP Server provides structure – tools, resources, context. That’s what makes the AI Copilot actually useful.
3. DuckDB is underrated. Everyone talks about Snowflake and BigQuery, but DuckDB is incredibly fast and runs anywhere. Perfect for local development and small-to-medium datasets.
4. dbt is the right abstraction. Writing raw SQL scripts is fine for one-off queries, but dbt’s modularity, testing, and documentation make it the right choice for production transformations.
5. Docker makes everything easier. 15+ services running with one command. No “works on my machine” problems. Docker Compose is magic.
6. Real-time + batch is the future. Kafka for real-time events, scheduled pipelines for batch processing. You need both.
7. Good UX matters for data tools. Data engineers often ignore UX, but a beautiful dashboard gets used. An ugly one gets ignored.
Final Thoughts
Building data platforms is hard. There are a lot of moving pieces, and things break in unexpected ways. But it’s also incredibly rewarding when everything clicks together and you can answer business questions in seconds instead of days.
This platform isn’t perfect – no platform is. But it’s a solid foundation that mirrors what you’d find at real companies. It’s production-grade without being overcomplicated.
If you’re learning data engineering, I hope this gives you a realistic view of what modern data platforms look like. If you’re an experienced engineer, I hope you find some useful patterns or tools you haven’t tried yet.
Either way, happy building! 🚀
Try It Yourself
The entire platform is open source and available on GitHub:
👉 github.com/Silism/silism-commerce-360-platform
Clone it, run it, break it, fix it, extend it. That’s how you learn.
If you build something cool with it, let me know! I’m always curious to see what people create.
← Back to Part 1: Architecture and Getting Started
Built with ❤️ by Silism — Empowering Data Teams with AI-Native Platforms
