The Data Detective Challenge: A Real-World Data Quest
While this story mirrors real production challenges, our demo will use smaller datasets (10-100MB) to focus on learning the technology stack. The patterns and code remain the same - just scale up the resources for production!
The Urgent Call
It's 9:43 AM when your phone buzzes. Sarah from Marketing is calling:
"Alex! Our 'Summer Splash' campaign just went viral! We're seeing 3x normal user activity, and the CEO wants to know user engagement patterns by 5 PM. Our old Excel sheets can't handle this. Can you help?"
You smile, knowing you've been preparing for this moment with your new data stack.
Your Mission
You have 7 hours to:
- Process the incoming campaign data
- Analyze user engagement patterns
- Create an executive dashboard
- Present actionable insights
Let's dive in!
Step 1: Setting Up Your Workspace
First, access your JupyterHub instance:
# Access JupyterHub (replace with your actual domain)
https://jupyter.yourdomain.com
# Login with SSO credentials
username: alex.data
password: ********
Create a new notebook with PySpark:
# Initialize Environment
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import pandas as pd
import matplotlib.pyplot as plt
# Create Spark session with Iceberg support
spark = SparkSession.builder \
.appName("CampaignAnalysis") \
.config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.2") \
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
.config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.demo.type", "hadoop") \
.config("spark.sql.catalog.demo.warehouse", "/opt/iceberg/warehouse") \
.getOrCreate()
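Once the session starts, a quick sanity check confirms that the Iceberg extension and the `demo` catalog configured above are actually active. This is a minimal verification sketch; the catalog name matches the config in the previous cell:

# Verify the session and the Iceberg catalog configuration
print(spark.version)
print(spark.conf.get("spark.sql.catalog.demo"))

# Listing namespaces in the demo catalog should succeed (it will be empty at first)
spark.sql("SHOW NAMESPACES IN demo").show()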
Step 2: Creating the Data Structure
Set up Iceberg tables to capture the streaming campaign data:
# Create database
spark.sql("CREATE DATABASE IF NOT EXISTS demo.campaign")
# Create user_events table
spark.sql("""
CREATE TABLE IF NOT EXISTS demo.campaign.user_events (
event_id STRING,
user_id STRING,
event_timestamp TIMESTAMP,
event_type STRING,
page_url STRING,
session_id STRING,
campaign_id STRING,
user_agent STRING,
geo_location STRING
) USING iceberg
PARTITIONED BY (hours(event_timestamp))
""")
# Create user_profiles table
spark.sql("""
CREATE TABLE IF NOT EXISTS demo.campaign.user_profiles (
user_id STRING,
signup_date DATE,
age_group STRING,
gender STRING,
country STRING,
interests ARRAY<STRING>
) USING iceberg
""")
Step 3: Processing Incoming Data
Set up a streaming job to process incoming campaign data:
# Read streaming data
# NOTE: the "rate" source simulates ~100 events/second for the demo;
# in production, replace it with a Kafka or Kinesis source
stream_df = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", 100) \
    .load()
# Transform stream data
from pyspark.sql.functions import expr, from_json, col
from pyspark.sql.types import *
# Define schema for incoming JSON
event_schema = StructType([
StructField("event_id", StringType()),
StructField("user_id", StringType()),
StructField("event_type", StringType()),
# ... other fields
])
# Process and write to Iceberg
def process_batch(batch_df, batch_id):
    # Enrich data (in production, you would first parse the raw payload with
    # from_json(col("value"), event_schema) so the columns match user_events)
    enriched_df = batch_df.withColumn("event_timestamp", current_timestamp()) \
        .withColumn("campaign_id", lit("SUMMER_SPLASH_2024"))

    # Write to Iceberg table
    enriched_df.write \
        .format("iceberg") \
        .mode("append") \
        .save("demo.campaign.user_events")
# Start the streaming job
stream_df.writeStream \
.foreachBatch(process_batch) \
.outputMode("append") \
.start()
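One caveat with the demo wiring above: the rate source only emits `timestamp` and `value` columns, so appending straight to `user_events` will hit a schema mismatch. A self-contained workaround is to synthesize the remaining event fields inside `process_batch`; the sketch below uses made-up placeholder values (user IDs, page URL, geo) purely for the demo, and would be used in place of the simpler `process_batch` above when starting the stream:

def process_batch(batch_df, batch_id):
    # Map the rate source's `value` column onto the user_events schema
    # (all field values here are simulated placeholders)
    enriched_df = batch_df.select(
        expr("uuid()").alias("event_id"),
        concat(lit("user_"), (col("value") % 1000).cast("string")).alias("user_id"),
        current_timestamp().alias("event_timestamp"),
        lit("page_view").alias("event_type"),
        lit("/landing/summer-splash").alias("page_url"),
        expr("uuid()").alias("session_id"),
        lit("SUMMER_SPLASH_2024").alias("campaign_id"),
        lit("demo-agent").alias("user_agent"),
        lit("US").alias("geo_location"),
    )
    enriched_df.write.format("iceberg").mode("append").save("demo.campaign.user_events")

# Keep a handle on the query so it can be monitored or stopped later
query = stream_df.writeStream.foreachBatch(process_batch).outputMode("append").start()
print(query.status)

Holding on to the `StreamingQuery` object also lets you call `query.stop()` cleanly once the demo run is finished.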
Step 4: Real-time Analysis
Create analysis queries to understand user engagement:
# Analyze engagement patterns
engagement_df = spark.sql("""
SELECT
date_trunc('hour', event_timestamp) as hour,
event_type,
count(*) as event_count,
count(distinct user_id) as unique_users
FROM demo.campaign.user_events
WHERE campaign_id = 'SUMMER_SPLASH_2024'
GROUP BY 1, 2
ORDER BY 1, 2
""")
# Save results for visualization
engagement_df.write \
.format("iceberg") \
.mode("overwrite") \
.saveAsTable("demo.campaign.hourly_metrics")
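Before moving to Superset, you can eyeball the aggregates directly in the notebook using the pandas and matplotlib imports from Step 1. This is a minimal sketch that assumes the `hourly_metrics` table written above:

# Pull the small aggregate table into pandas for a quick plot
pdf = spark.table("demo.campaign.hourly_metrics").toPandas()
pdf = pdf.sort_values("hour")

# One line per event type: unique users over time
for event_type, grp in pdf.groupby("event_type"):
    plt.plot(grp["hour"], grp["unique_users"], marker="o", label=event_type)

plt.xlabel("Hour")
plt.ylabel("Unique users")
plt.title("Summer Splash: unique users per hour")
plt.legend()
plt.show()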
Step 5: Creating Dashboards in Superset
Now, let's visualize our findings in Superset:
- Access Superset (http://superset.yourdomain.com)
- Add a database connection that can query your Iceberg tables. Superset doesn't talk to Iceberg directly; it connects through a SQL engine such as the Spark Thrift Server or Trino via a SQLAlchemy URI entered in the database connection form, for example:
# Spark Thrift Server (HiveServer2 protocol, default port 10000)
hive://localhost:10000/default
# Trino with an Iceberg catalog
trino://trino@localhost:8080/iceberg
# Replace host and port with the values for your deployment
- Create these essential charts:
- Hourly User Engagement
- User Demographics
- Campaign Conversion Funnel
- Geographic Distribution
-- Example Superset Query for Engagement Chart
SELECT
hour,
event_type,
event_count,
unique_users,
ROUND((event_count - LAG(event_count) OVER (PARTITION BY event_type ORDER BY hour)) * 100.0
    / LAG(event_count) OVER (PARTITION BY event_type ORDER BY hour), 2) as growth_rate
FROM demo.campaign.hourly_metrics
ORDER BY hour DESC
Step 6: The Big Reveal
It's 4:30 PM. Your dashboard is ready, showing:
- Peak engagement times (2-3 PM)
- Most active user segments
- Campaign conversion rates
- Geographic hotspots
Results & Impact
Your analysis revealed:
- 327% increase in user engagement
- 42% higher conversion rate
- Key demographic: Urban professionals, 25-34
- Most effective campaign touchpoint: Mobile app
Technical Architecture
Development Notes
For this demo:
- Data volume: ~100MB/hour
- Spark cluster: 2-3 worker nodes
- Iceberg tables: Local storage
- Retention period: 7 days
Scale for production (a configuration sketch follows below):
- Increase Spark workers
- Move to cloud storage
- Add monitoring
- Implement backup strategy
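As a rough illustration of those production changes on the Spark side, the session config from Step 1 might evolve as sketched below. The bucket name, package versions, and executor sizes are placeholders, not recommendations, and cloud credentials are assumed to come from the environment or an instance profile:

# Hypothetical production-style session: object-storage warehouse + larger cluster
spark = SparkSession.builder \
    .appName("CampaignAnalysis") \
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.2,"
            "org.apache.hadoop:hadoop-aws:3.3.4") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.demo.type", "hadoop") \
    .config("spark.sql.catalog.demo.warehouse", "s3a://your-bucket/iceberg/warehouse") \
    .config("spark.executor.instances", "10") \
    .config("spark.executor.memory", "8g") \
    .getOrCreate()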
Need help? Check the troubleshooting guide in the next section...