The Data Detective Challenge: A Real-World Data Quest
While this story mirrors real production challenges, our demo will use smaller datasets (10-100MB) to focus on learning the technology stack. The patterns and code remain the same - just scale up the resources for production!
The Urgent Call
It's 9:43 AM when your phone buzzes. Sarah from Marketing is calling:
"Alex! Our 'Summer Splash' campaign just went viral! We're seeing 3x normal user activity, and the CEO wants to know user engagement patterns by 5 PM. Our old Excel sheets can't handle this. Can you help?"
You smile, knowing you've been preparing for this moment with your new data stack.
Your Mission
You have 7 hours to:
- Process the incoming campaign data
- Analyze user engagement patterns
- Create an executive dashboard
- Present actionable insights
Let's dive in!
Step 1: Setting Up Your Workspace
First, access your JupyterHub instance:
# Access JupyterHub (replace with your actual domain)
https://jupyter.yourdomain.com
# Login with SSO credentials
username: alex.data
password: ********
Create a new notebook with PySpark:
# Initialize Environment
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import pandas as pd
import matplotlib.pyplot as plt
# Create Spark session with Iceberg support
spark = SparkSession.builder \
.appName("CampaignAnalysis") \
.config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.2") \
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
.config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.demo.type", "hadoop") \
.config("spark.sql.catalog.demo.warehouse", "/opt/iceberg/warehouse") \
.getOrCreate()
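Once the session starts, a quick sanity check confirms that the Iceberg extension and the `demo` catalog configured above are actually active. This is a minimal verification sketch; the catalog name matches the config in the previous cell:

# Verify the session and the Iceberg catalog configuration
print(spark.version)
print(spark.conf.get("spark.sql.catalog.demo"))

# Listing namespaces in the demo catalog should succeed (it will be empty at first)
spark.sql("SHOW NAMESPACES IN demo").show()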
Step 2: Creating the Data Structure
Set up Iceberg tables to capture the streaming campaign data:
# Create database
spark.sql("CREATE DATABASE IF NOT EXISTS demo.campaign")
# Create user_events table
spark.sql("""
CREATE TABLE IF NOT EXISTS demo.campaign.user_events (
event_id STRING,
user_id STRING,
event_timestamp TIMESTAMP,
event_type STRING,
page_url STRING,
session_id STRING,
campaign_id STRING,
user_agent STRING,
geo_location STRING
) USING iceberg
PARTITIONED BY (hours(event_timestamp))
""")
# Create user_profiles table
spark.sql("""
CREATE TABLE IF NOT EXISTS demo.campaign.user_profiles (
user_id STRING,
signup_date DATE,
age_group STRING,
gender STRING,
country STRING,
interests ARRAY<STRING>
) USING iceberg
""")
Step 3: Processing Incoming Data
Set up a streaming job to process incoming campaign data:
# Read streaming data
# NOTE: the "rate" source simulates ~100 events/second for the demo;
# in production, replace it with a Kafka or Kinesis source
stream_df = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", 100) \
    .load()
# Transform stream data
from pyspark.sql.functions import expr, from_json, col
from pyspark.sql.types import *
# Define schema for incoming JSON
event_schema = StructType([
StructField("event_id", StringType()),
StructField("user_id", StringType()),
StructField("event_type", StringType()),
# ... other fields
])
# Process and write to Iceberg
def process_batch(batch_df, batch_id):
    # Enrich data (in production, you would first parse the raw payload with
    # from_json(col("value"), event_schema) so the columns match user_events)
    enriched_df = batch_df.withColumn("event_timestamp", current_timestamp()) \
        .withColumn("campaign_id", lit("SUMMER_SPLASH_2024"))

    # Write to Iceberg table
    enriched_df.write \
        .format("iceberg") \
        .mode("append") \
        .save("demo.campaign.user_events")
# Start the streaming job
stream_df.writeStream \
.foreachBatch(process_batch) \
.outputMode("append") \
.start()
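One caveat with the demo wiring above: the rate source only emits `timestamp` and `value` columns, so appending straight to `user_events` will hit a schema mismatch. A self-contained workaround is to synthesize the remaining event fields inside `process_batch`; the sketch below uses made-up placeholder values (user IDs, page URL, geo) purely for the demo, and would be used in place of the simpler `process_batch` above when starting the stream:

def process_batch(batch_df, batch_id):
    # Map the rate source's `value` column onto the user_events schema
    # (all field values here are simulated placeholders)
    enriched_df = batch_df.select(
        expr("uuid()").alias("event_id"),
        concat(lit("user_"), (col("value") % 1000).cast("string")).alias("user_id"),
        current_timestamp().alias("event_timestamp"),
        lit("page_view").alias("event_type"),
        lit("/landing/summer-splash").alias("page_url"),
        expr("uuid()").alias("session_id"),
        lit("SUMMER_SPLASH_2024").alias("campaign_id"),
        lit("demo-agent").alias("user_agent"),
        lit("US").alias("geo_location"),
    )
    enriched_df.write.format("iceberg").mode("append").save("demo.campaign.user_events")

# Keep a handle on the query so it can be monitored or stopped later
query = stream_df.writeStream.foreachBatch(process_batch).outputMode("append").start()
print(query.status)

Holding on to the `StreamingQuery` object also lets you call `query.stop()` cleanly once the demo run is finished.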
Step 4: Real-time Analysis
Create analysis queries to understand user engagement:
# Analyze engagement patterns
engagement_df = spark.sql("""
SELECT
date_trunc('hour', event_timestamp) as hour,
event_type,
count(*) as event_count,
count(distinct user_id) as unique_users
FROM demo.campaign.user_events
WHERE campaign_id = 'SUMMER_SPLASH_2024'
GROUP BY 1, 2
ORDER BY 1, 2
""")
# Save results for visualization
engagement_df.write \
.format("iceberg") \
.mode("overwrite") \
.saveAsTable("demo.campaign.hourly_metrics")
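Before moving to Superset, you can eyeball the aggregates directly in the notebook using the pandas and matplotlib imports from Step 1. This is a minimal sketch that assumes the `hourly_metrics` table written above:

# Pull the small aggregate table into pandas for a quick plot
pdf = spark.table("demo.campaign.hourly_metrics").toPandas()
pdf = pdf.sort_values("hour")

# One line per event type: unique users over time
for event_type, grp in pdf.groupby("event_type"):
    plt.plot(grp["hour"], grp["unique_users"], marker="o", label=event_type)

plt.xlabel("Hour")
plt.ylabel("Unique users")
plt.title("Summer Splash: unique users per hour")
plt.legend()
plt.show()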
Step 5: Creating Dashboards in Superset
Now, let's visualize our findings in Superset:
- Access Superset (http://superset.yourdomain.com)
- Add a database connection that can query your Iceberg tables. Superset doesn't talk to Iceberg directly; it connects through a SQL engine such as the Spark Thrift Server or Trino via a SQLAlchemy URI entered in the database connection form, for example:
# Spark Thrift Server (HiveServer2 protocol, default port 10000)
hive://localhost:10000/default
# Trino with an Iceberg catalog
trino://trino@localhost:8080/iceberg
# Replace host and port with the values for your deployment
- Create these essential charts:
- Hourly User Engagement
- User Demographics
- Campaign Conversion Funnel
- Geographic Distribution
-- Example Superset Query for Engagement Chart
SELECT
hour,
event_type,
event_count,
unique_users,
ROUND((event_count - LAG(event_count) OVER (PARTITION BY event_type ORDER BY hour)) * 100.0
    / LAG(event_count) OVER (PARTITION BY event_type ORDER BY hour), 2) as growth_rate
FROM demo.campaign.hourly_metrics
ORDER BY hour DESC
Step 6: The Big Reveal
It's 4:30 PM. Your dashboard is ready, showing:
- Peak engagement times (2-3 PM)
- Most active user segments
- Campaign conversion rates
- Geographic hotspots
Results & Impact
Your analysis revealed:
- 327% increase in user engagement
- 42% higher conversion rate
- Key demographic: Urban professionals, 25-34
- Most effective campaign touchpoint: Mobile app
Technical Architecture
Development Notes
For this demo:
- Data volume: ~100MB/hour
- Spark cluster: 2-3 worker nodes
- Iceberg tables: Local storage
- Retention period: 7 days
Scale for production (a configuration sketch follows below):
- Increase Spark workers
- Move to cloud storage
- Add monitoring
- Implement backup strategy
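As a rough illustration of those production changes on the Spark side, the session config from Step 1 might evolve as sketched below. The bucket name, package versions, and executor sizes are placeholders, not recommendations, and cloud credentials are assumed to come from the environment or an instance profile:

# Hypothetical production-style session: object-storage warehouse + larger cluster
spark = SparkSession.builder \
    .appName("CampaignAnalysis") \
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.2,"
            "org.apache.hadoop:hadoop-aws:3.3.4") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.demo.type", "hadoop") \
    .config("spark.sql.catalog.demo.warehouse", "s3a://your-bucket/iceberg/warehouse") \
    .config("spark.executor.instances", "10") \
    .config("spark.executor.memory", "8g") \
    .getOrCreate()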
Need help? Check the troubleshooting guide in the next section...