A HotTechStack Project

The Data Detective Challenge: A Real-World Data Quest

Real-World Scenario vs Demo

While this story mirrors real production challenges, our demo will use smaller datasets (10-100MB) to focus on learning the technology stack. The patterns and code remain the same - just scale up the resources for production!

The Urgent Call 🚨

It's 9:43 AM when your phone buzzes. Sarah from Marketing is calling:

"Alex! Our 'Summer Splash' campaign just went viral! We're seeing 3x normal user activity, and the CEO wants to know user engagement patterns by 5 PM. Our old Excel sheets can't handle this. Can you help?"

You smile, knowing you've been preparing for this moment with your new data stack.

Your Mission 🎯

You have 7 hours to:

  1. Process the incoming campaign data
  2. Analyze user engagement patterns
  3. Create an executive dashboard
  4. Present actionable insights

Let's dive in!

Step 1: Setting Up Your Workspace 🔧

First, access your JupyterHub instance:

# Access JupyterHub (replace with your actual domain)
https://jupyter.yourdomain.com

# Login with SSO credentials
username: alex.data
password: ********

Create a new notebook with PySpark:

# Initialize Environment
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import pandas as pd
import matplotlib.pyplot as plt

# Create Spark session with Iceberg support
spark = SparkSession.builder \
    .appName("CampaignAnalysis") \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.2") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.demo.type", "hadoop") \
    .config("spark.sql.catalog.demo.warehouse", "/opt/iceberg/warehouse") \
    .getOrCreate()
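
Before moving on, it is worth a quick sanity check that the demo catalog responds (assuming the warehouse path above is writable on your cluster):

# Confirm the Iceberg catalog is reachable (empty output is fine on a fresh warehouse)
spark.sql("SHOW NAMESPACES IN demo").show()

# The Spark version should line up with the iceberg-spark-runtime-3.4 package above
print(spark.version)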

Step 2: Creating the Data Structure 📊

Set up Iceberg tables to capture the streaming campaign data:

# Create database
spark.sql("CREATE DATABASE IF NOT EXISTS demo.campaign")

# Create user_events table
spark.sql("""
CREATE TABLE IF NOT EXISTS demo.campaign.user_events (
    event_id STRING,
    user_id STRING,
    event_timestamp TIMESTAMP,
    event_type STRING,
    page_url STRING,
    session_id STRING,
    campaign_id STRING,
    user_agent STRING,
    geo_location STRING
) USING iceberg
PARTITIONED BY (hours(event_timestamp))
""")

# Create user_profiles table
spark.sql("""
CREATE TABLE IF NOT EXISTS demo.campaign.user_profiles (
    user_id STRING,
    signup_date DATE,
    age_group STRING,
    gender STRING,
    country STRING,
    interests ARRAY<STRING>
) USING iceberg
""")

Step 3: Processing Incoming Data 🔄

Set up a streaming job to process incoming campaign data:

# Read streaming data
# The "rate" source simulates 100 events/second for the demo and emits only
# `timestamp` and `value` columns; in production, replace it with Kafka/Kinesis.
stream_df = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", 100) \
    .load()

# Transform stream data
from pyspark.sql.functions import expr, from_json, col
from pyspark.sql.types import *

# Define schema for incoming JSON
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    # ... other fields
])
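
The rate source never actually exercises this schema; with a real Kafka source it would be applied to the JSON payload roughly like this (broker and topic names are placeholders, not part of the demo):

# Sketch only: how the schema would parse JSON events from a Kafka topic
kafka_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "campaign-events") \
    .load()

parsed_events = kafka_stream \
    .select(from_json(col("value").cast("string"), event_schema).alias("event")) \
    .select("event.*")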

# Process and write to Iceberg
def process_batch(batch_df, batch_id):
    # Enrich data
    # Note: the rate source's `timestamp`/`value` columns do not match the
    # user_events schema; map them (or a real source's fields) onto the table
    # columns before appending.
    enriched_df = batch_df.withColumn("event_timestamp", current_timestamp()) \
        .withColumn("campaign_id", lit("SUMMER_SPLASH_2024"))

    # Write to Iceberg table
    enriched_df.write \
        .format("iceberg") \
        .mode("append") \
        .save("demo.campaign.user_events")

# Start the streaming job and keep a handle on it
query = stream_df.writeStream \
    .foreachBatch(process_batch) \
    .outputMode("append") \
    .start()
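
While the job runs, the query handle captured above lets you watch its progress from the same notebook:

# Inspect the running stream
print(query.status)        # current state of the query (message, data availability, trigger)
print(query.lastProgress)  # metrics for the most recent micro-batch (None until one completes)

# List all streams active on this session
for q in spark.streams.active:
    print(q.id, q.isActive)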

Step 4: Real-time Analysis 📈

Create analysis queries to understand user engagement:

# Analyze engagement patterns
engagement_df = spark.sql("""
    SELECT
        date_trunc('hour', event_timestamp) AS hour,
        event_type,
        count(*) AS event_count,
        count(DISTINCT user_id) AS unique_users
    FROM demo.campaign.user_events
    WHERE campaign_id = 'SUMMER_SPLASH_2024'
    GROUP BY 1, 2
    ORDER BY 1, 2
""")

# Save results for visualization
engagement_df.write \
    .format("iceberg") \
    .mode("overwrite") \
    .saveAsTable("demo.campaign.hourly_metrics")
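
Since pandas and matplotlib were already imported in Step 1, you can also eyeball the numbers before building dashboards. A rough sketch, fine here only because the aggregated result is small enough for toPandas():

# Pull the small aggregate down to the driver and plot it locally
hourly_pd = engagement_df.toPandas()

pivoted = hourly_pd.pivot(index="hour", columns="event_type", values="event_count")
pivoted.plot(kind="line", figsize=(10, 4), title="Events per hour by type")
plt.xlabel("Hour")
plt.ylabel("Event count")
plt.tight_layout()
plt.show()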

Step 5: Creating Dashboards in Superset 📊

Now, let's visualize our findings in Superset:

  1. Access Superset (http://superset.yourdomain.com)
  2. Add Iceberg as a database connection:
# Connection parameters for the SQL endpoint that serves the Iceberg tables
# (e.g. a Thrift/SQL gateway in front of the demo catalog; see the SQLAlchemy
# URI sketch after the example query below)
demo_iceberg = {
    'engine': 'iceberg',
    'host': 'localhost',
    'schema': 'demo.campaign',
    'port': 10000
}
  3. Create these essential charts:
    • Hourly User Engagement
    • User Demographics
    • Campaign Conversion Funnel
    • Geographic Distribution
-- Example Superset Query for Engagement Chart
SELECT
    hour,
    event_type,
    event_count,
    unique_users,
    ROUND(event_count * 100.0 / LAG(event_count)
          OVER (PARTITION BY event_type ORDER BY hour), 2) AS growth_rate
FROM demo.campaign.hourly_metrics
ORDER BY hour DESC
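
For reference, Superset's "Add Database" form ultimately takes a SQLAlchemy URI rather than a parameter dict. With a Spark Thrift Server exposing the demo catalog on port 10000 and the PyHive driver installed, it would look roughly like this (the hostname and database name are assumptions, not part of the demo setup):

# Hypothetical SQLAlchemy URI for Superset, assuming a Spark Thrift Server
# on localhost:10000 backed by the same Iceberg catalog (requires PyHive)
SQLALCHEMY_URI = "hive://localhost:10000/campaign"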

Step 6: The Big Reveal 🎉

It's 4:30 PM. Your dashboard is ready, showing:

  • Peak engagement times (2-3 PM)
  • Most active user segments
  • Campaign conversion rates
  • Geographic hotspots

Results & Impact 🌟

Your analysis revealed:

  • 327% increase in user engagement
  • 42% higher conversion rate
  • Key demographic: Urban professionals, 25-34
  • Most effective campaign touchpoint: Mobile app

Technical Architecture

Development Notes 📝

For this demo:

  • Data volume: ~100MB/hour
  • Spark cluster: 2-3 worker nodes
  • Iceberg tables: Local storage
  • Retention period: 7 days

Scale for production (a rough config sketch follows this list):

  • Increase Spark workers
  • Move to cloud storage
  • Add monitoring
  • Implement backup strategy
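
As an illustration of those points, scaling the Step 1 session mostly means swapping the local warehouse for cloud storage and sizing the cluster explicitly. The bucket name, executor counts, and sizes below are placeholders, not recommendations:

from pyspark.sql import SparkSession

# Production-leaning sketch (bucket, executor counts, and sizes are placeholders;
# S3 access also requires the hadoop-aws/S3A jars and credentials to be configured)
spark = SparkSession.builder \
    .appName("CampaignAnalysis") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.demo.type", "hadoop") \
    .config("spark.sql.catalog.demo.warehouse", "s3a://your-bucket/iceberg/warehouse") \
    .config("spark.executor.instances", "10") \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.cores", "4") \
    .getOrCreate()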

Need help? Check the troubleshooting guide in the next section...