Databricks: Some Notes

Big Data
Machine Learning
Databricks
Published

March 11, 2025

Introduction to Databricks

Databricks is a unified data analytics platform built on Apache Spark, designed to simplify big data and machine learning workflows. It provides a collaborative, cloud-based environment for data engineering, data science, and analytics.

Key Concepts

  • Cloud-Based Platform: Runs on AWS, Azure, and Google Cloud, offering fully managed Spark clusters with autoscaling and optimized performance.
  • Apache Spark as the Core Engine: Utilizes Apache Spark for large-scale data processing, ETL, and machine learning.
  • Collaborative Notebooks: Supports Python, Scala, SQL, and R in interactive notebooks for collaboration.
  • Delta Lake for Data Reliability: Implements ACID transactions, schema enforcement, and versioning (see the sketch after this list).
  • MLflow for Machine Learning Lifecycle: Manages ML experiments, models, and deployments.
  • Data Engineering & Streaming: Supports batch and real-time streaming workloads.
  • Scalability & Performance Optimization: Features the Photon execution engine, adaptive query execution, and caching.
  • Security & Compliance: Includes role-based access control, data encryption, and cloud-native security integration.
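As a quick illustration of the Delta Lake point, the snippet below writes a small DataFrame as a Delta table, appends to it, and then reads back the first version via time travel. It is a minimal sketch that assumes a running cluster; the table path /tmp/demo_delta and the toy data are made up for the example.

# Write a small DataFrame as a Delta table (the path is a made-up example)
events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])
events.write.format("delta").mode("overwrite").save("/tmp/demo_delta")

# Append more rows; Delta records every write as a new table version
spark.createDataFrame([(3, "purchase")], ["id", "event"]) \
    .write.format("delta").mode("append").save("/tmp/demo_delta")

# Read the current table, then time-travel back to the first version
spark.read.format("delta").load("/tmp/demo_delta").show()
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo_delta").show()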

Common Use Cases

  • Big Data Processing: ETL pipelines, data lakes, and high-performance analytics.
  • Machine Learning & AI: Model training, feature engineering, and MLOps.
  • BI & Analytics: Interactive querying and reporting.
  • Real-Time Data Processing: Streaming analytics and event-driven architectures.

Setting Up Databricks

Databricks Community Edition offers a free environment for small-scale workloads.

Signing Up

  1. Visit Databricks Community Edition.
  2. Sign up with an email and verify the account.
  3. Access the Databricks workspace.

Creating a Cluster

  1. Click Compute → Create Cluster.
  2. Name the cluster (e.g., my-cluster).
  3. Choose a runtime (e.g., Databricks Runtime 12.0).
  4. Click Create Cluster and wait for it to start.

Creating a Notebook

  1. Go to Workspace → Create → Notebook.
  2. Choose a language (Python, Scala, SQL, or R).
  3. Name the notebook (e.g., my-first-notebook).

Loading and Processing Data

Loading a Sample Dataset

# Load sample dataset (airline delays)
df = spark.read.csv("/databricks-datasets/airlines/part-00000", header=True, inferSchema=True)

# Show first 5 rows
df.show(5)

Basic Data Operations

# Check schema
df.printSchema()

# Count total rows
df.count()

# Filter for flights from New York (JFK)
df.filter(df.Origin == "JFK").show(5)
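Aggregations use the same DataFrame API. The example below computes the average departure delay per origin airport; the cast is a precaution in case the column was inferred as a string.

from pyspark.sql.functions import avg, col

# Average departure delay per origin airport, worst first
(df.withColumn("DepDelay", col("DepDelay").cast("double"))
   .groupBy("Origin")
   .agg(avg("DepDelay").alias("avg_dep_delay"))
   .orderBy(col("avg_dep_delay").desc())
   .show(5))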

Machine Learning with MLlib

Preparing the Data

from pyspark.sql.functions import col, when
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Cast the feature and label columns to numbers ("NA" strings become nulls) and drop missing rows
cols = ["DepDelay", "Distance", "ArrDelay"]
df_clean = df.select(*[col(c).cast("double") for c in cols]).na.drop()

# Logistic regression needs a binary label: flag flights arriving more than 15 minutes late
df_clean = df_clean.withColumn("label", when(col("ArrDelay") > 15, 1.0).otherwise(0.0))

# Assemble the feature columns into a single vector
assembler = VectorAssembler(inputCols=["DepDelay", "Distance"], outputCol="features")
df_ml = assembler.transform(df_clean).select("features", "label")

Training a Logistic Regression Model

# Split data into train and test sets
train, test = df_ml.randomSplit([0.8, 0.2], seed=42)

# Train the model
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train)

# Evaluate on test set
predictions = model.transform(test)
predictions.select("features", "ArrDelay", "prediction").show(5)

Use Case: Real-Time Fraud Detection

Scenario

A bank needs to detect fraudulent credit card transactions in real time.

Solution with Databricks

  1. Ingest real-time transactions with Spark Structured Streaming (see the sketch after these steps).
  2. Feature engineering on transaction amount, location, and user history.
  3. Train a machine learning model using historical fraud data.
  4. Deploy the model to classify new transactions as fraud or legitimate in real time.
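A minimal Structured Streaming sketch of steps 1 and 4 might look like the following. The schema, the landing and checkpoint paths, and fraud_model (a Spark ML pipeline assumed to have been trained on historical fraud data, including feature assembly) are illustrative assumptions, not a production pipeline.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Hypothetical schema for incoming card transactions
schema = StructType([
    StructField("transaction_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("merchant_country", StringType()),
    StructField("event_time", TimestampType()),
])

# Step 1: ingest transaction files as they arrive (a Kafka source works similarly)
transactions = spark.readStream.schema(schema).json("/mnt/streaming/transactions/")

# Step 4: score each micro-batch with the pre-trained model and write the flags to a Delta table
scored = fraud_model.transform(transactions)  # fraud_model: fitted pipeline, assumed trained offline
query = (scored.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/streaming/checkpoints/fraud")
    .start("/mnt/streaming/fraud_scores/"))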

Databricks provides the scalability and performance needed for real-time fraud detection, making it a practical choice for financial institutions and similar applications.