End-to-End Customer Orders

This is the simplest complete SparkPlusPlus use case. It shows how a developer can move from YAML config to a curated Delta output using SparkETLApp plus inputs and outputs.

Goal

read customers and orders as input datasets
join them into customer_orders
write customer_orders in Delta format
keep Spark session settings in sparkConfig

Why This Page Exists

Use this page when you want the shortest realistic example of how SparkPlusPlus is supposed to feel in day-to-day development:

inputs and outputs declared once in YAML
app code focused on business logic
output behavior configured without changing Scala code

YAML

inputs:
  - name: customers
    path: s3://lakehouse/raw/customers
    format: parquet

  - name: orders
    path: s3://lakehouse/raw/orders
    format: parquet

outputs:
  - name: customer_orders
    path: s3://lakehouse/silver/customer_orders
    format: delta
    mode: overwrite
    partitionBy:
      - order_date
    repartition: 200

sparkConfig:
  spark.sql.shuffle.partitions: "200"
  spark.sql.session.timeZone: UTC
  spark.databricks.delta.schema.autoMerge.enabled: "true"

App

import io.github.sparkplusplus._
import io.github.sparkplusplus.app.{AppContext, SparkETLApp, SparkApp}
import io.github.sparkplusplus.io.{InputDatasetConfig, OutputDatasetConfig}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lower, to_date, trim}

final case class CustomerOrdersConfig(
  inputs: Seq[InputDatasetConfig],
  outputs: Seq[OutputDatasetConfig],
  sparkConfig: Map[String, String] = Map.empty
) extends SparkApp.WithInputDatasets with SparkApp.WithOutputDatasets with SparkApp.HasSparkConfig

object CustomerOrdersApp extends SparkETLApp[CustomerOrdersConfig] {

  override protected def appName: String = "customer-orders-app"

  override protected def configClass: Class[CustomerOrdersConfig] =
    classOf[CustomerOrdersConfig]

  override protected def configureSpark(
    builder: SparkSession.Builder,
    config: CustomerOrdersConfig
  ): SparkSession.Builder = {
    builder.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  }

  override protected def transform(
    ctx: AppContext[CustomerOrdersConfig],
    inputs: Map[String, DataFrame]
  ): Map[String, DataFrame] = {
    val customers = inputs("customers")
      .select(
        col("customer_id"),
        trim(col("customer_name")).alias("customer_name"),
        trim(col("customer_segment")).alias("customer_segment")
      )

    val orders = inputs("orders")
      .select(
        col("order_id"),
        col("customer_id"),
        lower(trim(col("order_status"))).alias("order_status"),
        col("order_total"),
        to_date(col("created_at")).alias("order_date")
      )
      .dedup("order_id")

    val customerOrders = orders
      .join(customers, Seq("customer_id"), "left")
      .select(
        col("order_id"),
        col("customer_id"),
        col("customer_name"),
        col("customer_segment"),
        col("order_status"),
        col("order_total"),
        col("order_date")
      )

    Map("customer_orders" -> customerOrders)
  }
}

Run It

spark-submit \
  --class example.CustomerOrdersApp \
  your-app.jar \
  --config conf/customer-orders.yaml

End-to-End Flow

SparkPlusPlus loads the YAML config.
inputs and outputs are validated before Spark starts.
sparkConfig entries are applied to the Spark session.
SparkETLApp.extract() reads all configured inputs.
transform(...) joins and curates the records.
SparkETLApp.load() writes the returned customer_orders DataFrame.

What the Developer Gets

input and output paths live in YAML, not in code
output behavior such as mode, partitionBy, and repartition is also in YAML
the app code only focuses on Spark transforms
SparkETLApp removes most IO boilerplate
the same app can move between environments by changing only YAML

Result

The output dataset is written to:

s3://lakehouse/silver/customer_orders

Typical columns in the final Delta table:

order_id
customer_id
customer_name
customer_segment
order_status
order_total
order_date

This is the recommended starting point before moving to the more advanced customer-orders variant.

Goal​

Why This Page Exists​

YAML​

App​

Run It​

End-to-End Flow​

What the Developer Gets​

Result​