End-to-End Customer Orders
This is the simplest complete SparkPlusPlus use case. It shows how a developer moves from a YAML config to a curated Delta output using `SparkETLApp` together with declared inputs and outputs.
Goal
- read `customers` and `orders` as input datasets
- join them into `customer_orders`
- write `customer_orders` in Delta format
- keep Spark session settings in `sparkConfig`
Why This Page Exists
Use this page when you want the shortest realistic example of how SparkPlusPlus is supposed to feel in day-to-day development:
- inputs and outputs declared once in YAML
- app code focused on business logic
- output behavior configured without changing Scala code
YAML
```yaml
inputs:
  - name: customers
    path: s3://lakehouse/raw/customers
    format: parquet
  - name: orders
    path: s3://lakehouse/raw/orders
    format: parquet

outputs:
  - name: customer_orders
    path: s3://lakehouse/silver/customer_orders
    format: delta
    mode: overwrite
    partitionBy:
      - order_date
    repartition: 200

sparkConfig:
  spark.sql.shuffle.partitions: "200"
  spark.sql.session.timeZone: UTC
  spark.databricks.delta.schema.autoMerge.enabled: "true"
```
App
```scala
import io.github.sparkplusplus._
import io.github.sparkplusplus.app.{AppContext, SparkETLApp, SparkApp}
import io.github.sparkplusplus.io.{InputDatasetConfig, OutputDatasetConfig}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lower, to_date, trim}

final case class CustomerOrdersConfig(
  inputs: Seq[InputDatasetConfig],
  outputs: Seq[OutputDatasetConfig],
  sparkConfig: Map[String, String] = Map.empty
) extends SparkApp.WithInputDatasets
  with SparkApp.WithOutputDatasets
  with SparkApp.HasSparkConfig

object CustomerOrdersApp extends SparkETLApp[CustomerOrdersConfig] {

  override protected def appName: String = "customer-orders-app"

  override protected def configClass: Class[CustomerOrdersConfig] =
    classOf[CustomerOrdersConfig]

  override protected def configureSpark(
    builder: SparkSession.Builder,
    config: CustomerOrdersConfig
  ): SparkSession.Builder =
    builder.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")

  override protected def transform(
    ctx: AppContext[CustomerOrdersConfig],
    inputs: Map[String, DataFrame]
  ): Map[String, DataFrame] = {
    // Normalize customer attributes before joining.
    val customers = inputs("customers")
      .select(
        col("customer_id"),
        trim(col("customer_name")).alias("customer_name"),
        trim(col("customer_segment")).alias("customer_segment")
      )

    // Clean order fields and keep one row per order_id.
    val orders = inputs("orders")
      .select(
        col("order_id"),
        col("customer_id"),
        lower(trim(col("order_status"))).alias("order_status"),
        col("order_total"),
        to_date(col("created_at")).alias("order_date")
      )
      .dedup("order_id")

    // Left join so orders without a matching customer record are kept.
    val customerOrders = orders
      .join(customers, Seq("customer_id"), "left")
      .select(
        col("order_id"),
        col("customer_id"),
        col("customer_name"),
        col("customer_segment"),
        col("order_status"),
        col("order_total"),
        col("order_date")
      )

    Map("customer_orders" -> customerOrders)
  }
}
```
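The `dedup("order_id")` call above comes from SparkPlusPlus's DataFrame extensions. Assuming it keeps a single row per key (the same contract as Spark's built-in `dropDuplicates`), a plain-Spark stand-in would look like this sketch; `dedupByKey` is a hypothetical name, not part of either API:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical stand-in for the SparkPlusPlus dedup helper, assuming it keeps
// one (arbitrary) row per key -- the semantics of Spark's dropDuplicates.
def dedupByKey(df: DataFrame, key: String): DataFrame =
  df.dropDuplicates(key)
```

Note that `dropDuplicates` makes no guarantee about *which* duplicate survives; if the helper needs a deterministic pick (e.g. latest row), a window with `row_number` would be required instead.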
Run It
```shell
spark-submit \
  --class example.CustomerOrdersApp \
  your-app.jar \
  --config conf/customer-orders.yaml
```
End-to-End Flow
- SparkPlusPlus loads the YAML config.
- `inputs` and `outputs` are validated before Spark starts.
- `sparkConfig` entries are applied to the Spark session.
- `SparkETLApp.extract()` reads all configured inputs.
- `transform(...)` joins and curates the records.
- `SparkETLApp.load()` writes the returned `customer_orders` DataFrame.
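The lifecycle above can be sketched in plain Spark. This is an illustration only, not the actual `SparkETLApp` implementation; the `In`/`Out` case classes and `runEtl` are hypothetical names that mirror the YAML fields:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Illustrative extract -> transform -> load loop (not SparkPlusPlus source).
case class In(name: String, path: String, format: String)
case class Out(name: String, path: String, format: String, mode: String,
               partitionBy: Seq[String], repartition: Option[Int])

def runEtl(
    spark: SparkSession,
    ins: Seq[In],
    outs: Seq[Out],
    transform: Map[String, DataFrame] => Map[String, DataFrame]
): Unit = {
  // extract(): one DataFrame per configured input, keyed by name
  val extracted = ins.map(i => i.name -> spark.read.format(i.format).load(i.path)).toMap
  // transform(): business logic supplied by the app
  val results = transform(extracted)
  // load(): write each declared output with its configured behavior
  outs.foreach { o =>
    val df     = results(o.name)
    val shaped = o.repartition.fold(df)(n => df.repartition(n))
    shaped.write
      .format(o.format)
      .mode(o.mode)
      .partitionBy(o.partitionBy: _*)
      .save(o.path)
  }
}
```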
What the Developer Gets
- input and output paths live in YAML, not in code
- output behavior such as `mode`, `partitionBy`, and `repartition` is also in YAML
- the app code focuses only on Spark transforms
- `SparkETLApp` removes most IO boilerplate
- the same app can move between environments by changing only YAML
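The environment-portability point can be exercised with a second YAML file passed to the same jar. The dev bucket names below are illustrative, not real paths:

```yaml
# conf/customer-orders-dev.yaml -- same jar, different environment
# (bucket names here are hypothetical examples)
inputs:
  - name: customers
    path: s3://lakehouse-dev/raw/customers
    format: parquet
  - name: orders
    path: s3://lakehouse-dev/raw/orders
    format: parquet

outputs:
  - name: customer_orders
    path: s3://lakehouse-dev/silver/customer_orders
    format: delta
    mode: overwrite
```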
Result
The output dataset is written to:
`s3://lakehouse/silver/customer_orders`
Typical columns in the final Delta table:
- `order_id`
- `customer_id`
- `customer_name`
- `customer_segment`
- `order_status`
- `order_total`
- `order_date`
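To spot-check the result, the table can be read back with standard Spark Delta APIs, no SparkPlusPlus required. This assumes an active `SparkSession` named `spark` with the Delta extension configured:

```scala
// Read the curated output back and inspect schema and a few rows.
val curated = spark.read
  .format("delta")
  .load("s3://lakehouse/silver/customer_orders")

curated.printSchema()
curated.show(5, truncate = false)
```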
This is the recommended starting point before moving to the more advanced customer-orders variant.