Getting Started
SparkPlusPlus is designed for teams that want a light framework on top of Apache Spark without introducing a full orchestration layer.
What You Get
- SparkETLApp[C] as the fastest application entrypoint for common ETL jobs
- SparkApp[C] when you need a fully custom runtime flow
- AppContext[C] carrying SparkSession, typed config, CLI passthrough args, and logger
- YAML config decoding into Scala case classes
- simple IO through ctx.readInput(...) and ctx.writeOutput(...)
- DataFrameUtils and implicit DataFrame extensions
Typical Flow
- Define a Scala case class for your app config.
- Extend SparkETLApp[YourConfig].
- Point the app at a YAML file with --config.
- Implement business logic in transform(ctx, inputs).
First Real Use Case
A common SparkPlusPlus job looks like this:
- input datasets: customers and orders
- output table or files: customer_orders
- target format: Delta
- goal: join and curate datasets into a cleaner analytics table
If you want a fuller example, see Use Case Examples.
The GitHub repository also includes a runnable sample project at samples/customer-orders that mirrors this use case as a standalone Maven build.
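The join-and-curate goal described above can be sketched in plain Spark, independent of the framework. This is only an illustration: the column names (customer_id, name, order_id, amount), the object name CustomerOrders, and the choice of a left join with per-customer aggregates are assumptions, not taken from the sample project.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, sum}

object CustomerOrders {
  // Assumed schemas (hypothetical, for illustration only):
  //   customers(customer_id, name)
  //   orders(order_id, customer_id, amount)
  def curate(customers: DataFrame, orders: DataFrame): DataFrame = {
    // Collapse orders to one row per customer.
    val perCustomer = orders
      .groupBy(col("customer_id"))
      .agg(
        count(col("order_id")).as("order_count"),
        sum(col("amount")).as("total_amount")
      )

    // Left join keeps customers that have no orders; their aggregates are null.
    customers.join(perCustomer, Seq("customer_id"), "left")
  }
}
```

Inside a SparkETLApp job, a function like this would presumably be called from transform, with the two DataFrames looked up from the inputs map and the result returned under the customer_orders key.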
Minimal Example
import io.github.sparkplusplus.app.{AppContext, SparkApp, SparkETLApp}
import io.github.sparkplusplus.io.{InputDatasetConfig, OutputDatasetConfig}
import org.apache.spark.sql.DataFrame

final case class ExampleConfig(
  inputs: Seq[InputDatasetConfig],
  outputs: Seq[OutputDatasetConfig]
) extends SparkApp.WithInputDatasets with SparkApp.WithOutputDatasets

object ExampleJob extends SparkETLApp[ExampleConfig] {
  override protected def appName: String = "example-job"

  override protected def configClass: Class[ExampleConfig] = classOf[ExampleConfig]

  override protected def transform(
    ctx: AppContext[ExampleConfig],
    inputs: Map[String, DataFrame]
  ): Map[String, DataFrame] = {
    // Pass input_orders through unchanged under the name output_orders.
    Map("output_orders" -> inputs("input_orders"))
  }
}
Matching YAML:
inputs:
  - name: input_orders
    path: s3://bucket/raw/orders
    format: parquet
outputs:
  - name: output_orders
    path: s3://bucket/curated/orders
    format: parquet
    mode: overwrite
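
To make the link between the YAML and the Scala side concrete, the sketch below shows the kind of value a decoder would be expected to produce for this file. The case classes here (InputDataset, OutputDataset, DecodedConfig) are hypothetical stand-ins whose fields simply mirror the YAML keys; they are not the library's InputDatasetConfig and OutputDatasetConfig, whose actual fields may differ.

```scala
// Hypothetical stand-ins for illustration only; field names are copied from
// the YAML keys above, not from the SparkPlusPlus source.
final case class InputDataset(name: String, path: String, format: String)
final case class OutputDataset(name: String, path: String, format: String, mode: String)
final case class DecodedConfig(inputs: Seq[InputDataset], outputs: Seq[OutputDataset])

object DecodedExample {
  // The value one would expect after decoding the YAML shown above.
  val config: DecodedConfig = DecodedConfig(
    inputs = Seq(
      InputDataset("input_orders", "s3://bucket/raw/orders", "parquet")
    ),
    outputs = Seq(
      OutputDataset("output_orders", "s3://bucket/curated/orders", "parquet", "overwrite")
    )
  )

  // In the minimal example, these names are also the keys used in the
  // Map[String, DataFrame] that transform receives and returns.
  val inputNames: Seq[String] = config.inputs.map(_.name)
  val outputNames: Seq[String] = config.outputs.map(_.name)
}
```

Judging from the minimal example, the name of each dataset is what ties a YAML entry to a key in the transform maps, so a mismatch between the two is a likely first thing to check when a job cannot find its input.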