# YAML Config

SparkPlusPlus expects application config to come from a YAML file passed with `--config`.
## CLI Pattern

```shell
spark-submit \
  --class example.OrdersJob \
  your-app.jar \
  --config conf/orders.yaml \
  --env dev
```
Arguments that are not consumed by SparkPlusPlus remain available through `ctx.args`.
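The pass-through behavior can be sketched in plain Scala. Note that `parseArgs` is a hypothetical helper written for this illustration, and the exact shape of `ctx.args` is an assumption — consult the SparkPlusPlus API for the real type.

```scala
// Hypothetical sketch: build a flag -> value map from the raw arguments
// that SparkPlusPlus did not consume. Flags are assumed to follow the
// "--key value" pattern shown in the CLI example above.
def parseArgs(raw: Array[String]): Map[String, String] =
  raw.sliding(2, 2).collect {
    case Array(flag, value) if flag.startsWith("--") =>
      flag.stripPrefix("--") -> value
  }.toMap
```

Under that assumption, the `--env dev` flag from the CLI example would surface as the entry `"env" -> "dev"`.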
## Example Config

```yaml
inputs:
  - name: customers
    path: s3://lakehouse/raw/customers
    format: parquet
    filter: is_active = true
  - name: orders
    path: s3://lakehouse/raw/orders
    format: parquet

outputs:
  - name: customer_orders
    path: s3://lakehouse/silver/customer_orders
    format: delta
    mode: overwrite
    partitionBy:
      - order_date
    repartition: 200

sparkConfig:
  spark.sql.shuffle.partitions: "200"
  spark.sql.session.timeZone: UTC
  spark.databricks.delta.schema.autoMerge.enabled: "true"
```
## Input and Output Rules

Each input entry must define:

- `name`
- `path`
- `format`

Inputs may additionally define:

- `schemaPath`
- `schemaJson`
- `filter`
- `options`

Each output entry must define:

- `name`
- `path`
- `format`

Outputs may additionally define:

- `mode`
- `partitionBy`
- `repartition`
- `coalesce`
- `options`

SparkPlusPlus validates these combinations before the application starts.

Validation highlights:

- names must be unique within `inputs`
- names must be unique within `outputs`
- names must also be unique across both sections
- `schemaPath` and `schemaJson` cannot be used together
- `repartition` and `coalesce` cannot be used together
- blank filters fail fast
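The rules above can be expressed as a small validation pass. The sketch below is illustrative only: `Dataset` is a simplified stand-in for the framework's input/output models, not SparkPlusPlus's actual types.

```scala
// Simplified model: field names mirror the YAML keys documented above.
final case class Dataset(name: String,
                         schemaPath: Option[String] = None,
                         schemaJson: Option[String] = None,
                         repartition: Option[Int] = None,
                         coalesce: Option[Int] = None)

// Collect every violation rather than stopping at the first, so a bad
// config reports all of its problems in one run.
def validate(inputs: Seq[Dataset], outputs: Seq[Dataset]): Seq[String] = {
  val all = inputs ++ outputs
  // Names must be unique within and across both sections.
  val dupes = all.groupBy(_.name).collect { case (n, ds) if ds.size > 1 => n }
  dupes.map(n => s"duplicate dataset name: $n").toSeq ++
    all.collect {
      case d if d.schemaPath.isDefined && d.schemaJson.isDefined =>
        s"${d.name}: schemaPath and schemaJson cannot be used together"
      case d if d.repartition.isDefined && d.coalesce.isDefined =>
        s"${d.name}: repartition and coalesce cannot be used together"
    }
}
```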
## Config Concepts

- `partitionBy` controls how an output dataset is physically written.
- `repartition` and `coalesce` control how many Spark partitions are present before the write.
- `spark.sql.shuffle.partitions` is a Spark session setting, not an output dataset property.
Keep Spark execution tuning in `sparkConfig`. Keep dataset-specific write behavior in the matching `outputs` entry.
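To make the `partitionBy` / `repartition` distinction concrete, here is a plain-Scala simulation (no Spark dependency) of how the two interact at write time: `partitionBy` picks the directory a row lands in by column value, while the Spark partition count bounds how many files can appear per directory. `Order` and `simulateWrite` are invented for this illustration.

```scala
final case class Order(id: Int, orderDate: String)

// Returns a map of output directory -> number of files written to it.
def simulateWrite(rows: Seq[Order], numPartitions: Int): Map[String, Int] = {
  // repartition(n): spread the rows across n Spark partitions.
  val sparkPartitions: Seq[Seq[Order]] =
    rows.zipWithIndex
      .groupBy { case (_, i) => i % numPartitions }
      .values.map(_.map(_._1)).toSeq
  // partitionBy("order_date"): each Spark partition writes one file per
  // distinct order_date it holds, under an order_date=<value>/ directory.
  val files = for {
    p    <- sparkPartitions
    date <- p.map(_.orderDate).distinct
  } yield s"order_date=$date"
  files.groupBy(identity).map { case (dir, fs) => dir -> fs.size }
}
```

This is why raising `repartition` with a `partitionBy` column present can multiply the file count: every Spark partition may touch every partition directory.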
## Applying Spark Session Config

If your config case class exposes a `sparkConfig: Map[String, String]` field, SparkPlusPlus will automatically apply those entries to the `SparkSession.Builder` before your app-specific `configureSpark(...)` hook runs.
This is useful for settings such as:
- shuffle partitions
- timezone
- Delta-related Spark settings
- serializer and SQL behavior flags
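The mechanism amounts to folding the map into the builder. So the example runs without a Spark dependency, `SessionBuilder` below is a minimal stand-in for `org.apache.spark.sql.SparkSession.Builder`, whose real `config(key, value)` method has the same shape; `applySparkConfig` is an assumption about how SparkPlusPlus wires this up, not its actual code.

```scala
// Stand-in for SparkSession.Builder: accumulates config(key, value) calls.
final case class SessionBuilder(settings: Map[String, String] = Map.empty) {
  def config(key: String, value: String): SessionBuilder =
    copy(settings = settings + (key -> value))
}

// Fold every sparkConfig entry from the YAML into the builder.
def applySparkConfig(builder: SessionBuilder,
                     sparkConfig: Map[String, String]): SessionBuilder =
  sparkConfig.foldLeft(builder) { case (b, (k, v)) => b.config(k, v) }
```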
## Supported Shapes

The loader currently supports:

- case classes
- nested case classes
- lists and sequences
- `Map[String, T]`
- `Option[T]`
- default constructor values
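A model covering each of these shapes, written against the example config above, might look like the following. The class and field names are assumptions inferred from the YAML keys, not SparkPlusPlus's actual model classes.

```scala
// Nested case class with an Option field and a default value.
final case class InputConf(name: String,
                           path: String,
                           format: String,
                           filter: Option[String] = None)

// Seq fields for lists, Option for omittable scalars.
final case class OutputConf(name: String,
                            path: String,
                            format: String,
                            mode: Option[String] = None,
                            partitionBy: Seq[String] = Seq.empty,
                            repartition: Option[Int] = None)

// Top-level config: nested case classes, Seq, and Map[String, String].
final case class AppConf(inputs: Seq[InputConf],
                         outputs: Seq[OutputConf],
                         sparkConfig: Map[String, String] = Map.empty)
```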
## Validation Behavior

- unknown YAML fields fail fast
- missing required fields fail fast
- optional fields map to `None` when omitted
- default parameter values are used when defined in the case class
- `sparkConfig` maps can be supplied in YAML and are added to the Spark session automatically
- duplicate dataset names fail fast
- invalid dataset field combinations fail fast