The gcp:dataproc/gdcSparkApplication:GdcSparkApplication resource, part of the Pulumi GCP provider, defines a Spark workload that runs on a Dataproc on Google Distributed Cloud (GDC) cluster: the application type, where its code lives, and how it is configured at runtime. This guide focuses on three capabilities: Spark job types (JAR-based, PySpark, SparkR, and Spark SQL), dependency management for JARs and Python files, and SQL execution patterns.
Spark applications run on an existing Dataproc GDC service instance and Kubernetes namespace, and reference code in Cloud Storage (GCS) or on local cluster storage. The examples are intentionally small; combine them with your own service instance, namespace, and code artifacts.
Run a Spark job with a JAR and main class
Most Spark workloads start with a JAR containing compiled code and a main class as the entry point. This is the standard approach for Java or Scala applications.
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";
const spark_application = new gcp.dataproc.GdcSparkApplication("spark-application", {
sparkApplicationId: "tf-e2e-spark-app-basic",
serviceinstance: "do-not-delete-dataproc-gdc-instance",
project: "my-project",
location: "us-west2",
namespace: "default",
sparkApplicationConfig: {
mainClass: "org.apache.spark.examples.SparkPi",
jarFileUris: ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
args: ["10000"],
},
});
import pulumi
import pulumi_gcp as gcp
spark_application = gcp.dataproc.GdcSparkApplication("spark-application",
spark_application_id="tf-e2e-spark-app-basic",
serviceinstance="do-not-delete-dataproc-gdc-instance",
project="my-project",
location="us-west2",
namespace="default",
spark_application_config={
"main_class": "org.apache.spark.examples.SparkPi",
"jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
"args": ["10000"],
})
package main
import (
"github.com/pulumi/pulumi-gcp/sdk/v9/go/gcp/dataproc"
"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)
func main() {
pulumi.Run(func(ctx *pulumi.Context) error {
_, err := dataproc.NewGdcSparkApplication(ctx, "spark-application", &dataproc.GdcSparkApplicationArgs{
SparkApplicationId: pulumi.String("tf-e2e-spark-app-basic"),
Serviceinstance: pulumi.String("do-not-delete-dataproc-gdc-instance"),
Project: pulumi.String("my-project"),
Location: pulumi.String("us-west2"),
Namespace: pulumi.String("default"),
SparkApplicationConfig: &dataproc.GdcSparkApplicationSparkApplicationConfigArgs{
MainClass: pulumi.String("org.apache.spark.examples.SparkPi"),
JarFileUris: pulumi.StringArray{
pulumi.String("file:///usr/lib/spark/examples/jars/spark-examples.jar"),
},
Args: pulumi.StringArray{
pulumi.String("10000"),
},
},
})
if err != nil {
return err
}
return nil
})
}
using System.Collections.Generic;
using System.Linq;
using Pulumi;
using Gcp = Pulumi.Gcp;
return await Deployment.RunAsync(() =>
{
var spark_application = new Gcp.Dataproc.GdcSparkApplication("spark-application", new()
{
SparkApplicationId = "tf-e2e-spark-app-basic",
Serviceinstance = "do-not-delete-dataproc-gdc-instance",
Project = "my-project",
Location = "us-west2",
Namespace = "default",
SparkApplicationConfig = new Gcp.Dataproc.Inputs.GdcSparkApplicationSparkApplicationConfigArgs
{
MainClass = "org.apache.spark.examples.SparkPi",
JarFileUris = new[]
{
"file:///usr/lib/spark/examples/jars/spark-examples.jar",
},
Args = new[]
{
"10000",
},
},
});
});
package generated_program;
import com.pulumi.Context;
import com.pulumi.Pulumi;
import com.pulumi.core.Output;
import com.pulumi.gcp.dataproc.GdcSparkApplication;
import com.pulumi.gcp.dataproc.GdcSparkApplicationArgs;
import com.pulumi.gcp.dataproc.inputs.GdcSparkApplicationSparkApplicationConfigArgs;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
public class App {
public static void main(String[] args) {
Pulumi.run(App::stack);
}
public static void stack(Context ctx) {
var spark_application = new GdcSparkApplication("spark-application", GdcSparkApplicationArgs.builder()
.sparkApplicationId("tf-e2e-spark-app-basic")
.serviceinstance("do-not-delete-dataproc-gdc-instance")
.project("my-project")
.location("us-west2")
.namespace("default")
.sparkApplicationConfig(GdcSparkApplicationSparkApplicationConfigArgs.builder()
.mainClass("org.apache.spark.examples.SparkPi")
.jarFileUris("file:///usr/lib/spark/examples/jars/spark-examples.jar")
.args("10000")
.build())
.build());
}
}
resources:
spark-application:
type: gcp:dataproc:GdcSparkApplication
properties:
sparkApplicationId: tf-e2e-spark-app-basic
serviceinstance: do-not-delete-dataproc-gdc-instance
project: my-project
location: us-west2
namespace: default
sparkApplicationConfig:
mainClass: org.apache.spark.examples.SparkPi
jarFileUris:
- file:///usr/lib/spark/examples/jars/spark-examples.jar
args:
- '10000'
The sparkApplicationConfig block defines JAR-based applications. The mainClass property specifies the entry point, jarFileUris lists the JARs to add to the classpath, and args passes command-line arguments to the application. The serviceinstance and namespace properties place the job on a specific Dataproc GDC service instance and Kubernetes namespace.
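After the application is created, you will often want to know where the job wrote its logs and where to monitor it. The sketch below is a minimal variant of the example above that exports a few of the resource's output properties; the names state, outputUri, and monitoringEndpoint are taken from the resource schema as I understand it, so confirm them against the resource reference for your provider version.
import * as gcp from "@pulumi/gcp";
// Same application as above, kept minimal so the exports are the focus.
const app = new gcp.dataproc.GdcSparkApplication("spark-application", {
    sparkApplicationId: "tf-e2e-spark-app-basic",
    serviceinstance: "do-not-delete-dataproc-gdc-instance",
    project: "my-project",
    location: "us-west2",
    namespace: "default",
    sparkApplicationConfig: {
        mainClass: "org.apache.spark.examples.SparkPi",
        jarFileUris: ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        args: ["10000"],
    },
});
// Surface runtime details recorded by the provider after creation.
export const applicationState = app.state;                 // lifecycle state of the application
export const applicationOutputUri = app.outputUri;         // URI of the application's stdout/stderr output
export const monitoringEndpoint = app.monitoringEndpoint;  // URL of the monitoring UI, when available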
Run a PySpark job with Python dependencies
Python-based Spark workloads use PySpark, which requires a main Python file and may need additional modules or JAR dependencies.
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";
const spark_application = new gcp.dataproc.GdcSparkApplication("spark-application", {
sparkApplicationId: "tf-e2e-pyspark-app",
serviceinstance: "do-not-delete-dataproc-gdc-instance",
project: "my-project",
location: "us-west2",
namespace: "default",
displayName: "A Pyspark application for a Terraform create test",
dependencyImages: ["gcr.io/some/image"],
pysparkApplicationConfig: {
mainPythonFileUri: "gs://goog-dataproc-initialization-actions-us-west2/conda/test_conda.py",
jarFileUris: ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
pythonFileUris: ["gs://goog-dataproc-initialization-actions-us-west2/conda/get-sys-exec.py"],
fileUris: ["file://usr/lib/spark/examples/spark-examples.jar"],
archiveUris: ["file://usr/lib/spark/examples/spark-examples.jar"],
args: ["10"],
},
});
import pulumi
import pulumi_gcp as gcp
spark_application = gcp.dataproc.GdcSparkApplication("spark-application",
spark_application_id="tf-e2e-pyspark-app",
serviceinstance="do-not-delete-dataproc-gdc-instance",
project="my-project",
location="us-west2",
namespace="default",
display_name="A Pyspark application for a Terraform create test",
dependency_images=["gcr.io/some/image"],
pyspark_application_config={
"main_python_file_uri": "gs://goog-dataproc-initialization-actions-us-west2/conda/test_conda.py",
"jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
"python_file_uris": ["gs://goog-dataproc-initialization-actions-us-west2/conda/get-sys-exec.py"],
"file_uris": ["file://usr/lib/spark/examples/spark-examples.jar"],
"archive_uris": ["file://usr/lib/spark/examples/spark-examples.jar"],
"args": ["10"],
})
package main
import (
"github.com/pulumi/pulumi-gcp/sdk/v9/go/gcp/dataproc"
"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)
func main() {
pulumi.Run(func(ctx *pulumi.Context) error {
_, err := dataproc.NewGdcSparkApplication(ctx, "spark-application", &dataproc.GdcSparkApplicationArgs{
SparkApplicationId: pulumi.String("tf-e2e-pyspark-app"),
Serviceinstance: pulumi.String("do-not-delete-dataproc-gdc-instance"),
Project: pulumi.String("my-project"),
Location: pulumi.String("us-west2"),
Namespace: pulumi.String("default"),
DisplayName: pulumi.String("A Pyspark application for a Terraform create test"),
DependencyImages: pulumi.StringArray{
pulumi.String("gcr.io/some/image"),
},
PysparkApplicationConfig: &dataproc.GdcSparkApplicationPysparkApplicationConfigArgs{
MainPythonFileUri: pulumi.String("gs://goog-dataproc-initialization-actions-us-west2/conda/test_conda.py"),
JarFileUris: pulumi.StringArray{
pulumi.String("file:///usr/lib/spark/examples/jars/spark-examples.jar"),
},
PythonFileUris: pulumi.StringArray{
pulumi.String("gs://goog-dataproc-initialization-actions-us-west2/conda/get-sys-exec.py"),
},
FileUris: pulumi.StringArray{
pulumi.String("file://usr/lib/spark/examples/spark-examples.jar"),
},
ArchiveUris: pulumi.StringArray{
pulumi.String("file://usr/lib/spark/examples/spark-examples.jar"),
},
Args: pulumi.StringArray{
pulumi.String("10"),
},
},
})
if err != nil {
return err
}
return nil
})
}
using System.Collections.Generic;
using System.Linq;
using Pulumi;
using Gcp = Pulumi.Gcp;
return await Deployment.RunAsync(() =>
{
var spark_application = new Gcp.Dataproc.GdcSparkApplication("spark-application", new()
{
SparkApplicationId = "tf-e2e-pyspark-app",
Serviceinstance = "do-not-delete-dataproc-gdc-instance",
Project = "my-project",
Location = "us-west2",
Namespace = "default",
DisplayName = "A Pyspark application for a Terraform create test",
DependencyImages = new[]
{
"gcr.io/some/image",
},
PysparkApplicationConfig = new Gcp.Dataproc.Inputs.GdcSparkApplicationPysparkApplicationConfigArgs
{
MainPythonFileUri = "gs://goog-dataproc-initialization-actions-us-west2/conda/test_conda.py",
JarFileUris = new[]
{
"file:///usr/lib/spark/examples/jars/spark-examples.jar",
},
PythonFileUris = new[]
{
"gs://goog-dataproc-initialization-actions-us-west2/conda/get-sys-exec.py",
},
FileUris = new[]
{
"file://usr/lib/spark/examples/spark-examples.jar",
},
ArchiveUris = new[]
{
"file://usr/lib/spark/examples/spark-examples.jar",
},
Args = new[]
{
"10",
},
},
});
});
package generated_program;
import com.pulumi.Context;
import com.pulumi.Pulumi;
import com.pulumi.core.Output;
import com.pulumi.gcp.dataproc.GdcSparkApplication;
import com.pulumi.gcp.dataproc.GdcSparkApplicationArgs;
import com.pulumi.gcp.dataproc.inputs.GdcSparkApplicationPysparkApplicationConfigArgs;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
public class App {
public static void main(String[] args) {
Pulumi.run(App::stack);
}
public static void stack(Context ctx) {
var spark_application = new GdcSparkApplication("spark-application", GdcSparkApplicationArgs.builder()
.sparkApplicationId("tf-e2e-pyspark-app")
.serviceinstance("do-not-delete-dataproc-gdc-instance")
.project("my-project")
.location("us-west2")
.namespace("default")
.displayName("A Pyspark application for a Terraform create test")
.dependencyImages("gcr.io/some/image")
.pysparkApplicationConfig(GdcSparkApplicationPysparkApplicationConfigArgs.builder()
.mainPythonFileUri("gs://goog-dataproc-initialization-actions-us-west2/conda/test_conda.py")
.jarFileUris("file:///usr/lib/spark/examples/jars/spark-examples.jar")
.pythonFileUris("gs://goog-dataproc-initialization-actions-us-west2/conda/get-sys-exec.py")
.fileUris("file://usr/lib/spark/examples/spark-examples.jar")
.archiveUris("file://usr/lib/spark/examples/spark-examples.jar")
.args("10")
.build())
.build());
}
}
resources:
spark-application:
type: gcp:dataproc:GdcSparkApplication
properties:
sparkApplicationId: tf-e2e-pyspark-app
serviceinstance: do-not-delete-dataproc-gdc-instance
project: my-project
location: us-west2
namespace: default
displayName: A Pyspark application for a Terraform create test
dependencyImages:
- gcr.io/some/image
pysparkApplicationConfig:
mainPythonFileUri: gs://goog-dataproc-initialization-actions-us-west2/conda/test_conda.py
jarFileUris:
- file:///usr/lib/spark/examples/jars/spark-examples.jar
pythonFileUris:
- gs://goog-dataproc-initialization-actions-us-west2/conda/get-sys-exec.py
fileUris:
- file://usr/lib/spark/examples/spark-examples.jar
archiveUris:
- file://usr/lib/spark/examples/spark-examples.jar
args:
- '10'
The pysparkApplicationConfig block defines Python applications. The mainPythonFileUri points to the entry script, pythonFileUris lists additional Python files to distribute with the job, and the top-level dependencyImages property names container images that provide extra file dependencies; files are copied sequentially from each listed image.
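In practice, projects, buckets, and service instance names usually vary per environment rather than being hardcoded. The sketch below parameterizes the PySpark example with Pulumi stack configuration; the config keys project, codeBucket, and serviceInstance and the gs:// object paths are hypothetical placeholders.
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";
// Hypothetical stack config keys; set them with `pulumi config set <key> <value>`.
const cfg = new pulumi.Config();
const project = cfg.require("project");
const codeBucket = cfg.require("codeBucket");            // GCS bucket that holds the PySpark code
const serviceInstance = cfg.require("serviceInstance");  // existing Dataproc GDC service instance
const pysparkApp = new gcp.dataproc.GdcSparkApplication("pyspark-app", {
    sparkApplicationId: "pyspark-app",
    serviceinstance: serviceInstance,
    project: project,
    location: "us-west2",
    namespace: "default",
    pysparkApplicationConfig: {
        mainPythonFileUri: `gs://${codeBucket}/jobs/main.py`,    // placeholder object path
        pythonFileUris: [`gs://${codeBucket}/jobs/helpers.py`],  // placeholder object path
        args: ["10"],
    },
});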
Run a SparkR job with R scripts
Data scientists working in R can run Spark workloads using SparkR, which executes R scripts on the cluster.
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";
const spark_application = new gcp.dataproc.GdcSparkApplication("spark-application", {
sparkApplicationId: "tf-e2e-sparkr-app",
serviceinstance: "do-not-delete-dataproc-gdc-instance",
project: "my-project",
location: "us-west2",
namespace: "default",
displayName: "A SparkR application for a Terraform create test",
sparkRApplicationConfig: {
mainRFileUri: "gs://some-bucket/something.R",
fileUris: ["file://usr/lib/spark/examples/spark-examples.jar"],
archiveUris: ["file://usr/lib/spark/examples/spark-examples.jar"],
args: ["10"],
},
});
import pulumi
import pulumi_gcp as gcp
spark_application = gcp.dataproc.GdcSparkApplication("spark-application",
spark_application_id="tf-e2e-sparkr-app",
serviceinstance="do-not-delete-dataproc-gdc-instance",
project="my-project",
location="us-west2",
namespace="default",
display_name="A SparkR application for a Terraform create test",
spark_r_application_config={
"main_r_file_uri": "gs://some-bucket/something.R",
"file_uris": ["file://usr/lib/spark/examples/spark-examples.jar"],
"archive_uris": ["file://usr/lib/spark/examples/spark-examples.jar"],
"args": ["10"],
})
package main
import (
"github.com/pulumi/pulumi-gcp/sdk/v9/go/gcp/dataproc"
"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)
func main() {
pulumi.Run(func(ctx *pulumi.Context) error {
_, err := dataproc.NewGdcSparkApplication(ctx, "spark-application", &dataproc.GdcSparkApplicationArgs{
SparkApplicationId: pulumi.String("tf-e2e-sparkr-app"),
Serviceinstance: pulumi.String("do-not-delete-dataproc-gdc-instance"),
Project: pulumi.String("my-project"),
Location: pulumi.String("us-west2"),
Namespace: pulumi.String("default"),
DisplayName: pulumi.String("A SparkR application for a Terraform create test"),
SparkRApplicationConfig: &dataproc.GdcSparkApplicationSparkRApplicationConfigArgs{
MainRFileUri: pulumi.String("gs://some-bucket/something.R"),
FileUris: pulumi.StringArray{
pulumi.String("file://usr/lib/spark/examples/spark-examples.jar"),
},
ArchiveUris: pulumi.StringArray{
pulumi.String("file://usr/lib/spark/examples/spark-examples.jar"),
},
Args: pulumi.StringArray{
pulumi.String("10"),
},
},
})
if err != nil {
return err
}
return nil
})
}
using System.Collections.Generic;
using System.Linq;
using Pulumi;
using Gcp = Pulumi.Gcp;
return await Deployment.RunAsync(() =>
{
var spark_application = new Gcp.Dataproc.GdcSparkApplication("spark-application", new()
{
SparkApplicationId = "tf-e2e-sparkr-app",
Serviceinstance = "do-not-delete-dataproc-gdc-instance",
Project = "my-project",
Location = "us-west2",
Namespace = "default",
DisplayName = "A SparkR application for a Terraform create test",
SparkRApplicationConfig = new Gcp.Dataproc.Inputs.GdcSparkApplicationSparkRApplicationConfigArgs
{
MainRFileUri = "gs://some-bucket/something.R",
FileUris = new[]
{
"file://usr/lib/spark/examples/spark-examples.jar",
},
ArchiveUris = new[]
{
"file://usr/lib/spark/examples/spark-examples.jar",
},
Args = new[]
{
"10",
},
},
});
});
package generated_program;
import com.pulumi.Context;
import com.pulumi.Pulumi;
import com.pulumi.core.Output;
import com.pulumi.gcp.dataproc.GdcSparkApplication;
import com.pulumi.gcp.dataproc.GdcSparkApplicationArgs;
import com.pulumi.gcp.dataproc.inputs.GdcSparkApplicationSparkRApplicationConfigArgs;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
public class App {
public static void main(String[] args) {
Pulumi.run(App::stack);
}
public static void stack(Context ctx) {
var spark_application = new GdcSparkApplication("spark-application", GdcSparkApplicationArgs.builder()
.sparkApplicationId("tf-e2e-sparkr-app")
.serviceinstance("do-not-delete-dataproc-gdc-instance")
.project("my-project")
.location("us-west2")
.namespace("default")
.displayName("A SparkR application for a Terraform create test")
.sparkRApplicationConfig(GdcSparkApplicationSparkRApplicationConfigArgs.builder()
.mainRFileUri("gs://some-bucket/something.R")
.fileUris("file://usr/lib/spark/examples/spark-examples.jar")
.archiveUris("file://usr/lib/spark/examples/spark-examples.jar")
.args("10")
.build())
.build());
}
}
resources:
spark-application:
type: gcp:dataproc:GdcSparkApplication
properties:
sparkApplicationId: tf-e2e-sparkr-app
serviceinstance: do-not-delete-dataproc-gdc-instance
project: my-project
location: us-west2
namespace: default
displayName: A SparkR application for a Terraform create test
sparkRApplicationConfig:
mainRFileUri: gs://some-bucket/something.R
fileUris:
- file://usr/lib/spark/examples/spark-examples.jar
archiveUris:
- file://usr/lib/spark/examples/spark-examples.jar
args:
- '10'
The sparkRApplicationConfig block defines R applications. The mainRFileUri points to the entry script, fileUris and archiveUris provide additional dependencies, and args passes command-line arguments. File URIs can reference local cluster storage (file://) or GCS (gs://).
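Instead of pointing at a script that was uploaded out of band, Pulumi can upload the R file itself and pass the resulting gs:// URI to the application. The sketch below assumes a hypothetical local file ./jobs/analysis.R and an existing bucket named some-bucket.
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";
// Upload the local R script to an existing GCS bucket so the cluster can fetch it.
const script = new gcp.storage.BucketObject("sparkr-script", {
    bucket: "some-bucket",                      // placeholder: an existing bucket
    name: "jobs/analysis.R",
    source: new pulumi.asset.FileAsset("./jobs/analysis.R"),
});
const sparkRApp = new gcp.dataproc.GdcSparkApplication("sparkr-app", {
    sparkApplicationId: "sparkr-app",
    serviceinstance: "do-not-delete-dataproc-gdc-instance",
    project: "my-project",
    location: "us-west2",
    namespace: "default",
    sparkRApplicationConfig: {
        // Build the gs:// URI from the uploaded object so the dependency is explicit in the program.
        mainRFileUri: pulumi.interpolate`gs://${script.bucket}/${script.name}`,
        args: ["10"],
    },
});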
Run SQL queries inline with Spark SQL
Teams running SQL-based analytics can execute queries directly through Spark SQL without writing application code.
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";
const spark_application = new gcp.dataproc.GdcSparkApplication("spark-application", {
sparkApplicationId: "tf-e2e-sparksql-app",
serviceinstance: "do-not-delete-dataproc-gdc-instance",
project: "my-project",
location: "us-west2",
namespace: "default",
displayName: "A SparkSql application for a Terraform create test",
sparkSqlApplicationConfig: {
jarFileUris: ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
queryList: {
queries: ["show tables;"],
},
scriptVariables: {
MY_VAR: "1",
},
},
});
import pulumi
import pulumi_gcp as gcp
spark_application = gcp.dataproc.GdcSparkApplication("spark-application",
spark_application_id="tf-e2e-sparksql-app",
serviceinstance="do-not-delete-dataproc-gdc-instance",
project="my-project",
location="us-west2",
namespace="default",
display_name="A SparkSql application for a Terraform create test",
spark_sql_application_config={
"jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
"query_list": {
"queries": ["show tables;"],
},
"script_variables": {
"MY_VAR": "1",
},
})
package main
import (
"github.com/pulumi/pulumi-gcp/sdk/v9/go/gcp/dataproc"
"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)
func main() {
pulumi.Run(func(ctx *pulumi.Context) error {
_, err := dataproc.NewGdcSparkApplication(ctx, "spark-application", &dataproc.GdcSparkApplicationArgs{
SparkApplicationId: pulumi.String("tf-e2e-sparksql-app"),
Serviceinstance: pulumi.String("do-not-delete-dataproc-gdc-instance"),
Project: pulumi.String("my-project"),
Location: pulumi.String("us-west2"),
Namespace: pulumi.String("default"),
DisplayName: pulumi.String("A SparkSql application for a Terraform create test"),
SparkSqlApplicationConfig: &dataproc.GdcSparkApplicationSparkSqlApplicationConfigArgs{
JarFileUris: pulumi.StringArray{
pulumi.String("file:///usr/lib/spark/examples/jars/spark-examples.jar"),
},
QueryList: &dataproc.GdcSparkApplicationSparkSqlApplicationConfigQueryListArgs{
Queries: pulumi.StringArray{
pulumi.String("show tables;"),
},
},
ScriptVariables: pulumi.StringMap{
"MY_VAR": pulumi.String("1"),
},
},
})
if err != nil {
return err
}
return nil
})
}
using System.Collections.Generic;
using System.Linq;
using Pulumi;
using Gcp = Pulumi.Gcp;
return await Deployment.RunAsync(() =>
{
var spark_application = new Gcp.Dataproc.GdcSparkApplication("spark-application", new()
{
SparkApplicationId = "tf-e2e-sparksql-app",
Serviceinstance = "do-not-delete-dataproc-gdc-instance",
Project = "my-project",
Location = "us-west2",
Namespace = "default",
DisplayName = "A SparkSql application for a Terraform create test",
SparkSqlApplicationConfig = new Gcp.Dataproc.Inputs.GdcSparkApplicationSparkSqlApplicationConfigArgs
{
JarFileUris = new[]
{
"file:///usr/lib/spark/examples/jars/spark-examples.jar",
},
QueryList = new Gcp.Dataproc.Inputs.GdcSparkApplicationSparkSqlApplicationConfigQueryListArgs
{
Queries = new[]
{
"show tables;",
},
},
ScriptVariables =
{
{ "MY_VAR", "1" },
},
},
});
});
package generated_program;
import com.pulumi.Context;
import com.pulumi.Pulumi;
import com.pulumi.core.Output;
import com.pulumi.gcp.dataproc.GdcSparkApplication;
import com.pulumi.gcp.dataproc.GdcSparkApplicationArgs;
import com.pulumi.gcp.dataproc.inputs.GdcSparkApplicationSparkSqlApplicationConfigArgs;
import com.pulumi.gcp.dataproc.inputs.GdcSparkApplicationSparkSqlApplicationConfigQueryListArgs;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
public class App {
public static void main(String[] args) {
Pulumi.run(App::stack);
}
public static void stack(Context ctx) {
var spark_application = new GdcSparkApplication("spark-application", GdcSparkApplicationArgs.builder()
.sparkApplicationId("tf-e2e-sparksql-app")
.serviceinstance("do-not-delete-dataproc-gdc-instance")
.project("my-project")
.location("us-west2")
.namespace("default")
.displayName("A SparkSql application for a Terraform create test")
.sparkSqlApplicationConfig(GdcSparkApplicationSparkSqlApplicationConfigArgs.builder()
.jarFileUris("file:///usr/lib/spark/examples/jars/spark-examples.jar")
.queryList(GdcSparkApplicationSparkSqlApplicationConfigQueryListArgs.builder()
.queries("show tables;")
.build())
.scriptVariables(Map.of("MY_VAR", "1"))
.build())
.build());
}
}
resources:
spark-application:
type: gcp:dataproc:GdcSparkApplication
properties:
sparkApplicationId: tf-e2e-sparksql-app
serviceinstance: do-not-delete-dataproc-gdc-instance
project: my-project
location: us-west2
namespace: default
displayName: A SparkSql application for a Terraform create test
sparkSqlApplicationConfig:
jarFileUris:
- file:///usr/lib/spark/examples/jars/spark-examples.jar
queryList:
queries:
- show tables;
scriptVariables:
MY_VAR: '1'
The sparkSqlApplicationConfig block defines SQL applications. The queryList property contains inline SQL queries, and scriptVariables provides parameterized values that can be referenced in queries. This approach works well for simple queries or testing.
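The script variables are applied before the queries run (equivalent to SET name="value" statements), so they can typically be referenced in the SQL text through Spark SQL's ${name} substitution; verify this behavior against your Spark runtime. The sketch below parameterizes a row limit; the events table is a placeholder.
import * as gcp from "@pulumi/gcp";
// Minimal sketch: LIMIT_ROWS is supplied via scriptVariables and referenced
// in the second query with ${...} substitution. The table name is illustrative.
const sqlApp = new gcp.dataproc.GdcSparkApplication("sparksql-inline", {
    sparkApplicationId: "sparksql-inline",
    serviceinstance: "do-not-delete-dataproc-gdc-instance",
    project: "my-project",
    location: "us-west2",
    namespace: "default",
    sparkSqlApplicationConfig: {
        queryList: {
            queries: [
                "SHOW TABLES;",
                "SELECT * FROM events LIMIT ${LIMIT_ROWS};",
            ],
        },
        scriptVariables: {
            LIMIT_ROWS: "100",
        },
    },
});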
Run SQL queries from a file with Spark SQL
For complex SQL workloads or version-controlled queries, teams store SQL in files rather than inline strings.
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";
const spark_application = new gcp.dataproc.GdcSparkApplication("spark-application", {
sparkApplicationId: "tf-e2e-sparksql-app",
serviceinstance: "do-not-delete-dataproc-gdc-instance",
project: "my-project",
location: "us-west2",
namespace: "default",
displayName: "A SparkSql application for a Terraform create test",
sparkSqlApplicationConfig: {
jarFileUris: ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
queryFileUri: "gs://some-bucket/something.sql",
scriptVariables: {
MY_VAR: "1",
},
},
});
import pulumi
import pulumi_gcp as gcp
spark_application = gcp.dataproc.GdcSparkApplication("spark-application",
spark_application_id="tf-e2e-sparksql-app",
serviceinstance="do-not-delete-dataproc-gdc-instance",
project="my-project",
location="us-west2",
namespace="default",
display_name="A SparkSql application for a Terraform create test",
spark_sql_application_config={
"jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
"query_file_uri": "gs://some-bucket/something.sql",
"script_variables": {
"MY_VAR": "1",
},
})
package main
import (
"github.com/pulumi/pulumi-gcp/sdk/v9/go/gcp/dataproc"
"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)
func main() {
pulumi.Run(func(ctx *pulumi.Context) error {
_, err := dataproc.NewGdcSparkApplication(ctx, "spark-application", &dataproc.GdcSparkApplicationArgs{
SparkApplicationId: pulumi.String("tf-e2e-sparksql-app"),
Serviceinstance: pulumi.String("do-not-delete-dataproc-gdc-instance"),
Project: pulumi.String("my-project"),
Location: pulumi.String("us-west2"),
Namespace: pulumi.String("default"),
DisplayName: pulumi.String("A SparkSql application for a Terraform create test"),
SparkSqlApplicationConfig: &dataproc.GdcSparkApplicationSparkSqlApplicationConfigArgs{
JarFileUris: pulumi.StringArray{
pulumi.String("file:///usr/lib/spark/examples/jars/spark-examples.jar"),
},
QueryFileUri: pulumi.String("gs://some-bucket/something.sql"),
ScriptVariables: pulumi.StringMap{
"MY_VAR": pulumi.String("1"),
},
},
})
if err != nil {
return err
}
return nil
})
}
using System.Collections.Generic;
using System.Linq;
using Pulumi;
using Gcp = Pulumi.Gcp;
return await Deployment.RunAsync(() =>
{
var spark_application = new Gcp.Dataproc.GdcSparkApplication("spark-application", new()
{
SparkApplicationId = "tf-e2e-sparksql-app",
Serviceinstance = "do-not-delete-dataproc-gdc-instance",
Project = "my-project",
Location = "us-west2",
Namespace = "default",
DisplayName = "A SparkSql application for a Terraform create test",
SparkSqlApplicationConfig = new Gcp.Dataproc.Inputs.GdcSparkApplicationSparkSqlApplicationConfigArgs
{
JarFileUris = new[]
{
"file:///usr/lib/spark/examples/jars/spark-examples.jar",
},
QueryFileUri = "gs://some-bucket/something.sql",
ScriptVariables =
{
{ "MY_VAR", "1" },
},
},
});
});
package generated_program;
import com.pulumi.Context;
import com.pulumi.Pulumi;
import com.pulumi.core.Output;
import com.pulumi.gcp.dataproc.GdcSparkApplication;
import com.pulumi.gcp.dataproc.GdcSparkApplicationArgs;
import com.pulumi.gcp.dataproc.inputs.GdcSparkApplicationSparkSqlApplicationConfigArgs;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
public class App {
public static void main(String[] args) {
Pulumi.run(App::stack);
}
public static void stack(Context ctx) {
var spark_application = new GdcSparkApplication("spark-application", GdcSparkApplicationArgs.builder()
.sparkApplicationId("tf-e2e-sparksql-app")
.serviceinstance("do-not-delete-dataproc-gdc-instance")
.project("my-project")
.location("us-west2")
.namespace("default")
.displayName("A SparkSql application for a Terraform create test")
.sparkSqlApplicationConfig(GdcSparkApplicationSparkSqlApplicationConfigArgs.builder()
.jarFileUris("file:///usr/lib/spark/examples/jars/spark-examples.jar")
.queryFileUri("gs://some-bucket/something.sql")
.scriptVariables(Map.of("MY_VAR", "1"))
.build())
.build());
}
}
resources:
spark-application:
type: gcp:dataproc:GdcSparkApplication
properties:
sparkApplicationId: tf-e2e-sparksql-app
serviceinstance: do-not-delete-dataproc-gdc-instance
project: my-project
location: us-west2
namespace: default
displayName: A SparkSql application for a Terraform create test
sparkSqlApplicationConfig:
jarFileUris:
- file:///usr/lib/spark/examples/jars/spark-examples.jar
queryFileUri: gs://some-bucket/something.sql
scriptVariables:
MY_VAR: '1'
The queryFileUri property points to a SQL file in GCS, replacing the inline queryList. This approach supports larger queries and integrates with version control systems. The scriptVariables property works the same way, providing parameterized values for the queries.
Beyond these examples
These snippets focus on specific Spark application features: Spark job types (JAR, PySpark, SparkR, Spark SQL), dependency management (JARs, Python files, container images), and SQL execution (inline queries and file-based). They’re intentionally minimal rather than full data processing pipelines.
The examples reference pre-existing infrastructure such as Dataproc GDC service instances, Kubernetes namespaces on the cluster, GCS buckets for code and data files, and container registries for dependency images. They focus on configuring the application rather than provisioning the cluster infrastructure.
To keep things focused, common application patterns are omitted, including:
- Labels and annotations for organization
- Application environment inheritance (applicationEnvironment)
- Spark properties tuning (properties object)
- Version pinning (version property)
These omissions are intentional: the goal is to illustrate how each Spark application type is wired, not provide drop-in data processing modules. See the GDC Spark Application resource reference for all available configuration options.
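For orientation, the sketch below shows where those omitted options attach on the resource. The values are illustrative placeholders, and applicationEnvironment assumes a GdcApplicationEnvironment named my-app-env already exists; treat it as a starting point rather than a tuned configuration.
import * as gcp from "@pulumi/gcp";
// Minimal sketch of the optional settings omitted from the examples above.
const tunedApp = new gcp.dataproc.GdcSparkApplication("tuned-spark-app", {
    sparkApplicationId: "tuned-spark-app",
    serviceinstance: "do-not-delete-dataproc-gdc-instance",
    project: "my-project",
    location: "us-west2",
    namespace: "default",
    labels: {
        team: "data-platform",               // labels for organization (the only mutable field)
    },
    applicationEnvironment: "my-app-env",    // inherit defaults from an existing GdcApplicationEnvironment
    version: "1.2",                          // pin a runtime version (placeholder value)
    properties: {
        "spark.executor.instances": "2",     // Spark properties tuning (placeholder value)
    },
    sparkApplicationConfig: {
        mainClass: "org.apache.spark.examples.SparkPi",
        jarFileUris: ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
    },
});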
Let's deploy GCP Dataproc Spark Applications
Get started with Pulumi Cloud, then follow our quick setup guide to deploy this infrastructure.
Frequently Asked Questions
Configuration & Application Types
The resource supports four application types: JAR-based applications via sparkApplicationConfig, Python applications via pysparkApplicationConfig, R applications via sparkRApplicationConfig, and SQL applications via sparkSqlApplicationConfig. Only one config type can be specified per application. SQL queries can be supplied inline through queryList.queries or via queryFileUri to reference a SQL file in Google Cloud Storage. Use the applicationEnvironment property to reference an existing GdcApplicationEnvironment resource by name.
Immutability & Updates
Immutable properties include sparkApplicationId, location, project, serviceinstance, namespace, applicationEnvironment, all application configs (sparkApplicationConfig, pysparkApplicationConfig, etc.), properties, annotations, displayName, dependencyImages, and version. Only labels can be updated in place. Changing any immutable property (such as namespace, an application config, or sparkApplicationId) forces Pulumi to replace the entire resource.
Labels & Annotations
The labels and annotations fields are non-authoritative and only manage the values set in your configuration. To see all labels and annotations, including those set by other clients or services, use effectiveLabels and effectiveAnnotations.
Prerequisites & Dependencies
The Kubernetes namespace referenced by the namespace property must already exist on the cluster before the Spark application is created.
Using a different cloud?
Explore analytics guides for other cloud providers: