The gcp:dataproc/gdcSparkApplication:GdcSparkApplication resource, part of the Pulumi GCP provider, defines a Spark application workload that runs on a Dataproc GDC cluster. This guide focuses on three capabilities: JAR-based Spark jobs, PySpark applications with dependencies, and SparkSQL query execution.
Spark applications run on an existing Dataproc GDC service instance within a Kubernetes namespace. They may reference Cloud Storage paths for code and data, or container images for dependencies. The examples are intentionally small. Combine them with your own service instance, namespace, and storage configuration.
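Because every example shares the same service instance, project, location, and namespace, one practical pattern is to read those values from Pulumi configuration instead of hard-coding them. The sketch below assumes hypothetical configuration keys named project, serviceinstance, location, and namespace; set them with pulumi config set before deploying.
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";

// Hypothetical configuration keys; adjust the names and defaults to your environment.
const cfg = new pulumi.Config();
const project = cfg.require("project");
const serviceInstance = cfg.require("serviceinstance");
const location = cfg.get("location") ?? "us-west2";
const namespace = cfg.get("namespace") ?? "default";

// Any of the applications below can then take the shared settings from configuration.
const configuredApp = new gcp.dataproc.GdcSparkApplication("spark-application", {
    sparkApplicationId: "config-driven-spark-app",
    serviceinstance: serviceInstance,
    project: project,
    location: location,
    namespace: namespace,
    sparkApplicationConfig: {
        mainClass: "org.apache.spark.examples.SparkPi",
        jarFileUris: ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
    },
});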
Run a Spark job with a JAR and main class
Most Spark workloads start by specifying a JAR file containing compiled code and the main class to execute. This is the standard approach for running Spark applications written in Java or Scala.
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";
const spark_application = new gcp.dataproc.GdcSparkApplication("spark-application", {
sparkApplicationId: "tf-e2e-spark-app-basic",
serviceinstance: "do-not-delete-dataproc-gdc-instance",
project: "my-project",
location: "us-west2",
namespace: "default",
sparkApplicationConfig: {
mainClass: "org.apache.spark.examples.SparkPi",
jarFileUris: ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
args: ["10000"],
},
});
import pulumi
import pulumi_gcp as gcp
spark_application = gcp.dataproc.GdcSparkApplication("spark-application",
spark_application_id="tf-e2e-spark-app-basic",
serviceinstance="do-not-delete-dataproc-gdc-instance",
project="my-project",
location="us-west2",
namespace="default",
spark_application_config={
"main_class": "org.apache.spark.examples.SparkPi",
"jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
"args": ["10000"],
})
package main
import (
"github.com/pulumi/pulumi-gcp/sdk/v9/go/gcp/dataproc"
"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)
func main() {
pulumi.Run(func(ctx *pulumi.Context) error {
_, err := dataproc.NewGdcSparkApplication(ctx, "spark-application", &dataproc.GdcSparkApplicationArgs{
SparkApplicationId: pulumi.String("tf-e2e-spark-app-basic"),
Serviceinstance: pulumi.String("do-not-delete-dataproc-gdc-instance"),
Project: pulumi.String("my-project"),
Location: pulumi.String("us-west2"),
Namespace: pulumi.String("default"),
SparkApplicationConfig: &dataproc.GdcSparkApplicationSparkApplicationConfigArgs{
MainClass: pulumi.String("org.apache.spark.examples.SparkPi"),
JarFileUris: pulumi.StringArray{
pulumi.String("file:///usr/lib/spark/examples/jars/spark-examples.jar"),
},
Args: pulumi.StringArray{
pulumi.String("10000"),
},
},
})
if err != nil {
return err
}
return nil
})
}
using System.Collections.Generic;
using System.Linq;
using Pulumi;
using Gcp = Pulumi.Gcp;
return await Deployment.RunAsync(() =>
{
var spark_application = new Gcp.Dataproc.GdcSparkApplication("spark-application", new()
{
SparkApplicationId = "tf-e2e-spark-app-basic",
Serviceinstance = "do-not-delete-dataproc-gdc-instance",
Project = "my-project",
Location = "us-west2",
Namespace = "default",
SparkApplicationConfig = new Gcp.Dataproc.Inputs.GdcSparkApplicationSparkApplicationConfigArgs
{
MainClass = "org.apache.spark.examples.SparkPi",
JarFileUris = new[]
{
"file:///usr/lib/spark/examples/jars/spark-examples.jar",
},
Args = new[]
{
"10000",
},
},
});
});
package generated_program;
import com.pulumi.Context;
import com.pulumi.Pulumi;
import com.pulumi.core.Output;
import com.pulumi.gcp.dataproc.GdcSparkApplication;
import com.pulumi.gcp.dataproc.GdcSparkApplicationArgs;
import com.pulumi.gcp.dataproc.inputs.GdcSparkApplicationSparkApplicationConfigArgs;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
public class App {
public static void main(String[] args) {
Pulumi.run(App::stack);
}
public static void stack(Context ctx) {
var spark_application = new GdcSparkApplication("spark-application", GdcSparkApplicationArgs.builder()
.sparkApplicationId("tf-e2e-spark-app-basic")
.serviceinstance("do-not-delete-dataproc-gdc-instance")
.project("my-project")
.location("us-west2")
.namespace("default")
.sparkApplicationConfig(GdcSparkApplicationSparkApplicationConfigArgs.builder()
.mainClass("org.apache.spark.examples.SparkPi")
.jarFileUris("file:///usr/lib/spark/examples/jars/spark-examples.jar")
.args("10000")
.build())
.build());
}
}
resources:
spark-application:
type: gcp:dataproc:GdcSparkApplication
properties:
sparkApplicationId: tf-e2e-spark-app-basic
serviceinstance: do-not-delete-dataproc-gdc-instance
project: my-project
location: us-west2
namespace: default
sparkApplicationConfig:
mainClass: org.apache.spark.examples.SparkPi
jarFileUris:
- file:///usr/lib/spark/examples/jars/spark-examples.jar
args:
- '10000'
When the application runs, Dataproc GDC loads the JAR from the specified path, invokes the mainClass entry point, and passes the args array as command-line arguments. The sparkApplicationConfig block defines the JAR location, entry point, and runtime parameters for Java or Scala applications.
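If the JAR's manifest already names the entry point, the config can instead point at the JAR with mainJarFileUri rather than pairing mainClass with jarFileUris. A minimal TypeScript sketch of that variant; the JAR path is illustrative and assumes its manifest declares a Main-Class.
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";

// Illustrative variant: the entry point comes from the JAR's manifest rather than mainClass.
const manifestDrivenApp = new gcp.dataproc.GdcSparkApplication("spark-application-main-jar", {
    sparkApplicationId: "spark-app-main-jar",
    serviceinstance: "do-not-delete-dataproc-gdc-instance",
    project: "my-project",
    location: "us-west2",
    namespace: "default",
    sparkApplicationConfig: {
        mainJarFileUri: "file:///path/to/my-assembly.jar",
        args: ["10000"],
    },
});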
Run a PySpark job with Python dependencies
Python-based Spark applications require a different configuration that points to Python files rather than JARs. Teams often need to include additional Python modules and dependency images for their workloads.
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";
const spark_application = new gcp.dataproc.GdcSparkApplication("spark-application", {
sparkApplicationId: "tf-e2e-pyspark-app",
serviceinstance: "do-not-delete-dataproc-gdc-instance",
project: "my-project",
location: "us-west2",
namespace: "default",
displayName: "A Pyspark application for a Terraform create test",
dependencyImages: ["gcr.io/some/image"],
pysparkApplicationConfig: {
mainPythonFileUri: "gs://goog-dataproc-initialization-actions-us-west2/conda/test_conda.py",
jarFileUris: ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
pythonFileUris: ["gs://goog-dataproc-initialization-actions-us-west2/conda/get-sys-exec.py"],
fileUris: ["file://usr/lib/spark/examples/spark-examples.jar"],
archiveUris: ["file://usr/lib/spark/examples/spark-examples.jar"],
args: ["10"],
},
});
import pulumi
import pulumi_gcp as gcp
spark_application = gcp.dataproc.GdcSparkApplication("spark-application",
spark_application_id="tf-e2e-pyspark-app",
serviceinstance="do-not-delete-dataproc-gdc-instance",
project="my-project",
location="us-west2",
namespace="default",
display_name="A Pyspark application for a Terraform create test",
dependency_images=["gcr.io/some/image"],
pyspark_application_config={
"main_python_file_uri": "gs://goog-dataproc-initialization-actions-us-west2/conda/test_conda.py",
"jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
"python_file_uris": ["gs://goog-dataproc-initialization-actions-us-west2/conda/get-sys-exec.py"],
"file_uris": ["file://usr/lib/spark/examples/spark-examples.jar"],
"archive_uris": ["file://usr/lib/spark/examples/spark-examples.jar"],
"args": ["10"],
})
package main
import (
"github.com/pulumi/pulumi-gcp/sdk/v9/go/gcp/dataproc"
"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)
func main() {
pulumi.Run(func(ctx *pulumi.Context) error {
_, err := dataproc.NewGdcSparkApplication(ctx, "spark-application", &dataproc.GdcSparkApplicationArgs{
SparkApplicationId: pulumi.String("tf-e2e-pyspark-app"),
Serviceinstance: pulumi.String("do-not-delete-dataproc-gdc-instance"),
Project: pulumi.String("my-project"),
Location: pulumi.String("us-west2"),
Namespace: pulumi.String("default"),
DisplayName: pulumi.String("A Pyspark application for a Terraform create test"),
DependencyImages: pulumi.StringArray{
pulumi.String("gcr.io/some/image"),
},
PysparkApplicationConfig: &dataproc.GdcSparkApplicationPysparkApplicationConfigArgs{
MainPythonFileUri: pulumi.String("gs://goog-dataproc-initialization-actions-us-west2/conda/test_conda.py"),
JarFileUris: pulumi.StringArray{
pulumi.String("file:///usr/lib/spark/examples/jars/spark-examples.jar"),
},
PythonFileUris: pulumi.StringArray{
pulumi.String("gs://goog-dataproc-initialization-actions-us-west2/conda/get-sys-exec.py"),
},
FileUris: pulumi.StringArray{
pulumi.String("file://usr/lib/spark/examples/spark-examples.jar"),
},
ArchiveUris: pulumi.StringArray{
pulumi.String("file://usr/lib/spark/examples/spark-examples.jar"),
},
Args: pulumi.StringArray{
pulumi.String("10"),
},
},
})
if err != nil {
return err
}
return nil
})
}
using System.Collections.Generic;
using System.Linq;
using Pulumi;
using Gcp = Pulumi.Gcp;
return await Deployment.RunAsync(() =>
{
var spark_application = new Gcp.Dataproc.GdcSparkApplication("spark-application", new()
{
SparkApplicationId = "tf-e2e-pyspark-app",
Serviceinstance = "do-not-delete-dataproc-gdc-instance",
Project = "my-project",
Location = "us-west2",
Namespace = "default",
DisplayName = "A Pyspark application for a Terraform create test",
DependencyImages = new[]
{
"gcr.io/some/image",
},
PysparkApplicationConfig = new Gcp.Dataproc.Inputs.GdcSparkApplicationPysparkApplicationConfigArgs
{
MainPythonFileUri = "gs://goog-dataproc-initialization-actions-us-west2/conda/test_conda.py",
JarFileUris = new[]
{
"file:///usr/lib/spark/examples/jars/spark-examples.jar",
},
PythonFileUris = new[]
{
"gs://goog-dataproc-initialization-actions-us-west2/conda/get-sys-exec.py",
},
FileUris = new[]
{
"file://usr/lib/spark/examples/spark-examples.jar",
},
ArchiveUris = new[]
{
"file://usr/lib/spark/examples/spark-examples.jar",
},
Args = new[]
{
"10",
},
},
});
});
package generated_program;
import com.pulumi.Context;
import com.pulumi.Pulumi;
import com.pulumi.core.Output;
import com.pulumi.gcp.dataproc.GdcSparkApplication;
import com.pulumi.gcp.dataproc.GdcSparkApplicationArgs;
import com.pulumi.gcp.dataproc.inputs.GdcSparkApplicationPysparkApplicationConfigArgs;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
public class App {
public static void main(String[] args) {
Pulumi.run(App::stack);
}
public static void stack(Context ctx) {
var spark_application = new GdcSparkApplication("spark-application", GdcSparkApplicationArgs.builder()
.sparkApplicationId("tf-e2e-pyspark-app")
.serviceinstance("do-not-delete-dataproc-gdc-instance")
.project("my-project")
.location("us-west2")
.namespace("default")
.displayName("A Pyspark application for a Terraform create test")
.dependencyImages("gcr.io/some/image")
.pysparkApplicationConfig(GdcSparkApplicationPysparkApplicationConfigArgs.builder()
.mainPythonFileUri("gs://goog-dataproc-initialization-actions-us-west2/conda/test_conda.py")
.jarFileUris("file:///usr/lib/spark/examples/jars/spark-examples.jar")
.pythonFileUris("gs://goog-dataproc-initialization-actions-us-west2/conda/get-sys-exec.py")
.fileUris("file://usr/lib/spark/examples/spark-examples.jar")
.archiveUris("file://usr/lib/spark/examples/spark-examples.jar")
.args("10")
.build())
.build());
}
}
resources:
spark-application:
type: gcp:dataproc:GdcSparkApplication
properties:
sparkApplicationId: tf-e2e-pyspark-app
serviceinstance: do-not-delete-dataproc-gdc-instance
project: my-project
location: us-west2
namespace: default
displayName: A Pyspark application for a Terraform create test
dependencyImages:
- gcr.io/some/image
pysparkApplicationConfig:
mainPythonFileUri: gs://goog-dataproc-initialization-actions-us-west2/conda/test_conda.py
jarFileUris:
- file:///usr/lib/spark/examples/jars/spark-examples.jar
pythonFileUris:
- gs://goog-dataproc-initialization-actions-us-west2/conda/get-sys-exec.py
fileUris:
- file://usr/lib/spark/examples/spark-examples.jar
archiveUris:
- file://usr/lib/spark/examples/spark-examples.jar
args:
- '10'
The pysparkApplicationConfig block specifies the main Python file via mainPythonFileUri, additional Python modules via pythonFileUris, and container images via dependencyImages. Dataproc GDC copies files from each dependency image sequentially; if multiple images contain the same filename, the later image’s version is used.
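The main Python file does not have to be staged by hand; the same program can upload it to Cloud Storage and pass the resulting URI to the application. A sketch assuming a hypothetical local script ./jobs/etl.py and a bucket created alongside the application:
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";

// Hypothetical bucket and local script; adjust names and paths to your environment.
const scriptsBucket = new gcp.storage.Bucket("pyspark-scripts", {
    location: "US-WEST2",
    forceDestroy: true,
});

const mainScript = new gcp.storage.BucketObject("etl-main", {
    bucket: scriptsBucket.name,
    name: "jobs/etl.py",
    source: new pulumi.asset.FileAsset("./jobs/etl.py"),
});

const pysparkApp = new gcp.dataproc.GdcSparkApplication("pyspark-from-bucket", {
    sparkApplicationId: "pyspark-from-bucket",
    serviceinstance: "do-not-delete-dataproc-gdc-instance",
    project: "my-project",
    location: "us-west2",
    namespace: "default",
    pysparkApplicationConfig: {
        // Build the gs:// URI from the bucket and object names once they are known.
        mainPythonFileUri: pulumi.interpolate`gs://${scriptsBucket.name}/${mainScript.name}`,
    },
});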
Execute SQL queries with inline query definitions
Analytics teams often need to run SQL queries against Spark data sources. SparkSQL applications can execute queries defined inline or loaded from files, with variable substitution for parameterization.
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";
const spark_application = new gcp.dataproc.GdcSparkApplication("spark-application", {
sparkApplicationId: "tf-e2e-sparksql-app",
serviceinstance: "do-not-delete-dataproc-gdc-instance",
project: "my-project",
location: "us-west2",
namespace: "default",
displayName: "A SparkSql application for a Terraform create test",
sparkSqlApplicationConfig: {
jarFileUris: ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
queryList: {
queries: ["show tables;"],
},
scriptVariables: {
MY_VAR: "1",
},
},
});
import pulumi
import pulumi_gcp as gcp
spark_application = gcp.dataproc.GdcSparkApplication("spark-application",
spark_application_id="tf-e2e-sparksql-app",
serviceinstance="do-not-delete-dataproc-gdc-instance",
project="my-project",
location="us-west2",
namespace="default",
display_name="A SparkSql application for a Terraform create test",
spark_sql_application_config={
"jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
"query_list": {
"queries": ["show tables;"],
},
"script_variables": {
"MY_VAR": "1",
},
})
package main
import (
"github.com/pulumi/pulumi-gcp/sdk/v9/go/gcp/dataproc"
"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)
func main() {
pulumi.Run(func(ctx *pulumi.Context) error {
_, err := dataproc.NewGdcSparkApplication(ctx, "spark-application", &dataproc.GdcSparkApplicationArgs{
SparkApplicationId: pulumi.String("tf-e2e-sparksql-app"),
Serviceinstance: pulumi.String("do-not-delete-dataproc-gdc-instance"),
Project: pulumi.String("my-project"),
Location: pulumi.String("us-west2"),
Namespace: pulumi.String("default"),
DisplayName: pulumi.String("A SparkSql application for a Terraform create test"),
SparkSqlApplicationConfig: &dataproc.GdcSparkApplicationSparkSqlApplicationConfigArgs{
JarFileUris: pulumi.StringArray{
pulumi.String("file:///usr/lib/spark/examples/jars/spark-examples.jar"),
},
QueryList: &dataproc.GdcSparkApplicationSparkSqlApplicationConfigQueryListArgs{
Queries: pulumi.StringArray{
pulumi.String("show tables;"),
},
},
ScriptVariables: pulumi.StringMap{
"MY_VAR": pulumi.String("1"),
},
},
})
if err != nil {
return err
}
return nil
})
}
using System.Collections.Generic;
using System.Linq;
using Pulumi;
using Gcp = Pulumi.Gcp;
return await Deployment.RunAsync(() =>
{
var spark_application = new Gcp.Dataproc.GdcSparkApplication("spark-application", new()
{
SparkApplicationId = "tf-e2e-sparksql-app",
Serviceinstance = "do-not-delete-dataproc-gdc-instance",
Project = "my-project",
Location = "us-west2",
Namespace = "default",
DisplayName = "A SparkSql application for a Terraform create test",
SparkSqlApplicationConfig = new Gcp.Dataproc.Inputs.GdcSparkApplicationSparkSqlApplicationConfigArgs
{
JarFileUris = new[]
{
"file:///usr/lib/spark/examples/jars/spark-examples.jar",
},
QueryList = new Gcp.Dataproc.Inputs.GdcSparkApplicationSparkSqlApplicationConfigQueryListArgs
{
Queries = new[]
{
"show tables;",
},
},
ScriptVariables =
{
{ "MY_VAR", "1" },
},
},
});
});
package generated_program;
import com.pulumi.Context;
import com.pulumi.Pulumi;
import com.pulumi.core.Output;
import com.pulumi.gcp.dataproc.GdcSparkApplication;
import com.pulumi.gcp.dataproc.GdcSparkApplicationArgs;
import com.pulumi.gcp.dataproc.inputs.GdcSparkApplicationSparkSqlApplicationConfigArgs;
import com.pulumi.gcp.dataproc.inputs.GdcSparkApplicationSparkSqlApplicationConfigQueryListArgs;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
public class App {
public static void main(String[] args) {
Pulumi.run(App::stack);
}
public static void stack(Context ctx) {
var spark_application = new GdcSparkApplication("spark-application", GdcSparkApplicationArgs.builder()
.sparkApplicationId("tf-e2e-sparksql-app")
.serviceinstance("do-not-delete-dataproc-gdc-instance")
.project("my-project")
.location("us-west2")
.namespace("default")
.displayName("A SparkSql application for a Terraform create test")
.sparkSqlApplicationConfig(GdcSparkApplicationSparkSqlApplicationConfigArgs.builder()
.jarFileUris("file:///usr/lib/spark/examples/jars/spark-examples.jar")
.queryList(GdcSparkApplicationSparkSqlApplicationConfigQueryListArgs.builder()
.queries("show tables;")
.build())
.scriptVariables(Map.of("MY_VAR", "1"))
.build())
.build());
}
}
resources:
spark-application:
type: gcp:dataproc:GdcSparkApplication
properties:
sparkApplicationId: tf-e2e-sparksql-app
serviceinstance: do-not-delete-dataproc-gdc-instance
project: my-project
location: us-west2
namespace: default
displayName: A SparkSql application for a Terraform create test
sparkSqlApplicationConfig:
jarFileUris:
- file:///usr/lib/spark/examples/jars/spark-examples.jar
queryList:
queries:
- show tables;
scriptVariables:
MY_VAR: '1'
The sparkSqlApplicationConfig block defines SQL queries inline via queryList and provides variable substitution through scriptVariables. At runtime, Dataproc GDC replaces variable references in the queries with the provided values before execution.
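Script variables behave like Spark SQL substitution variables (the equivalent of SET name=value), so a query can reference them with ${...} placeholders. A hedged sketch with a hypothetical table name; it assumes Spark SQL variable substitution is enabled, which is the runtime default.
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";

const parameterizedSql = new gcp.dataproc.GdcSparkApplication("sparksql-parameterized", {
    sparkApplicationId: "sparksql-parameterized",
    serviceinstance: "do-not-delete-dataproc-gdc-instance",
    project: "my-project",
    location: "us-west2",
    namespace: "default",
    sparkSqlApplicationConfig: {
        queryList: {
            // `sales` is a hypothetical table; ${MY_VAR} is substituted with "100" at runtime.
            queries: ["SELECT * FROM sales LIMIT ${MY_VAR};"],
        },
        scriptVariables: {
            MY_VAR: "100",
        },
    },
});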
Execute SQL queries from a file in Cloud Storage
For complex SQL workloads, teams store queries in files rather than embedding them inline. This approach supports version control and reuse across multiple applications.
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";
const spark_application = new gcp.dataproc.GdcSparkApplication("spark-application", {
sparkApplicationId: "tf-e2e-sparksql-app",
serviceinstance: "do-not-delete-dataproc-gdc-instance",
project: "my-project",
location: "us-west2",
namespace: "default",
displayName: "A SparkSql application for a Terraform create test",
sparkSqlApplicationConfig: {
jarFileUris: ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
queryFileUri: "gs://some-bucket/something.sql",
scriptVariables: {
MY_VAR: "1",
},
},
});
import pulumi
import pulumi_gcp as gcp
spark_application = gcp.dataproc.GdcSparkApplication("spark-application",
spark_application_id="tf-e2e-sparksql-app",
serviceinstance="do-not-delete-dataproc-gdc-instance",
project="my-project",
location="us-west2",
namespace="default",
display_name="A SparkSql application for a Terraform create test",
spark_sql_application_config={
"jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
"query_file_uri": "gs://some-bucket/something.sql",
"script_variables": {
"MY_VAR": "1",
},
})
package main
import (
"github.com/pulumi/pulumi-gcp/sdk/v9/go/gcp/dataproc"
"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)
func main() {
pulumi.Run(func(ctx *pulumi.Context) error {
_, err := dataproc.NewGdcSparkApplication(ctx, "spark-application", &dataproc.GdcSparkApplicationArgs{
SparkApplicationId: pulumi.String("tf-e2e-sparksql-app"),
Serviceinstance: pulumi.String("do-not-delete-dataproc-gdc-instance"),
Project: pulumi.String("my-project"),
Location: pulumi.String("us-west2"),
Namespace: pulumi.String("default"),
DisplayName: pulumi.String("A SparkSql application for a Terraform create test"),
SparkSqlApplicationConfig: &dataproc.GdcSparkApplicationSparkSqlApplicationConfigArgs{
JarFileUris: pulumi.StringArray{
pulumi.String("file:///usr/lib/spark/examples/jars/spark-examples.jar"),
},
QueryFileUri: pulumi.String("gs://some-bucket/something.sql"),
ScriptVariables: pulumi.StringMap{
"MY_VAR": pulumi.String("1"),
},
},
})
if err != nil {
return err
}
return nil
})
}
using System.Collections.Generic;
using System.Linq;
using Pulumi;
using Gcp = Pulumi.Gcp;
return await Deployment.RunAsync(() =>
{
var spark_application = new Gcp.Dataproc.GdcSparkApplication("spark-application", new()
{
SparkApplicationId = "tf-e2e-sparksql-app",
Serviceinstance = "do-not-delete-dataproc-gdc-instance",
Project = "my-project",
Location = "us-west2",
Namespace = "default",
DisplayName = "A SparkSql application for a Terraform create test",
SparkSqlApplicationConfig = new Gcp.Dataproc.Inputs.GdcSparkApplicationSparkSqlApplicationConfigArgs
{
JarFileUris = new[]
{
"file:///usr/lib/spark/examples/jars/spark-examples.jar",
},
QueryFileUri = "gs://some-bucket/something.sql",
ScriptVariables =
{
{ "MY_VAR", "1" },
},
},
});
});
package generated_program;
import com.pulumi.Context;
import com.pulumi.Pulumi;
import com.pulumi.core.Output;
import com.pulumi.gcp.dataproc.GdcSparkApplication;
import com.pulumi.gcp.dataproc.GdcSparkApplicationArgs;
import com.pulumi.gcp.dataproc.inputs.GdcSparkApplicationSparkSqlApplicationConfigArgs;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
public class App {
public static void main(String[] args) {
Pulumi.run(App::stack);
}
public static void stack(Context ctx) {
var spark_application = new GdcSparkApplication("spark-application", GdcSparkApplicationArgs.builder()
.sparkApplicationId("tf-e2e-sparksql-app")
.serviceinstance("do-not-delete-dataproc-gdc-instance")
.project("my-project")
.location("us-west2")
.namespace("default")
.displayName("A SparkSql application for a Terraform create test")
.sparkSqlApplicationConfig(GdcSparkApplicationSparkSqlApplicationConfigArgs.builder()
.jarFileUris("file:///usr/lib/spark/examples/jars/spark-examples.jar")
.queryFileUri("gs://some-bucket/something.sql")
.scriptVariables(Map.of("MY_VAR", "1"))
.build())
.build());
}
}
resources:
spark-application:
type: gcp:dataproc:GdcSparkApplication
properties:
sparkApplicationId: tf-e2e-sparksql-app
serviceinstance: do-not-delete-dataproc-gdc-instance
project: my-project
location: us-west2
namespace: default
displayName: A SparkSql application for a Terraform create test
sparkSqlApplicationConfig:
jarFileUris:
- file:///usr/lib/spark/examples/jars/spark-examples.jar
queryFileUri: gs://some-bucket/something.sql
scriptVariables:
MY_VAR: '1'
Instead of queryList, the queryFileUri property points to a SQL file in Cloud Storage. This configuration extends the inline query pattern by loading queries from external files, enabling teams to manage SQL code separately from application definitions.
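Because the SQL lives outside the resource, several applications can point at the same file and differ only in their script variables. A minimal sketch; the bucket path, resource names, and variable values are illustrative.
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";

// One SQL file, two applications: each run substitutes its own value for MY_VAR.
const sharedQuery = "gs://some-bucket/something.sql";

const dailyReport = new gcp.dataproc.GdcSparkApplication("sparksql-daily", {
    sparkApplicationId: "sparksql-daily",
    serviceinstance: "do-not-delete-dataproc-gdc-instance",
    project: "my-project",
    location: "us-west2",
    namespace: "default",
    sparkSqlApplicationConfig: {
        queryFileUri: sharedQuery,
        scriptVariables: { MY_VAR: "1" },
    },
});

const backfillReport = new gcp.dataproc.GdcSparkApplication("sparksql-backfill", {
    sparkApplicationId: "sparksql-backfill",
    serviceinstance: "do-not-delete-dataproc-gdc-instance",
    project: "my-project",
    location: "us-west2",
    namespace: "default",
    sparkSqlApplicationConfig: {
        queryFileUri: sharedQuery,
        scriptVariables: { MY_VAR: "30" },
    },
});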
Beyond these examples
These snippets focus on specific Spark application features: Spark job types (JAR, PySpark, SparkSQL), dependency management (JARs, Python files, container images), and SQL query execution (inline and file-based). They’re intentionally minimal rather than full data processing pipelines.
The examples reference pre-existing infrastructure such as Dataproc GDC service instances, Kubernetes namespaces on the cluster, Cloud Storage buckets (for PySpark and SparkSQL examples), and container registries (for dependency images). They focus on configuring the application rather than provisioning the underlying platform.
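When a separate stack provisions the platform, the application stack can pull the instance and namespace from that stack's outputs instead of hard-coding them. A sketch assuming a hypothetical platform stack my-org/gdc-platform/prod that exports serviceInstanceName and sparkNamespace:
import * as pulumi from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";

// Hypothetical platform stack and output names; adjust to your own stacks.
const platform = new pulumi.StackReference("my-org/gdc-platform/prod");
const serviceInstance = platform.requireOutput("serviceInstanceName").apply(v => v as string);
const namespace = platform.requireOutput("sparkNamespace").apply(v => v as string);

const wiredApp = new gcp.dataproc.GdcSparkApplication("spark-application", {
    sparkApplicationId: "platform-wired-spark-app",
    serviceinstance: serviceInstance,
    namespace: namespace,
    project: "my-project",
    location: "us-west2",
    sparkApplicationConfig: {
        mainClass: "org.apache.spark.examples.SparkPi",
        jarFileUris: ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
    },
});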
To keep things focused, common Spark application patterns are omitted, including:
- Application environments for shared configuration (applicationEnvironment)
- Spark properties tuning (properties block)
- Labels and annotations for organization
- SparkR applications (sparkRApplicationConfig)
- Advanced file dependencies (archiveUris, fileUris)
These omissions are intentional: the goal is to illustrate how each Spark application type is wired, not provide drop-in data processing modules. See the GDC Spark Application resource reference for all available configuration options.
Frequently Asked Questions
Configuration & Application Types
- Choose the config block that matches the workload: sparkApplicationConfig for Java/Scala applications, pysparkApplicationConfig for Python, sparkRApplicationConfig for R, or sparkSqlApplicationConfig for SQL queries.
- For JAR-based jobs, sparkApplicationConfig takes either mainClass (e.g., org.apache.spark.examples.SparkPi) or mainJarFileUri to point to your JAR file.
- For SQL jobs, use queryList for inline SQL queries or queryFileUri to reference an external SQL file.
Immutability & Updates
- Immutable properties include location, project, serviceinstance, sparkApplicationId, namespace, annotations, applicationEnvironment, dependencyImages, properties, version, and all application config objects. Only labels can be updated; changing an immutable property forces resource recreation.
Labels & Annotations
- labels contains only the labels you manage in your Pulumi configuration, while effectiveLabels includes all labels from all sources (Pulumi, other clients, and GCP services). Both fields are non-authoritative.
- The annotations field is likewise non-authoritative and only manages annotations in your configuration. Use effectiveAnnotations to see all annotations from all sources.
Prerequisites & Setup
- The Kubernetes namespace named by the namespace property must already exist on the cluster before you create the Spark application.