The aws:kendra/dataSource:DataSource resource, part of the Pulumi AWS provider, connects a Kendra index to document repositories: S3 buckets, websites, or custom ingestion pipelines. This guide focuses on three capabilities: S3 bucket indexing with filtering, web crawler configuration, and scheduled synchronization.
Data sources require an existing Kendra index and, for every type except CUSTOM, an IAM role with permissions to read from the repository. The examples are intentionally small. Combine them with your own index, IAM roles, and repository configuration.
Create a custom data source with manual sync
Teams building custom ingestion pipelines often start with a CUSTOM data source that accepts documents through the API rather than connecting to a repository.
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
const example = new aws.kendra.DataSource("example", {
indexId: exampleAwsKendraIndex.id,
name: "example",
description: "example",
languageCode: "en",
type: "CUSTOM",
tags: {
hello: "world",
},
});
import pulumi
import pulumi_aws as aws
example = aws.kendra.DataSource("example",
index_id=example_aws_kendra_index["id"],
name="example",
description="example",
language_code="en",
type="CUSTOM",
tags={
"hello": "world",
})
package main
import (
"github.com/pulumi/pulumi-aws/sdk/v7/go/aws/kendra"
"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)
func main() {
pulumi.Run(func(ctx *pulumi.Context) error {
_, err := kendra.NewDataSource(ctx, "example", &kendra.DataSourceArgs{
IndexId: pulumi.Any(exampleAwsKendraIndex.Id),
Name: pulumi.String("example"),
Description: pulumi.String("example"),
LanguageCode: pulumi.String("en"),
Type: pulumi.String("CUSTOM"),
Tags: pulumi.StringMap{
"hello": pulumi.String("world"),
},
})
if err != nil {
return err
}
return nil
})
}
using System.Collections.Generic;
using System.Linq;
using Pulumi;
using Aws = Pulumi.Aws;
return await Deployment.RunAsync(() =>
{
var example = new Aws.Kendra.DataSource("example", new()
{
IndexId = exampleAwsKendraIndex.Id,
Name = "example",
Description = "example",
LanguageCode = "en",
Type = "CUSTOM",
Tags =
{
{ "hello", "world" },
},
});
});
package generated_program;
import com.pulumi.Context;
import com.pulumi.Pulumi;
import com.pulumi.core.Output;
import com.pulumi.aws.kendra.DataSource;
import com.pulumi.aws.kendra.DataSourceArgs;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
public class App {
public static void main(String[] args) {
Pulumi.run(App::stack);
}
public static void stack(Context ctx) {
var example = new DataSource("example", DataSourceArgs.builder()
.indexId(exampleAwsKendraIndex.id())
.name("example")
.description("example")
.languageCode("en")
.type("CUSTOM")
.tags(Map.of("hello", "world"))
.build());
}
}
resources:
  example:
    type: aws:kendra:DataSource
    properties:
      indexId: ${exampleAwsKendraIndex.id}
      name: example
      description: example
      languageCode: en
      type: CUSTOM
      tags:
        hello: world
Setting the type property to CUSTOM indicates that this data source receives documents programmatically. Without a schedule property, you control synchronization by calling the StartDataSourceSyncJob API. Custom data sources take neither roleArn nor a configuration block, since they don’t connect to an external repository.
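The relationship between type, roleArn, and configuration can be checked before a deployment. The following is a minimal sketch, not part of the Pulumi SDK: the DataSourceSketch interface and the checkDataSourceArgs helper are hypothetical names, and the rules encoded are only the ones stated in this guide (CUSTOM takes neither roleArn nor configuration; every other type needs a roleArn).

```typescript
// Hypothetical, simplified view of the arguments discussed in this guide.
// This is not the Pulumi DataSourceArgs type.
interface DataSourceSketch {
    type: string;
    roleArn?: string;
    configuration?: object;
    schedule?: string;
}

// Returns a list of problems; an empty list means the combination looks consistent.
function checkDataSourceArgs(args: DataSourceSketch): string[] {
    const problems: string[] = [];
    if (args.type === "CUSTOM") {
        // CUSTOM sources receive documents through the API, so no role or connector settings.
        if (args.roleArn !== undefined) {
            problems.push("CUSTOM data sources cannot set roleArn");
        }
        if (args.configuration !== undefined) {
            problems.push("CUSTOM data sources cannot set configuration");
        }
    } else if (args.roleArn === undefined) {
        // Every other type reads from a repository and needs a role to do so.
        problems.push(args.type + " data sources require roleArn");
    }
    return problems;
}

console.log(checkDataSourceArgs({ type: "CUSTOM" }));
console.log(checkDataSourceArgs({ type: "S3" }));
```

A check like this could run in a unit test alongside the Pulumi program so that an invalid combination fails early rather than during deployment.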
Index S3 documents on a schedule
Document repositories in S3 need periodic synchronization to keep the search index current as files are added or updated.
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
const example = new aws.kendra.DataSource("example", {
indexId: exampleAwsKendraIndex.id,
name: "example",
type: "S3",
roleArn: exampleAwsIamRole.arn,
schedule: "cron(9 10 1 * ? *)",
configuration: {
s3Configuration: {
bucketName: exampleAwsS3Bucket.id,
},
},
});
import pulumi
import pulumi_aws as aws
example = aws.kendra.DataSource("example",
index_id=example_aws_kendra_index["id"],
name="example",
type="S3",
role_arn=example_aws_iam_role["arn"],
schedule="cron(9 10 1 * ? *)",
configuration={
"s3_configuration": {
"bucket_name": example_aws_s3_bucket["id"],
},
})
package main
import (
"github.com/pulumi/pulumi-aws/sdk/v7/go/aws/kendra"
"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)
func main() {
pulumi.Run(func(ctx *pulumi.Context) error {
_, err := kendra.NewDataSource(ctx, "example", &kendra.DataSourceArgs{
IndexId: pulumi.Any(exampleAwsKendraIndex.Id),
Name: pulumi.String("example"),
Type: pulumi.String("S3"),
RoleArn: pulumi.Any(exampleAwsIamRole.Arn),
Schedule: pulumi.String("cron(9 10 1 * ? *)"),
Configuration: &kendra.DataSourceConfigurationArgs{
S3Configuration: &kendra.DataSourceConfigurationS3ConfigurationArgs{
BucketName: pulumi.Any(exampleAwsS3Bucket.Id),
},
},
})
if err != nil {
return err
}
return nil
})
}
using System.Collections.Generic;
using System.Linq;
using Pulumi;
using Aws = Pulumi.Aws;
return await Deployment.RunAsync(() =>
{
var example = new Aws.Kendra.DataSource("example", new()
{
IndexId = exampleAwsKendraIndex.Id,
Name = "example",
Type = "S3",
RoleArn = exampleAwsIamRole.Arn,
Schedule = "cron(9 10 1 * ? *)",
Configuration = new Aws.Kendra.Inputs.DataSourceConfigurationArgs
{
S3Configuration = new Aws.Kendra.Inputs.DataSourceConfigurationS3ConfigurationArgs
{
BucketName = exampleAwsS3Bucket.Id,
},
},
});
});
package generated_program;
import com.pulumi.Context;
import com.pulumi.Pulumi;
import com.pulumi.core.Output;
import com.pulumi.aws.kendra.DataSource;
import com.pulumi.aws.kendra.DataSourceArgs;
import com.pulumi.aws.kendra.inputs.DataSourceConfigurationArgs;
import com.pulumi.aws.kendra.inputs.DataSourceConfigurationS3ConfigurationArgs;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
public class App {
public static void main(String[] args) {
Pulumi.run(App::stack);
}
public static void stack(Context ctx) {
var example = new DataSource("example", DataSourceArgs.builder()
.indexId(exampleAwsKendraIndex.id())
.name("example")
.type("S3")
.roleArn(exampleAwsIamRole.arn())
.schedule("cron(9 10 1 * ? *)")
.configuration(DataSourceConfigurationArgs.builder()
.s3Configuration(DataSourceConfigurationS3ConfigurationArgs.builder()
.bucketName(exampleAwsS3Bucket.id())
.build())
.build())
.build());
}
}
resources:
  example:
    type: aws:kendra:DataSource
    properties:
      indexId: ${exampleAwsKendraIndex.id}
      name: example
      type: S3
      roleArn: ${exampleAwsIamRole.arn}
      schedule: cron(9 10 1 * ? *)
      configuration:
        s3Configuration:
          bucketName: ${exampleAwsS3Bucket.id}
The schedule property uses cron syntax to define when Kendra checks the S3 bucket for changes. The s3Configuration block specifies which bucket to index. The roleArn grants Kendra permission to read from S3 and write to the index.
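To make the cron syntax easier to read, the sketch below splits a schedule string into its six fields. It assumes the standard six-field AWS cron layout (minutes, hours, day of month, month, day of week, year); the describeSchedule helper is a hypothetical name introduced for illustration and does not validate the values inside each field.

```typescript
// Field names in the order used by six-field AWS cron expressions.
const CRON_FIELDS = ["minutes", "hours", "dayOfMonth", "month", "dayOfWeek", "year"];

// Splits "cron(a b c d e f)" into a field-name -> value map.
// Throws if the string is not wrapped in cron(...) or does not have six fields.
function describeSchedule(schedule: string): { [field: string]: string } {
    const match = /^cron\((.*)\)$/.exec(schedule.trim());
    if (match === null) {
        throw new Error("not a cron(...) expression: " + schedule);
    }
    const parts = match[1].trim().split(/\s+/);
    if (parts.length !== CRON_FIELDS.length) {
        throw new Error("expected 6 fields, got " + parts.length);
    }
    const result: { [field: string]: string } = {};
    CRON_FIELDS.forEach((name, i) => {
        result[name] = parts[i];
    });
    return result;
}

console.log(describeSchedule("cron(9 10 1 * ? *)"));
```

Read this way, the schedule used in the example has minute 9, hour 10, and day of month 1, so the sync runs at 10:09 on the first day of every month (assuming the schedule is evaluated in UTC, as AWS cron schedules generally are).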
Filter S3 documents with patterns and metadata
Large S3 buckets often contain mixed content where only specific files or directories should be indexed.
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
const example = new aws.kendra.DataSource("example", {
indexId: exampleAwsKendraIndex.id,
name: "example",
type: "S3",
roleArn: exampleAwsIamRole.arn,
configuration: {
s3Configuration: {
bucketName: exampleAwsS3Bucket.id,
exclusionPatterns: ["example"],
inclusionPatterns: ["hello"],
inclusionPrefixes: ["world"],
documentsMetadataConfiguration: {
s3Prefix: "example",
},
},
},
});
import pulumi
import pulumi_aws as aws
example = aws.kendra.DataSource("example",
index_id=example_aws_kendra_index["id"],
name="example",
type="S3",
role_arn=example_aws_iam_role["arn"],
configuration={
"s3_configuration": {
"bucket_name": example_aws_s3_bucket["id"],
"exclusion_patterns": ["example"],
"inclusion_patterns": ["hello"],
"inclusion_prefixes": ["world"],
"documents_metadata_configuration": {
"s3_prefix": "example",
},
},
})
package main
import (
"github.com/pulumi/pulumi-aws/sdk/v7/go/aws/kendra"
"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)
func main() {
pulumi.Run(func(ctx *pulumi.Context) error {
_, err := kendra.NewDataSource(ctx, "example", &kendra.DataSourceArgs{
IndexId: pulumi.Any(exampleAwsKendraIndex.Id),
Name: pulumi.String("example"),
Type: pulumi.String("S3"),
RoleArn: pulumi.Any(exampleAwsIamRole.Arn),
Configuration: &kendra.DataSourceConfigurationArgs{
S3Configuration: &kendra.DataSourceConfigurationS3ConfigurationArgs{
BucketName: pulumi.Any(exampleAwsS3Bucket.Id),
ExclusionPatterns: pulumi.StringArray{
pulumi.String("example"),
},
InclusionPatterns: pulumi.StringArray{
pulumi.String("hello"),
},
InclusionPrefixes: pulumi.StringArray{
pulumi.String("world"),
},
DocumentsMetadataConfiguration: &kendra.DataSourceConfigurationS3ConfigurationDocumentsMetadataConfigurationArgs{
S3Prefix: pulumi.String("example"),
},
},
},
})
if err != nil {
return err
}
return nil
})
}
using System.Collections.Generic;
using System.Linq;
using Pulumi;
using Aws = Pulumi.Aws;
return await Deployment.RunAsync(() =>
{
var example = new Aws.Kendra.DataSource("example", new()
{
IndexId = exampleAwsKendraIndex.Id,
Name = "example",
Type = "S3",
RoleArn = exampleAwsIamRole.Arn,
Configuration = new Aws.Kendra.Inputs.DataSourceConfigurationArgs
{
S3Configuration = new Aws.Kendra.Inputs.DataSourceConfigurationS3ConfigurationArgs
{
BucketName = exampleAwsS3Bucket.Id,
ExclusionPatterns = new[]
{
"example",
},
InclusionPatterns = new[]
{
"hello",
},
InclusionPrefixes = new[]
{
"world",
},
DocumentsMetadataConfiguration = new Aws.Kendra.Inputs.DataSourceConfigurationS3ConfigurationDocumentsMetadataConfigurationArgs
{
S3Prefix = "example",
},
},
},
});
});
package generated_program;
import com.pulumi.Context;
import com.pulumi.Pulumi;
import com.pulumi.core.Output;
import com.pulumi.aws.kendra.DataSource;
import com.pulumi.aws.kendra.DataSourceArgs;
import com.pulumi.aws.kendra.inputs.DataSourceConfigurationArgs;
import com.pulumi.aws.kendra.inputs.DataSourceConfigurationS3ConfigurationArgs;
import com.pulumi.aws.kendra.inputs.DataSourceConfigurationS3ConfigurationDocumentsMetadataConfigurationArgs;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
public class App {
public static void main(String[] args) {
Pulumi.run(App::stack);
}
public static void stack(Context ctx) {
var example = new DataSource("example", DataSourceArgs.builder()
.indexId(exampleAwsKendraIndex.id())
.name("example")
.type("S3")
.roleArn(exampleAwsIamRole.arn())
.configuration(DataSourceConfigurationArgs.builder()
.s3Configuration(DataSourceConfigurationS3ConfigurationArgs.builder()
.bucketName(exampleAwsS3Bucket.id())
.exclusionPatterns("example")
.inclusionPatterns("hello")
.inclusionPrefixes("world")
.documentsMetadataConfiguration(DataSourceConfigurationS3ConfigurationDocumentsMetadataConfigurationArgs.builder()
.s3Prefix("example")
.build())
.build())
.build())
.build());
}
}
resources:
  example:
    type: aws:kendra:DataSource
    properties:
      indexId: ${exampleAwsKendraIndex.id}
      name: example
      type: S3
      roleArn: ${exampleAwsIamRole.arn}
      configuration:
        s3Configuration:
          bucketName: ${exampleAwsS3Bucket.id}
          exclusionPatterns:
            - example
          inclusionPatterns:
            - hello
          inclusionPrefixes:
            - world
          documentsMetadataConfiguration:
            s3Prefix: example
The inclusionPatterns and exclusionPatterns properties use glob patterns to control which objects are indexed. The inclusionPrefixes property limits indexing to specific S3 prefixes. The documentsMetadataConfiguration block points to a location containing metadata files that enrich indexed documents.
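To preview which object keys a set of filters would select, the interaction of prefixes and patterns can be approximated locally. This is a rough sketch under stated assumptions, not Kendra's implementation: globToRegExp handles only a simplified glob dialect, the isIndexed helper and the sample keys are hypothetical, and it assumes that an exclusion match takes precedence over an inclusion match.

```typescript
// Converts a simplified glob to a RegExp: "**" may span "/" separators,
// "*" stays within one path segment, "?" matches one non-slash character.
function globToRegExp(glob: string): RegExp {
    let out = "";
    for (let i = 0; i < glob.length; i++) {
        const ch = glob.charAt(i);
        if (ch === "*") {
            if (glob.charAt(i + 1) === "*") {
                out += ".*";
                i++; // consume the second "*"
            } else {
                out += "[^/]*";
            }
        } else if (ch === "?") {
            out += "[^/]";
        } else {
            // Escape characters that are special in regular expressions.
            out += ch.replace(/[.+^${}()|[\]\\]/g, "\\$&");
        }
    }
    return new RegExp("^" + out + "$");
}

interface S3FilterSketch {
    inclusionPrefixes?: string[];
    inclusionPatterns?: string[];
    exclusionPatterns?: string[];
}

// Assumed order: prefix check, then exclusions (which win), then inclusions.
function isIndexed(key: string, filters: S3FilterSketch): boolean {
    const prefixes = filters.inclusionPrefixes || [];
    const includes = filters.inclusionPatterns || [];
    const excludes = filters.exclusionPatterns || [];
    if (prefixes.length > 0 && !prefixes.some((p) => key.indexOf(p) === 0)) {
        return false;
    }
    if (excludes.some((g) => globToRegExp(g).test(key))) {
        return false;
    }
    if (includes.length > 0) {
        return includes.some((g) => globToRegExp(g).test(key));
    }
    return true;
}

const filters: S3FilterSketch = {
    inclusionPrefixes: ["docs/"],
    inclusionPatterns: ["**/*.md"],
    exclusionPatterns: ["**/drafts/**"],
};
for (const key of ["docs/guide.md", "docs/drafts/wip.md", "images/logo.png"]) {
    console.log(key, isIndexed(key, filters));
}
```

With these hypothetical filters, docs/guide.md is selected, docs/drafts/wip.md is dropped by the exclusion pattern, and images/logo.png is dropped because it falls outside the docs/ prefix. Confirm the real behavior against a test sync before relying on a pattern.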
Crawl websites starting from seed URLs
Public documentation sites and knowledge bases can be indexed by providing starting URLs that Kendra crawls to discover linked pages.
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
const example = new aws.kendra.DataSource("example", {
indexId: exampleAwsKendraIndex.id,
name: "example",
type: "WEBCRAWLER",
roleArn: exampleAwsIamRole.arn,
configuration: {
webCrawlerConfiguration: {
urls: {
seedUrlConfiguration: {
seedUrls: ["REPLACE_WITH_YOUR_URL"],
},
},
},
},
});
import pulumi
import pulumi_aws as aws
example = aws.kendra.DataSource("example",
index_id=example_aws_kendra_index["id"],
name="example",
type="WEBCRAWLER",
role_arn=example_aws_iam_role["arn"],
configuration={
"web_crawler_configuration": {
"urls": {
"seed_url_configuration": {
"seed_urls": ["REPLACE_WITH_YOUR_URL"],
},
},
},
})
package main
import (
"github.com/pulumi/pulumi-aws/sdk/v7/go/aws/kendra"
"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)
func main() {
pulumi.Run(func(ctx *pulumi.Context) error {
_, err := kendra.NewDataSource(ctx, "example", &kendra.DataSourceArgs{
IndexId: pulumi.Any(exampleAwsKendraIndex.Id),
Name: pulumi.String("example"),
Type: pulumi.String("WEBCRAWLER"),
RoleArn: pulumi.Any(exampleAwsIamRole.Arn),
Configuration: &kendra.DataSourceConfigurationArgs{
WebCrawlerConfiguration: &kendra.DataSourceConfigurationWebCrawlerConfigurationArgs{
Urls: &kendra.DataSourceConfigurationWebCrawlerConfigurationUrlsArgs{
SeedUrlConfiguration: &kendra.DataSourceConfigurationWebCrawlerConfigurationUrlsSeedUrlConfigurationArgs{
SeedUrls: pulumi.StringArray{
pulumi.String("REPLACE_WITH_YOUR_URL"),
},
},
},
},
},
})
if err != nil {
return err
}
return nil
})
}
using System.Collections.Generic;
using System.Linq;
using Pulumi;
using Aws = Pulumi.Aws;
return await Deployment.RunAsync(() =>
{
var example = new Aws.Kendra.DataSource("example", new()
{
IndexId = exampleAwsKendraIndex.Id,
Name = "example",
Type = "WEBCRAWLER",
RoleArn = exampleAwsIamRole.Arn,
Configuration = new Aws.Kendra.Inputs.DataSourceConfigurationArgs
{
WebCrawlerConfiguration = new Aws.Kendra.Inputs.DataSourceConfigurationWebCrawlerConfigurationArgs
{
Urls = new Aws.Kendra.Inputs.DataSourceConfigurationWebCrawlerConfigurationUrlsArgs
{
SeedUrlConfiguration = new Aws.Kendra.Inputs.DataSourceConfigurationWebCrawlerConfigurationUrlsSeedUrlConfigurationArgs
{
SeedUrls = new[]
{
"REPLACE_WITH_YOUR_URL",
},
},
},
},
},
});
});
package generated_program;
import com.pulumi.Context;
import com.pulumi.Pulumi;
import com.pulumi.core.Output;
import com.pulumi.aws.kendra.DataSource;
import com.pulumi.aws.kendra.DataSourceArgs;
import com.pulumi.aws.kendra.inputs.DataSourceConfigurationArgs;
import com.pulumi.aws.kendra.inputs.DataSourceConfigurationWebCrawlerConfigurationArgs;
import com.pulumi.aws.kendra.inputs.DataSourceConfigurationWebCrawlerConfigurationUrlsArgs;
import com.pulumi.aws.kendra.inputs.DataSourceConfigurationWebCrawlerConfigurationUrlsSeedUrlConfigurationArgs;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
public class App {
public static void main(String[] args) {
Pulumi.run(App::stack);
}
public static void stack(Context ctx) {
var example = new DataSource("example", DataSourceArgs.builder()
.indexId(exampleAwsKendraIndex.id())
.name("example")
.type("WEBCRAWLER")
.roleArn(exampleAwsIamRole.arn())
.configuration(DataSourceConfigurationArgs.builder()
.webCrawlerConfiguration(DataSourceConfigurationWebCrawlerConfigurationArgs.builder()
.urls(DataSourceConfigurationWebCrawlerConfigurationUrlsArgs.builder()
.seedUrlConfiguration(DataSourceConfigurationWebCrawlerConfigurationUrlsSeedUrlConfigurationArgs.builder()
.seedUrls("REPLACE_WITH_YOUR_URL")
.build())
.build())
.build())
.build())
.build());
}
}
resources:
  example:
    type: aws:kendra:DataSource
    properties:
      indexId: ${exampleAwsKendraIndex.id}
      name: example
      type: WEBCRAWLER
      roleArn: ${exampleAwsIamRole.arn}
      configuration:
        webCrawlerConfiguration:
          urls:
            seedUrlConfiguration:
              seedUrls:
                - REPLACE_WITH_YOUR_URL
The webCrawlerConfiguration block defines how Kendra crawls websites. The seedUrlConfiguration provides starting URLs; Kendra follows links from these pages to discover additional content. You must replace the placeholder URL with your actual website.
Control web crawler depth and scope
Deep website hierarchies can generate excessive crawl volume; limiting depth helps focus on primary content.
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
const example = new aws.kendra.DataSource("example", {
indexId: exampleAwsKendraIndex.id,
name: "example",
type: "WEBCRAWLER",
roleArn: exampleAwsIamRole.arn,
configuration: {
webCrawlerConfiguration: {
crawlDepth: 3,
urls: {
seedUrlConfiguration: {
seedUrls: ["REPLACE_WITH_YOUR_URL"],
},
},
},
},
});
import pulumi
import pulumi_aws as aws
example = aws.kendra.DataSource("example",
index_id=example_aws_kendra_index["id"],
name="example",
type="WEBCRAWLER",
role_arn=example_aws_iam_role["arn"],
configuration={
"web_crawler_configuration": {
"crawl_depth": 3,
"urls": {
"seed_url_configuration": {
"seed_urls": ["REPLACE_WITH_YOUR_URL"],
},
},
},
})
package main
import (
"github.com/pulumi/pulumi-aws/sdk/v7/go/aws/kendra"
"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)
func main() {
pulumi.Run(func(ctx *pulumi.Context) error {
_, err := kendra.NewDataSource(ctx, "example", &kendra.DataSourceArgs{
IndexId: pulumi.Any(exampleAwsKendraIndex.Id),
Name: pulumi.String("example"),
Type: pulumi.String("WEBCRAWLER"),
RoleArn: pulumi.Any(exampleAwsIamRole.Arn),
Configuration: &kendra.DataSourceConfigurationArgs{
WebCrawlerConfiguration: &kendra.DataSourceConfigurationWebCrawlerConfigurationArgs{
CrawlDepth: pulumi.Int(3),
Urls: &kendra.DataSourceConfigurationWebCrawlerConfigurationUrlsArgs{
SeedUrlConfiguration: &kendra.DataSourceConfigurationWebCrawlerConfigurationUrlsSeedUrlConfigurationArgs{
SeedUrls: pulumi.StringArray{
pulumi.String("REPLACE_WITH_YOUR_URL"),
},
},
},
},
},
})
if err != nil {
return err
}
return nil
})
}
using System.Collections.Generic;
using System.Linq;
using Pulumi;
using Aws = Pulumi.Aws;
return await Deployment.RunAsync(() =>
{
var example = new Aws.Kendra.DataSource("example", new()
{
IndexId = exampleAwsKendraIndex.Id,
Name = "example",
Type = "WEBCRAWLER",
RoleArn = exampleAwsIamRole.Arn,
Configuration = new Aws.Kendra.Inputs.DataSourceConfigurationArgs
{
WebCrawlerConfiguration = new Aws.Kendra.Inputs.DataSourceConfigurationWebCrawlerConfigurationArgs
{
CrawlDepth = 3,
Urls = new Aws.Kendra.Inputs.DataSourceConfigurationWebCrawlerConfigurationUrlsArgs
{
SeedUrlConfiguration = new Aws.Kendra.Inputs.DataSourceConfigurationWebCrawlerConfigurationUrlsSeedUrlConfigurationArgs
{
SeedUrls = new[]
{
"REPLACE_WITH_YOUR_URL",
},
},
},
},
},
});
});
package generated_program;
import com.pulumi.Context;
import com.pulumi.Pulumi;
import com.pulumi.core.Output;
import com.pulumi.aws.kendra.DataSource;
import com.pulumi.aws.kendra.DataSourceArgs;
import com.pulumi.aws.kendra.inputs.DataSourceConfigurationArgs;
import com.pulumi.aws.kendra.inputs.DataSourceConfigurationWebCrawlerConfigurationArgs;
import com.pulumi.aws.kendra.inputs.DataSourceConfigurationWebCrawlerConfigurationUrlsArgs;
import com.pulumi.aws.kendra.inputs.DataSourceConfigurationWebCrawlerConfigurationUrlsSeedUrlConfigurationArgs;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
public class App {
public static void main(String[] args) {
Pulumi.run(App::stack);
}
public static void stack(Context ctx) {
var example = new DataSource("example", DataSourceArgs.builder()
.indexId(exampleAwsKendraIndex.id())
.name("example")
.type("WEBCRAWLER")
.roleArn(exampleAwsIamRole.arn())
.configuration(DataSourceConfigurationArgs.builder()
.webCrawlerConfiguration(DataSourceConfigurationWebCrawlerConfigurationArgs.builder()
.crawlDepth(3)
.urls(DataSourceConfigurationWebCrawlerConfigurationUrlsArgs.builder()
.seedUrlConfiguration(DataSourceConfigurationWebCrawlerConfigurationUrlsSeedUrlConfigurationArgs.builder()
.seedUrls("REPLACE_WITH_YOUR_URL")
.build())
.build())
.build())
.build())
.build());
}
}
resources:
  example:
    type: aws:kendra:DataSource
    properties:
      indexId: ${exampleAwsKendraIndex.id}
      name: example
      type: WEBCRAWLER
      roleArn: ${exampleAwsIamRole.arn}
      configuration:
        webCrawlerConfiguration:
          crawlDepth: 3
          urls:
            seedUrlConfiguration:
              seedUrls:
                - REPLACE_WITH_YOUR_URL
The crawlDepth property limits how many link levels the crawler follows from seed URLs. A depth of 3 means Kendra crawls the seed URL, pages linked from it, pages linked from those, and one more level. This prevents unbounded crawling of large sites.
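The effect of a depth limit can be illustrated with a breadth-first walk over a small, made-up link graph. This is a sketch of the idea rather than the crawler itself: the LinkGraph type, the crawlLevels helper, and the page names are hypothetical, and the seed is treated as level 0 to match the description above.

```typescript
// Maps each page to the pages it links to (hypothetical sample data below).
type LinkGraph = { [url: string]: string[] };

// Breadth-first walk from the seeds, stopping after crawlDepth link levels.
// Returns every reached page with the level at which it was first found.
function crawlLevels(graph: LinkGraph, seedUrls: string[], crawlDepth: number): { [url: string]: number } {
    const levels: { [url: string]: number } = {};
    let frontier: string[] = [];
    for (const seed of seedUrls) {
        if (!(seed in levels)) {
            levels[seed] = 0; // seeds sit at level 0
            frontier.push(seed);
        }
    }
    for (let depth = 1; depth <= crawlDepth && frontier.length > 0; depth++) {
        const next: string[] = [];
        for (const page of frontier) {
            for (const link of graph[page] || []) {
                if (!(link in levels)) {
                    levels[link] = depth;
                    next.push(link);
                }
            }
        }
        frontier = next;
    }
    return levels;
}

// A chain of pages: seed -> level1 -> level2 -> level3 -> level4.
const site: LinkGraph = {
    seed: ["level1"],
    level1: ["level2"],
    level2: ["level3"],
    level3: ["level4"],
};
console.log(crawlLevels(site, ["seed"], 3));
```

With a depth of 3 the walk reaches the seed and three further levels, and level4 is never visited. Lowering the depth shrinks the result in the same way it would shrink the set of crawled pages.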
Filter crawled URLs with pattern matching
Websites often contain sections that shouldn’t be indexed, such as login pages or administrative interfaces.
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
const example = new aws.kendra.DataSource("example", {
indexId: exampleAwsKendraIndex.id,
name: "example",
type: "WEBCRAWLER",
roleArn: exampleAwsIamRole.arn,
configuration: {
webCrawlerConfiguration: {
urlExclusionPatterns: ["example"],
urlInclusionPatterns: ["hello"],
urls: {
seedUrlConfiguration: {
seedUrls: ["REPLACE_WITH_YOUR_URL"],
},
},
},
},
});
import pulumi
import pulumi_aws as aws
example = aws.kendra.DataSource("example",
index_id=example_aws_kendra_index["id"],
name="example",
type="WEBCRAWLER",
role_arn=example_aws_iam_role["arn"],
configuration={
"web_crawler_configuration": {
"url_exclusion_patterns": ["example"],
"url_inclusion_patterns": ["hello"],
"urls": {
"seed_url_configuration": {
"seed_urls": ["REPLACE_WITH_YOUR_URL"],
},
},
},
})
package main
import (
"github.com/pulumi/pulumi-aws/sdk/v7/go/aws/kendra"
"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)
func main() {
pulumi.Run(func(ctx *pulumi.Context) error {
_, err := kendra.NewDataSource(ctx, "example", &kendra.DataSourceArgs{
IndexId: pulumi.Any(exampleAwsKendraIndex.Id),
Name: pulumi.String("example"),
Type: pulumi.String("WEBCRAWLER"),
RoleArn: pulumi.Any(exampleAwsIamRole.Arn),
Configuration: &kendra.DataSourceConfigurationArgs{
WebCrawlerConfiguration: &kendra.DataSourceConfigurationWebCrawlerConfigurationArgs{
UrlExclusionPatterns: pulumi.StringArray{
pulumi.String("example"),
},
UrlInclusionPatterns: pulumi.StringArray{
pulumi.String("hello"),
},
Urls: &kendra.DataSourceConfigurationWebCrawlerConfigurationUrlsArgs{
SeedUrlConfiguration: &kendra.DataSourceConfigurationWebCrawlerConfigurationUrlsSeedUrlConfigurationArgs{
SeedUrls: pulumi.StringArray{
pulumi.String("REPLACE_WITH_YOUR_URL"),
},
},
},
},
},
})
if err != nil {
return err
}
return nil
})
}
using System.Collections.Generic;
using System.Linq;
using Pulumi;
using Aws = Pulumi.Aws;
return await Deployment.RunAsync(() =>
{
var example = new Aws.Kendra.DataSource("example", new()
{
IndexId = exampleAwsKendraIndex.Id,
Name = "example",
Type = "WEBCRAWLER",
RoleArn = exampleAwsIamRole.Arn,
Configuration = new Aws.Kendra.Inputs.DataSourceConfigurationArgs
{
WebCrawlerConfiguration = new Aws.Kendra.Inputs.DataSourceConfigurationWebCrawlerConfigurationArgs
{
UrlExclusionPatterns = new[]
{
"example",
},
UrlInclusionPatterns = new[]
{
"hello",
},
Urls = new Aws.Kendra.Inputs.DataSourceConfigurationWebCrawlerConfigurationUrlsArgs
{
SeedUrlConfiguration = new Aws.Kendra.Inputs.DataSourceConfigurationWebCrawlerConfigurationUrlsSeedUrlConfigurationArgs
{
SeedUrls = new[]
{
"REPLACE_WITH_YOUR_URL",
},
},
},
},
},
});
});
package generated_program;
import com.pulumi.Context;
import com.pulumi.Pulumi;
import com.pulumi.core.Output;
import com.pulumi.aws.kendra.DataSource;
import com.pulumi.aws.kendra.DataSourceArgs;
import com.pulumi.aws.kendra.inputs.DataSourceConfigurationArgs;
import com.pulumi.aws.kendra.inputs.DataSourceConfigurationWebCrawlerConfigurationArgs;
import com.pulumi.aws.kendra.inputs.DataSourceConfigurationWebCrawlerConfigurationUrlsArgs;
import com.pulumi.aws.kendra.inputs.DataSourceConfigurationWebCrawlerConfigurationUrlsSeedUrlConfigurationArgs;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;
public class App {
public static void main(String[] args) {
Pulumi.run(App::stack);
}
public static void stack(Context ctx) {
var example = new DataSource("example", DataSourceArgs.builder()
.indexId(exampleAwsKendraIndex.id())
.name("example")
.type("WEBCRAWLER")
.roleArn(exampleAwsIamRole.arn())
.configuration(DataSourceConfigurationArgs.builder()
.webCrawlerConfiguration(DataSourceConfigurationWebCrawlerConfigurationArgs.builder()
.urlExclusionPatterns("example")
.urlInclusionPatterns("hello")
.urls(DataSourceConfigurationWebCrawlerConfigurationUrlsArgs.builder()
.seedUrlConfiguration(DataSourceConfigurationWebCrawlerConfigurationUrlsSeedUrlConfigurationArgs.builder()
.seedUrls("REPLACE_WITH_YOUR_URL")
.build())
.build())
.build())
.build())
.build());
}
}
resources:
  example:
    type: aws:kendra:DataSource
    properties:
      indexId: ${exampleAwsKendraIndex.id}
      name: example
      type: WEBCRAWLER
      roleArn: ${exampleAwsIamRole.arn}
      configuration:
        webCrawlerConfiguration:
          urlExclusionPatterns:
            - example
          urlInclusionPatterns:
            - hello
          urls:
            seedUrlConfiguration:
              seedUrls:
                - REPLACE_WITH_YOUR_URL
The urlInclusionPatterns and urlExclusionPatterns properties use regex patterns to filter which URLs are indexed. Inclusion patterns define allowed URLs; exclusion patterns define blocked URLs. Kendra evaluates both when deciding whether to index a page.
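Candidate patterns can be tried against sample URLs before a crawl. The sketch below is an approximation under stated assumptions: shouldIndexUrl is a hypothetical helper, the URLs and patterns are made up, an exclusion match is assumed to win over an inclusion match, and JavaScript's regex dialect may differ in details from the one Kendra uses.

```typescript
// Decides whether a URL passes the filters.
// Assumptions: exclusions take precedence, and an empty inclusion list allows every URL.
function shouldIndexUrl(url: string, inclusionPatterns: string[], exclusionPatterns: string[]): boolean {
    const matchesAny = (patterns: string[]): boolean =>
        patterns.some((pattern) => new RegExp(pattern).test(url));
    if (matchesAny(exclusionPatterns)) {
        return false;
    }
    if (inclusionPatterns.length > 0) {
        return matchesAny(inclusionPatterns);
    }
    return true;
}

// Hypothetical patterns: keep documentation pages, skip login and admin sections.
const include = [".*/docs/.*"];
const exclude = [".*/login.*", ".*/admin/.*"];

for (const url of [
    "https://example.com/docs/intro",
    "https://example.com/docs/admin/users",
    "https://example.com/blog/post",
]) {
    console.log(url, shouldIndexUrl(url, include, exclude));
}
```

Here the docs page passes, the admin page is rejected even though it also matches the inclusion pattern, and the blog post is rejected because it matches no inclusion pattern.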
Beyond these examples
These snippets focus on specific data source features: S3 and web crawler connectors, scheduled synchronization, and URL and document filtering. They’re intentionally minimal rather than full search implementations.
The examples reference pre-existing infrastructure such as Kendra indexes, IAM roles with data source permissions, and S3 buckets for S3 connector examples. They focus on configuring the data source rather than provisioning the surrounding infrastructure.
To keep things focused, common data source patterns are omitted, including:
- Access control lists for document-level permissions
- Authentication for protected websites (basicAuthentications)
- Proxy configuration for network restrictions
- Custom document enrichment during ingestion
- Site maps as an alternative to seed URLs
- Template-based connectors (WEBCRAWLERV2)
These omissions are intentional: the goal is to illustrate how each data source feature is wired, not provide drop-in search modules. See the Kendra DataSource resource reference for all available configuration options.
Frequently Asked Questions
Configuration & Requirements
- The indexId and type properties are immutable and cannot be changed after creation. Modifying these requires replacing the resource.
- When type is set to CUSTOM, you cannot specify roleArn or configuration. For all other data source types, roleArn is required.
- Valid type values include S3, WEBCRAWLER, TEMPLATE, and CUSTOM. For a complete list, see the AWS Kendra documentation on valid Type values.
Scheduling & Synchronization
- Set the schedule property with a cron expression like cron(9 10 1 * ? *). Without a schedule, you must manually trigger syncs using the StartDataSourceSyncJob API.
S3 Data Sources
- Set type to S3, provide a roleArn with S3 access permissions, and configure s3Configuration with the bucketName.
- Use inclusionPatterns, exclusionPatterns, and inclusionPrefixes within s3Configuration to control which documents are indexed.
- Configure accessControlListConfiguration with a keyPath pointing to your ACL file in S3 (e.g., s3://bucket-name/path).
Web Crawler Data Sources
- Set type to WEBCRAWLER, provide a roleArn, and configure webCrawlerConfiguration with either seedUrlConfiguration (seed URLs) or siteMapsConfiguration (site maps).
- Configure authenticationConfiguration with basicAuthentications, providing credentials (Secrets Manager ARN), host, and port. Use dependsOn to ensure the secret version exists first.
- Use crawlDepth to limit how deep the crawler follows links, maxLinksPerPage to limit links per page, and maxUrlsPerMinuteCrawlRate to control the crawl speed.
- Use urlInclusionPatterns to specify URLs to crawl and urlExclusionPatterns to exclude specific URLs from crawling.