How do I define an AWS Glue Catalog Table in code using TypeScript?

Question

Pulumi · Accepted Answer

To define an AWS Glue Catalog Table using Pulumi and TypeScript, you'll need to use the aws.glue.CatalogTable resource from the AWS Pulumi package. This resource allows you to create and manage a table within the AWS Glue Data Catalog, which is a centralized metadata repository for your data lakes and analytics services.

The example below creates a simple AWS Glue Catalog Table. The table lives inside a Glue database, and its schema is defined by a list of columns and their data types. The example also sets some optional properties, such as the table type, classification parameters, and the serialization library (SerDe) used to read the data stored in S3.

Here is how you could define an AWS Glue Catalog Table within a Pulumi program:

import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// First, create the Glue Catalog Database that will contain the table.
const catalogDatabase = new aws.glue.CatalogDatabase("my-catalog-database", {
    name: "my_database", // The name of the database
});

// Then, define the AWS Glue Catalog Table.
const catalogTable = new aws.glue.CatalogTable("my-catalog-table", {
    name: "my_table",  // The name of the table
    databaseName: catalogDatabase.name, // Reference the database by its name
    storageDescriptor: {
        columns: [
            {
                name: "username", // Column name
                type: "string"   // Column data type
            },
            {
                name: "timestamp",
                type: "timestamp"
            },
            {
                name: "info",
                type: "struct<order_id:string,price:double,items:array<string>>"
            }
        ],
        location: "s3://my-data-bucket/prefix/", // The S3 path for the table data
        inputFormat: "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        outputFormat: "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        serDeInfo: {
            // Use the Parquet SerDe so it matches the Parquet input/output formats above
            serializationLibrary: "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
        }
    },
    tableType: "EXTERNAL_TABLE", // The data lives outside the catalog (here, in S3); Glue stores only the metadata
    parameters: {
        "classification": "parquet",
        "compressionType": "none"
    },
});

// Export the name of the database and the table name
export const databaseName = catalogDatabase.name;
export const tableName = catalogTable.name;

In this example:

We import the necessary Pulumi libraries for AWS and general Pulumi programming.
We create an AWS Glue CatalogDatabase as a prerequisite to the table because all Glue tables must reside within a database.
We define the CatalogTable resource with aws.glue.CatalogTable, specifying the name, databaseName, storageDescriptor, and other properties.
The storageDescriptor includes a list of columns which represents the schema of the table. It also includes information about the data location (location) and the SerDe (serDeInfo) that specifies how data is serialized and deserialized.
We set the tableType to EXTERNAL_TABLE, indicating that the catalog holds only metadata while the data itself lives in an external location, in this case Amazon S3.
Finally, we export the database and table names for later use.
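
Nested column types such as the struct used for the info column are written as plain strings, so a typo in the angle brackets is easy to make. As a minimal sketch, the helper below builds that type string from a nested object. The HiveType type and the hiveType function are hypothetical names introduced here for illustration; they are not part of @pulumi/aws.

```typescript
// Hypothetical helper types for illustration; not part of @pulumi/aws.
type HiveType =
    | string                                      // primitive, e.g. "string" or "double"
    | { array: HiveType }                         // rendered as array<...>
    | { struct: { [field: string]: HiveType } };  // rendered as struct<name:type,...>

// Recursively render a HiveType as the type string used in a column definition.
function hiveType(t: HiveType): string {
    if (typeof t === "string") {
        return t;
    }
    if ("array" in t) {
        return `array<${hiveType(t.array)}>`;
    }
    const fields = Object.entries(t.struct)
        .map(([name, fieldType]) => `${name}:${hiveType(fieldType)}`)
        .join(",");
    return `struct<${fields}>`;
}

// Rebuilds the type string of the "info" column from the example.
const infoType = hiveType({
    struct: {
        order_id: "string",
        price: "double",
        items: { array: "string" },
    },
});
```

The resulting infoType string could then be passed as the type of the info column in storageDescriptor.columns.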

Remember to replace the bucket name and other details with values from your own AWS environment. After you deploy this program with pulumi up, the Glue Catalog Database and Table are created in your AWS account as defined above.
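
One way to avoid hard-coding the S3 path is to build the location from a bucket name and prefix that you supply yourself, for example from Pulumi stack configuration. The sketch below assumes a hypothetical helper named tableLocation; it only normalizes slashes and does not check that the bucket exists.

```typescript
// Hypothetical helper for illustration: build the S3 URI for a table's data.
function tableLocation(bucket: string, prefix: string): string {
    // Strip leading and trailing slashes so "prefix", "/prefix" and "prefix/" all work.
    const cleanPrefix = prefix.replace(/^\/+|\/+$/g, "");
    return cleanPrefix === ""
        ? `s3://${bucket}/`
        : `s3://${bucket}/${cleanPrefix}/`;
}

// Same value as the hard-coded location in the example.
const exampleLocation = tableLocation("my-data-bucket", "prefix");
```

The returned string could be used as the location value in storageDescriptor, with the bucket name coming from wherever you keep environment-specific settings.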

For further learning, you can visit the AWS Glue CatalogTable documentation.