Cosmos DB for Storing NoSQL AI Training Data
To store NoSQL AI training data, Azure Cosmos DB is a strong choice due to its global distribution, multi-model support, and elastic scaling of throughput and storage. Azure Cosmos DB supports several NoSQL models through its MongoDB, Cassandra, Gremlin (graph), and SQL (document) APIs, letting you choose the model that best fits your specific use case.
We will focus on creating an instance of Azure Cosmos DB using the SQL API because it's common for AI and machine learning workloads, and it allows you to store schema-less JSON data.
In this example, we'll see how to:
- Set up an Azure Cosmos DB account.
- Create a SQL database within the Cosmos DB account.
- Create a container (like a table in SQL databases) for your JSON documents.
Azure resources are named and can be grouped, so we will first create a resource group. Then we use that resource group and other resource names to create the Azure Cosmos DB account and the SQL database.
Here's the Pulumi program that creates these resources:
```python
import pulumi
import pulumi_azure_native as azure_native

# Create an Azure Resource Group to hold all of the resources below.
resource_group = azure_native.resources.ResourceGroup("ai_training_data_rg")

# Create an Azure Cosmos DB account in serverless mode.
cosmosdb_account = azure_native.documentdb.DatabaseAccount(
    "ai_training_data_account",
    resource_group_name=resource_group.name,
    database_account_offer_type=azure_native.documentdb.DatabaseAccountOfferType.STANDARD,
    locations=[{
        "location_name": resource_group.location,
        "failover_priority": 0,
    }],
    capabilities=[{
        "name": "EnableServerless",
    }],
)

# Create a SQL (Core) API database within the Cosmos DB account.
sql_database = azure_native.documentdb.SqlDatabase(
    "ai_training_sql_db",
    resource_group_name=resource_group.name,
    account_name=cosmosdb_account.name,
    resource={
        "id": "ai_training_database",
    },
)

# Create a container (like a table in SQL databases) for the JSON documents.
# A partition key is required; here we partition on /id. No throughput is
# provisioned because the account is serverless.
container = azure_native.documentdb.SqlContainer(
    "ai_training_container",
    resource_group_name=resource_group.name,
    account_name=cosmosdb_account.name,
    database_name=sql_database.name,
    resource={
        "id": "training_data_container",
        "partition_key": {
            "paths": ["/id"],
            "kind": "Hash",
        },
    },
)

# Look up the account's connection strings. They are not exposed as outputs
# on the DatabaseAccount resource itself, so we use the
# listDatabaseAccountConnectionStrings invoke instead.
connection_strings = azure_native.documentdb.list_database_account_connection_strings_output(
    resource_group_name=resource_group.name,
    account_name=cosmosdb_account.name,
)
primary_connection_string = connection_strings.apply(
    lambda cs: cs.connection_strings[0].connection_string
)

# Export the values the application needs; the connection string is marked
# as a secret so Pulumi encrypts it in the stack's state.
pulumi.export("primary_connection_string", pulumi.Output.secret(primary_connection_string))
pulumi.export("cosmosdb_account_endpoint", cosmosdb_account.document_endpoint)
```
Let's review what each part of the code is doing:
- We begin by importing the necessary Pulumi packages for deploying Azure resources.
- We create an Azure Resource Group, which acts as a logical container for our Azure resources.
- The `DatabaseAccount` resource defines the Cosmos DB account. We configure it with the `STANDARD` offer type and enable the serverless capability, which can be cost-effective for fluctuating workloads like AI training.
- Within this Cosmos DB account, we create a `SqlDatabase`, which is essentially a namespace for our data.
- We then define a `SqlContainer` within our database. Containers are used to manage and query your data, and every container requires a partition key; here we partition on `/id`. Because the account is serverless, throughput is billed per request and we do not provision RU/s on the container. On a provisioned-throughput account you could start with a modest level such as 400 RU/s and adjust it based on your expected workload (see the sketch after this list).
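For reference, here is a minimal sketch of the same container on a provisioned-throughput account (one created without the `EnableServerless` capability). The 400 RU/s figure and the autoscale alternative in the comment are illustrative values, not requirements:

```python
# Variant for a provisioned-throughput account (no "EnableServerless"
# capability on the DatabaseAccount).
container = azure_native.documentdb.SqlContainer(
    "ai_training_container",
    resource_group_name=resource_group.name,
    account_name=cosmosdb_account.name,
    database_name=sql_database.name,
    resource={
        "id": "training_data_container",
        "partition_key": {
            "paths": ["/id"],
            "kind": "Hash",
        },
    },
    # Manual throughput of 400 RU/s; adjust to your workload, or use
    # autoscale instead:
    # options={"autoscale_settings": {"max_throughput": 4000}},
    options={"throughput": 400},
)
```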
Lastly, we export two values:
- `primary_connection_string`: the connection string your application will use to communicate with Cosmos DB. It is marked as a Pulumi secret, so retrieve it with `pulumi stack output primary_connection_string --show-secrets`.
- `cosmosdb_account_endpoint`: the endpoint URL of the Cosmos DB account.
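If another Pulumi program needs these exported values (for example, the stack that deploys your training service), a `StackReference` can read them. The fully qualified stack name below is a hypothetical placeholder:

```python
import pulumi

# Hypothetical "<org>/<project>/<stack>" name; replace it with the stack
# that exported the values above.
infra = pulumi.StackReference("my-org/ai-training-infra/dev")

primary_connection_string = infra.get_output("primary_connection_string")
cosmosdb_account_endpoint = infra.get_output("cosmosdb_account_endpoint")
```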
To deploy this infrastructure:
- Ensure you have the Pulumi CLI installed and Azure credentials configured with the necessary permissions.
- Save this code in the `__main__.py` file of a Pulumi Python project.
- Run `pulumi up` and follow the on-screen prompts to deploy the resources.
After deploying, use the connection string and endpoint in your AI applications to store and retrieve your training data from Cosmos DB, as sketched below.
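As a minimal sketch of that last step, the following uses the `azure-cosmos` Python SDK (`pip install azure-cosmos`). The environment variable name, dataset name, and document fields are illustrative assumptions; the database and container IDs match the Pulumi program above:

```python
import os
import uuid
from azure.cosmos import CosmosClient

# Assumes the connection string was retrieved, e.g. via
# `pulumi stack output primary_connection_string --show-secrets`,
# and placed in the COSMOS_CONNECTION_STRING environment variable.
client = CosmosClient.from_connection_string(os.environ["COSMOS_CONNECTION_STRING"])

# These IDs must match the resources created by the Pulumi program above.
database = client.get_database_client("ai_training_database")
container = database.get_container_client("training_data_container")

# Store a hypothetical training example as a schema-less JSON document.
# "id" doubles as the partition key, matching the /id partition key path.
container.create_item({
    "id": str(uuid.uuid4()),
    "dataset": "sentiment-v1",
    "text": "The model deployment went smoothly.",
    "label": "positive",
})

# Query documents back for training.
items = container.query_items(
    query="SELECT * FROM c WHERE c.dataset = @dataset",
    parameters=[{"name": "@dataset", "value": "sentiment-v1"}],
    enable_cross_partition_query=True,
)
for item in items:
    print(item["text"], item["label"])
```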