1. Multi-NIC Virtual Machines for Distributed AI Training


    Multi-NIC (Network Interface Controller) virtual machines (VMs) are used where high network throughput or traffic segregation is needed. In distributed AI training, a multi-NIC VM lets data transfer and model-training traffic travel over separate network interfaces, which can reduce contention between traffic types and, depending on the VM size and the provider's limits, improve overall training performance.

    We're going to create a virtual machine configured with multiple network interfaces using cloud resources. Azure and AWS are two of the most popular cloud providers that support this kind of infrastructure. I'll walk you through the setup on Azure using Pulumi, a modern infrastructure-as-code platform.

    You'll need an Azure account with permissions to manage network interfaces, virtual machines, and related resources. I'll assume you also have the Pulumi CLI set up and a Pulumi project initialized for use with Azure.

    We'll use two main resources from azure-native:

    1. NetworkInterface: Represents the network interfaces that we will attach to the virtual machine.
    2. VirtualMachine: Represents the virtual machine to which we will attach the network interfaces.

    To create a multi-NIC VM in Azure using Pulumi, you will need to do the following:

    • Define a Virtual Network and Subnets where the VM will reside.
    • Create the Network Interfaces (NICs) that will be associated with the VM.
    • Use the Azure Virtual Machine resource to create the VM and attach the NICs.
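
    For the first step, the VNet and subnet address ranges have to be chosen so that every subnet falls inside the VNet and no two subnets overlap. The following is a minimal sketch using Python's standard ipaddress module. The plan_subnets helper and the 10.0.0.0/16 range are hypothetical and chosen only for illustration; substitute the ranges that fit your own network.

```python
import ipaddress
from itertools import islice


def plan_subnets(vnet_cidr, count, new_prefix):
    """Return the first `count` subnets of size /new_prefix inside vnet_cidr."""
    vnet = ipaddress.ip_network(vnet_cidr)
    # subnets() yields non-overlapping blocks in ascending address order.
    blocks = list(islice(vnet.subnets(new_prefix=new_prefix), count))
    if len(blocks) < count:
        raise ValueError("VNet is too small for the requested subnets")
    return [str(block) for block in blocks]


# Hypothetical address plan: one /24 subnet per NIC inside a /16 VNet.
subnet_cidrs = plan_subnets("10.0.0.0/16", count=2, new_prefix=24)
print(subnet_cidrs)  # ['10.0.0.0/24', '10.0.1.0/24']
```

    The resulting strings are the kind of values expected by the address_prefixes and address_prefix arguments of the virtual network and subnet resources.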

    Here is a program that creates a VM with multiple NICs:

```python
import pulumi
from pulumi_azure_native import compute, network, resources

# Define a resource group
resource_group = resources.ResourceGroup("resource_group")

# Create a virtual network
vnet = network.VirtualNetwork(
    "vnet",
    resource_group_name=resource_group.name,
    address_space=network.AddressSpaceArgs(
        # Empty in the original; set the VNet CIDR block before deploying.
        address_prefixes=[""],
    ),
)

# Create two subnets in the virtual network
subnet1 = network.Subnet(
    "subnet1",
    resource_group_name=resource_group.name,
    # Empty in the original; set a CIDR block inside the VNet range.
    address_prefix="",
    virtual_network_name=vnet.name,
)
subnet2 = network.Subnet(
    "subnet2",
    resource_group_name=resource_group.name,
    # Empty in the original; set a second, non-overlapping CIDR block.
    address_prefix="",
    virtual_network_name=vnet.name,
)

# Create two network interfaces and associate them with their respective subnets
nic1 = network.NetworkInterface(
    "nic1",
    resource_group_name=resource_group.name,
    ip_configurations=[network.NetworkInterfaceIPConfigurationArgs(
        name="ipconfig1",
        subnet=network.SubnetArgs(id=subnet1.id),
        private_ip_allocation_method="Dynamic",
    )],
)
nic2 = network.NetworkInterface(
    "nic2",
    resource_group_name=resource_group.name,
    ip_configurations=[network.NetworkInterfaceIPConfigurationArgs(
        name="ipconfig2",
        subnet=network.SubnetArgs(id=subnet2.id),
        private_ip_allocation_method="Dynamic",
    )],
)

# Create a virtual machine with the two network interfaces
vm = compute.VirtualMachine(
    "vm",
    resource_group_name=resource_group.name,
    network_profile=compute.NetworkProfileArgs(
        network_interfaces=[
            # Exactly one interface is marked as primary.
            compute.NetworkInterfaceReferenceArgs(id=nic1.id, primary=True),
            compute.NetworkInterfaceReferenceArgs(id=nic2.id, primary=False),
        ],
    ),
    os_profile=compute.OSProfileArgs(
        computer_name="hostname",
        # Placeholder credentials; replace them before deploying.
        admin_username="username",
        admin_password="P@ssw0rd1234!",
    ),
    hardware_profile=compute.HardwareProfileArgs(
        vm_size="Standard_DS1_v2",
    ),
    storage_profile=compute.StorageProfileArgs(
        image_reference=compute.ImageReferenceArgs(
            publisher="Canonical",
            offer="UbuntuServer",
            sku="16.04-LTS",
            version="latest",
        ),
        os_disk=compute.OSDiskArgs(
            caching=compute.CachingTypes.READ_WRITE,
            managed_disk=compute.ManagedDiskParametersArgs(
                storage_account_type=compute.StorageAccountTypes.PREMIUM_LRS,
            ),
            create_option=compute.DiskCreateOptionTypes.FROM_IMAGE,
        ),
    ),
)

# Export the IDs of the VM and the NICs
pulumi.export("vm_id", vm.id)
pulumi.export("nic1_id", nic1.id)
pulumi.export("nic2_id", nic2.id)
```


    • resource_group: This is the Azure Resource Group that will contain all of our resources. An Azure Resource Group is a logical container for resources in an Azure subscription.

    • vnet: This is the Virtual Network within which our network interfaces and VM will reside.

    • subnet1 and subnet2: These are subnets within the virtual network. Each subnet is a range of IP addresses in the VNet. They allow you to segment the network further.

    • nic1 and nic2: These are the two network interfaces that we create. Each interface's IP configuration names the subnet it belongs to and requests a dynamically allocated private IP address.

    • vm: This is our virtual machine. Its network_profile references nic1 and nic2 to attach them to the VM, with nic1 marked as the primary interface.

    • os_profile, hardware_profile, and storage_profile: These arguments configure the VM's operating system, hardware specifications, and storage settings, respectively.

    • pulumi.export: These calls expose the IDs of the VM and both NICs as stack outputs, which can be useful for querying the stack's resulting state or integrating with other Pulumi stacks.
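
    One way to consume these stack outputs from a separate script is to read the JSON printed by the pulumi stack output --json command. The sketch below only parses such JSON. The parse_stack_outputs helper and the sample ID strings are made up for illustration; only the keys vm_id, nic1_id, and nic2_id come from the program above.

```python
import json


def parse_stack_outputs(json_text):
    """Return (vm_id, nic_ids) from the JSON text of the stack outputs."""
    outputs = json.loads(json_text)
    # Collect every output named like "nic<N>_id", in sorted key order.
    nic_ids = [
        outputs[key]
        for key in sorted(outputs)
        if key.startswith("nic") and key.endswith("_id")
    ]
    return outputs["vm_id"], nic_ids


# Hypothetical payload standing in for real `pulumi stack output --json` text.
sample = json.dumps({
    "vm_id": "/hypothetical/vm",
    "nic1_id": "/hypothetical/nic1",
    "nic2_id": "/hypothetical/nic2",
})

vm_id, nic_ids = parse_stack_outputs(sample)
print(vm_id)    # /hypothetical/vm
print(nic_ids)  # ['/hypothetical/nic1', '/hypothetical/nic2']
```

    In practice the JSON text could be captured with subprocess.run(["pulumi", "stack", "output", "--json"], capture_output=True, text=True) and read from the result's stdout attribute.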

    This example creates a simple VM with two NICs for distributed AI training on Azure using Pulumi's infrastructure-as-code approach. You can extend this basic template by adding more NICs, customizing the VM size and image, or incorporating other services such as load balancers or network security groups as your application requires. Keep in mind that the VM size limits how many NICs can be attached.
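
    When extending the template to more NICs, it can help to generate the NIC definitions instead of copying them, while keeping exactly one primary interface. The following is a minimal sketch in plain Python. The plan_nics helper, its max_nics parameter, and the dictionary layout are hypothetical and are not part of Pulumi or azure-native; the NIC limit for a given VM size has to be looked up in the Azure documentation and passed in by the caller.

```python
def plan_nics(subnet_names, nic_count, max_nics):
    """Describe nic_count NICs spread round-robin over the given subnets."""
    if nic_count < 1:
        raise ValueError("at least one NIC is required")
    if nic_count > max_nics:
        raise ValueError(f"{nic_count} NICs requested, but the limit is {max_nics}")
    plan = []
    for index in range(nic_count):
        plan.append({
            "name": f"nic{index + 1}",
            "ip_config": f"ipconfig{index + 1}",
            # Cycle through the subnets so NICs are spread evenly.
            "subnet": subnet_names[index % len(subnet_names)],
            # Only the first NIC is the primary interface.
            "primary": index == 0,
        })
    return plan


# Hypothetical plan: three NICs over two subnets, with an assumed limit of four.
for nic in plan_nics(["subnet1", "subnet2"], nic_count=3, max_nics=4):
    print(nic["name"], nic["subnet"], nic["primary"])
```

    Each dictionary could then drive one network.NetworkInterface resource and one compute.NetworkInterfaceReferenceArgs entry in the VM's network_profile.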