Storing Large Language Model Training Datasets on GitHub

Question

Pulumi · Accepted Answer

When dealing with large language model training datasets, it's crucial to organize and manage the data effectively to ensure efficient access and version control. GitHub, while typically known for source code management, can also be used to store smaller datasets.

However, GitHub's file and repository size limits make it a poor fit for typical large language model datasets. GitHub blocks individual files larger than 100 MB and recommends keeping repositories under 1 GB for good performance. For truly large datasets, consider Git LFS (Large File Storage), which keeps large files out of the regular Git history while still working with GitHub repositories, DVC (Data Version Control), or cloud object storage such as Amazon S3, Google Cloud Storage, or Azure Blob Storage.
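To check a dataset directory against the per-file limit before committing, a small scan can help. The following is a minimal sketch, not an official tool: the function name `find_oversized_files` is hypothetical, and the 100 MB limit is interpreted here as 100 MiB, which is an assumption you can adjust through the `limit_bytes` parameter.

```python
from pathlib import Path

# Assumption: the 100 MB per-file limit is treated as 100 MiB.
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def find_oversized_files(root, limit_bytes=GITHUB_FILE_LIMIT_BYTES):
    """Return (path, size_in_bytes) pairs for files under `root` that exceed
    `limit_bytes`, sorted from largest to smallest."""
    root = Path(root)
    oversized = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Skip Git's own metadata directory
        if ".git" in path.relative_to(root).parts:
            continue
        size = path.stat().st_size
        if size > limit_bytes:
            oversized.append((path, size))
    return sorted(oversized, key=lambda item: item[1], reverse=True)


# Example usage (scans the current directory):
# for path, size in find_oversized_files("."):
#     print(f"{size / 1024**2:8.1f} MiB  {path}")
```

Any file this reports would need to go through Git LFS, DVC, or external storage rather than a plain commit.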

Nevertheless, if you'd like to store smaller datasets or other artifacts related to your language models (such as scripts and configurations), here's a Pulumi program that uses the `pulumi_github` package to set up a GitHub repository for that data.

This program does the following:
1. Imports the necessary `pulumi_github` module to interact with GitHub resources through Pulumi.
2. Sets up a new GitHub repository.
3. Configures the repository with attributes like the name, description, and visibility.

Here is the Pulumi program, written in Python:

```python
import pulumi
import pulumi_github as github

# Create a GitHub repository to store datasets.
# The first argument is the Pulumi resource name; `name` sets the repository
# name on GitHub. Without `name`, Pulumi may append a random suffix.
dataset_repo = github.Repository('data-repository',
    name='data-repository',  # replace with a name unique within your account
    description='Repository to store language model training datasets',
    visibility='public',  # or 'private' to restrict access
)

# Export the resulting repository's URL for easy access
pulumi.export('repository_url', dataset_repo.html_url)
```

The code above creates a GitHub repository using Pulumi's GitHub provider. The `Repository` resource accepts parameters such as `name`, `description`, and `visibility`, and creates a repository with those settings under the authenticated account.

The final `pulumi.export` call exposes the URL of the newly created repository as a stack output, which you can open directly in a web browser.

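Before running the program, install the provider's Python package (`pip install pulumi-github`, imported as `pulumi_github`) and authenticate Pulumi with GitHub using a personal access token, for example through the `GITHUB_TOKEN` environment variable or with `pulumi config set github:token <token> --secret`.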

This setup covers the repository itself. If you plan to store large datasets, use a service designed for large file storage, or split the datasets into smaller chunks that fit within GitHub's limits.
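As an illustration of the splitting approach, the following minimal sketch breaks a line-oriented file (for example JSONL) into numbered parts at line boundaries. The function name `split_file_by_lines`, the part naming scheme, and the 50 MiB default are all assumptions chosen for illustration, not part of Pulumi or GitHub tooling.

```python
from pathlib import Path


def split_file_by_lines(source, out_dir, max_bytes=50 * 1024 * 1024):
    """Split a line-oriented file into numbered parts of at most `max_bytes`.

    Lines are never broken apart, so a single line longer than `max_bytes`
    ends up alone in a part that exceeds the limit.
    Returns the list of part paths in order.
    """
    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    parts = []
    current = None
    current_size = 0
    try:
        with source.open("rb") as src:
            for line in src:
                # Open a new part for the first line, or when adding this
                # line would push a non-empty part past the limit
                overflow = current_size > 0 and current_size + len(line) > max_bytes
                if current is None or overflow:
                    if current is not None:
                        current.close()
                    part_name = f"{source.stem}.part{len(parts):04d}{source.suffix}"
                    part_path = out_dir / part_name
                    current = part_path.open("wb")
                    parts.append(part_path)
                    current_size = 0
                current.write(line)
                current_size += len(line)
    finally:
        if current is not None:
            current.close()
    return parts
```

Concatenating the parts in order reproduces the original file byte for byte, so a training pipeline can either read the parts sequentially or reassemble them first.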