Cross-Organization Insights with BigQuery Analytics Hub
BigQuery Analytics Hub allows organizations to share, discover, and consume analytics assets like datasets, ML models, and queries. It enables cross-organization insights by sharing this data securely across Google Cloud projects or even with external organizations. To create a cross-organization insight system with BigQuery Analytics Hub, you typically set up a Data Exchange and publish Listings that expose your data assets; consumers then subscribe to those listings.
Below, I'll guide you through the process of setting up a simple exchange hub for sharing a dataset using Pulumi and the Google Cloud Platform (GCP) provider for Pulumi.
Setting up a Data Exchange and Listing
The following Pulumi program will:
- Create a new BigQuery Dataset, which is the actual container for your tables, views, and data.
- Initialize a Data Exchange within BigQuery Analytics Hub, where you will catalog your sharable assets.
- Publish a Listing to the Data Exchange which external organizations or other projects can use to access the shared datasets.
I'll provide comments within the code to help explain the purpose of each section. It's important to have GCP configured with the appropriate credentials and permissions before running this Pulumi program.
    import pulumi
    import pulumi_gcp as gcp

    # Replace these variables with your own information.
    # Dataset IDs may contain only letters, digits, and underscores,
    # so the placeholder IDs below avoid hyphens.
    project_id = 'your-gcp-project-id'
    location = 'your-data-exchange-location'  # E.g., 'us-central1'; used for both the dataset and the exchange
    data_exchange_id = 'your_desired_data_exchange_id'
    listing_id = 'your_desired_listing_id'
    dataset_id = 'your_bigquery_dataset_id'

    # Create a BigQuery Dataset
    bigquery_dataset = gcp.bigquery.Dataset("analytics-dataset",
        dataset_id=dataset_id,
        location=location,
        project=project_id,
    )

    # Initialize a BigQuery Analytics Hub Data Exchange
    data_exchange = gcp.bigqueryanalyticshub.DataExchange("data-exchange",
        data_exchange_id=data_exchange_id,
        project=project_id,
        location=location,
        display_name="Company Analytics Exchange",
        description="A Data Exchange Hub for company-wide analytics assets.",
    )

    # Publish a Listing to the Data Exchange
    listing = gcp.bigqueryanalyticshub.Listing("data-listing",
        listing_id=listing_id,
        data_exchange_id=data_exchange.data_exchange_id,
        project=project_id,
        location=location,
        display_name="Quarterly Sales Data",
        description="Listing to share quarterly sales data across organizations.",
        bigquery_dataset=gcp.bigqueryanalyticshub.ListingBigqueryDatasetArgs(
            # The listing expects the dataset's full resource name
            # (projects/<project>/datasets/<dataset>), which is the resource's `id`,
            # not the bare dataset_id.
            dataset=bigquery_dataset.id,
        ),
    )

    # Export the Listing ID so it can be shared with consumers
    pulumi.export('listing_id', listing.listing_id)
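As an optional pre-flight check, you can validate the identifiers locally before deploying. BigQuery dataset IDs may contain only letters, digits, and underscores, and to my knowledge Analytics Hub exchange and listing IDs follow a similar rule. The sketch below assumes that pattern; the helper name check_id and the 100-character limit are illustrative assumptions, not part of the Pulumi or Google Cloud APIs.

```python
import re

# Assumed rule: ASCII letters, digits, and underscores only.
_ID_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def check_id(value: str, kind: str, max_len: int = 100) -> str:
    """Return `value` unchanged if it looks like a valid ID, else raise ValueError.

    `kind` is only used in the error message (e.g. "dataset ID").
    `max_len` is an assumed limit; check the service documentation for the real one.
    """
    if not value or len(value) > max_len:
        raise ValueError(f"{kind} must be 1 to {max_len} characters long")
    if not _ID_PATTERN.fullmatch(value):
        raise ValueError(
            f"{kind} {value!r} may contain only letters, digits, and underscores"
        )
    return value
```

Calling check_id(dataset_id, "dataset ID") near the top of the program would then fail fast on a hyphenated placeholder instead of failing partway through a deployment.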
The Pulumi program above deploys three resources that together let you share your analytics assets. Here is what each one does:
- gcp.bigquery.Dataset: Represents a dataset in BigQuery, where your analytics assets like tables and views will reside. It's the foundational component that will host the data you intend to share.
- gcp.bigqueryanalyticshub.DataExchange: A resource representing the Data Exchange in BigQuery Analytics Hub. This exchange serves as the marketplace for your organization's analytics assets. It's like creating a storefront for your datasets.
- gcp.bigqueryanalyticshub.Listing: A resource that creates a listing within the Data Exchange. A listing is akin to a product on your storefront, representing the dataset or analytics asset that you want to share.
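Publishing a listing does not by itself let another organization subscribe; the consumer's principal generally needs a subscriber role on the listing. Here is a minimal sketch of such a grant, assuming the predefined role roles/analyticshub.subscriber and Pulumi's gcp.bigqueryanalyticshub.ListingIamMember resource. The helper names and the example group address are hypothetical.

```python
SUBSCRIBER_ROLE = "roles/analyticshub.subscriber"
_MEMBER_PREFIXES = ("user:", "group:", "serviceAccount:", "domain:")


def subscriber_grant_args(project, location, data_exchange_id, listing_id, member):
    """Build keyword arguments for an IAM grant on a single listing.

    `member` must be a plain string in IAM member syntax, e.g. "group:...".
    The ID arguments may be strings or Pulumi outputs; they are passed through.
    """
    if not member.startswith(_MEMBER_PREFIXES):
        raise ValueError(
            f"member must start with one of {_MEMBER_PREFIXES}, got {member!r}"
        )
    return {
        "project": project,
        "location": location,
        "data_exchange_id": data_exchange_id,
        "listing_id": listing_id,
        "role": SUBSCRIBER_ROLE,
        "member": member,
    }


def grant_subscriber(resource_name, **kwargs):
    """Create the IAM member resource; call this only inside a Pulumi program."""
    import pulumi_gcp as gcp  # imported lazily so the helper above works without Pulumi

    return gcp.bigqueryanalyticshub.ListingIamMember(
        resource_name, **subscriber_grant_args(**kwargs)
    )
```

Inside the program you might call grant_subscriber("partner-subscriber", project=project_id, location=location, data_exchange_id=data_exchange.data_exchange_id, listing_id=listing.listing_id, member="group:analysts@partner.example"). Granting on the listing rather than the whole exchange keeps access scoped to the one asset you intend to share.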
Once this Pulumi program is applied, you will have a working Analytics Hub setup with one dataset listed for sharing. To execute the program, save the code as the program file of a Pulumi Python project (typically __main__.py), navigate to that directory, and run pulumi up through the Pulumi CLI. Ensure that you have the appropriate permissions to create resources in your GCP project.
After creating a listing, share the listing_id, together with the project, location, and data exchange it belongs to, with data consumers, who can then subscribe to the listing from their own Google Cloud projects. This establishes a cross-organization insight sharing system in which stakeholders query the shared data without you having to copy it to them.
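The Analytics Hub API identifies a listing by a full resource name rather than by the bare listing ID, so it can be convenient to hand consumers that name directly. Below is a small sketch of assembling it, assuming the projects/.../locations/.../dataExchanges/.../listings/... format; the function name and sample values are illustrative.

```python
def listing_resource_name(
    project: str, location: str, data_exchange_id: str, listing_id: str
) -> str:
    """Assemble the full resource name that identifies a listing."""
    parts = {
        "project": project,
        "location": location,
        "data_exchange_id": data_exchange_id,
        "listing_id": listing_id,
    }
    # Reject empty components and stray slashes, which would corrupt the path.
    for label, value in parts.items():
        if not value or "/" in value:
            raise ValueError(f"{label} must be a non-empty string without '/'")
    return (
        f"projects/{project}/locations/{location}"
        f"/dataExchanges/{data_exchange_id}/listings/{listing_id}"
    )


# Sample values only; substitute the ones from your own deployment.
print(listing_resource_name("my-project", "US", "company_exchange", "quarterly_sales"))
```

As far as I know, the Pulumi Listing resource also exposes this value as its name output, so exporting listing.name alongside listing_id may be a simpler way to get the same string.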