Google Cloud · Beginner

How to Create a BigQuery Dataset

Set up BigQuery datasets for data storage and analysis

4 min read · Updated January 2025

A BigQuery dataset is a container that organizes and controls access to your tables and views. This guide walks you through creating a dataset for billing data exports, analytics, or any other data storage needs in Google Cloud.

Prerequisites

Before you begin, ensure you have:

  • A Google Cloud project with the BigQuery API enabled
  • Owner or Editor role on the project (or BigQuery Admin role)
  • Access to the Google Cloud Console, or the Google Cloud CLI installed (which includes the bq command-line tool used below)
  • Basic understanding of data location and access control requirements
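
If you prefer to check these from a terminal, the commands below are a minimal sketch; they assume the Google Cloud CLI is installed and use PROJECT_ID as a placeholder for your project ID.

# Authenticate and select your project
gcloud auth login
gcloud config set project PROJECT_ID

# Confirm the active project
gcloud config get-value project

# Enable the BigQuery API if it is not already enabled
gcloud services enable bigquery.googleapis.com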

Understanding BigQuery Datasets

A dataset in BigQuery is similar to a database in traditional SQL systems. Key characteristics:

  • Organizational unit: Datasets contain tables, views, and models
  • Access control boundary: Permissions are set at the dataset level
  • Regional or multi-regional: Data location affects query performance and cost
  • Naming restrictions: Dataset IDs may contain only letters, numbers, and underscores
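
Because the dataset sits between a project and its tables, tables are referenced with a three-part name (PROJECT.DATASET.TABLE). The query below only illustrates that naming pattern; my-project, billing_data, and my_table are placeholder names.

# Query a table using its fully qualified project.dataset.table name
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) FROM `my-project.billing_data.my_table`'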

Step-by-Step Guide

Method 1: Using Google Cloud Console

  1. Navigate to BigQuery

    • Open the Google Cloud Console
    • Click the navigation menu (three horizontal lines) in the top-left
    • Select BigQuery under the "Analytics" section
    • The BigQuery Studio interface will open
  2. Open the Create Dataset Dialog

    • In the Explorer panel on the left, locate your project name
    • Click the three-dot menu (⋮) next to your project name
    • Select Create dataset
  3. Configure Dataset Settings

    • Dataset ID: Enter a descriptive name (e.g., billing_data, analytics_dataset)

      • Use lowercase letters, numbers, and underscores only
      • Must be unique within the project
      • Cannot be changed after creation
    • Data location: Choose where data will be stored

      • Multi-region (recommended for billing exports):
        • US (multiple regions in United States)
        • EU (multiple regions in European Union)
      • Region (for specific compliance requirements):
        • us-central1, us-east1, europe-west1, etc.
      • Note: Location cannot be changed after creation
  4. Set Default Table Expiration (Optional)

    • Enable table expiration: Check this box to automatically delete tables after a specified time
    • Days until expiration: Enter number of days (e.g., 365 for one year)
    • Use case: Useful for managing storage costs on historical data
    • Note: This is a default setting; individual tables can override it
  5. Configure Encryption (Optional)

    • Default encryption: Google-managed encryption key (recommended)
    • Customer-managed encryption key (CMEK): Select a Cloud KMS key for additional control
    • Use case: Required for some compliance frameworks
  6. Create the Dataset

    • Click CREATE DATASET
    • The new dataset will appear in the Explorer panel
    • You'll see a confirmation message
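
If you want the customer-managed key from step 5 without using the console, the bq tool also accepts a default KMS key when creating a dataset. The sketch below assumes an existing Cloud KMS key and uses placeholder resource names.

# Create a dataset whose tables default to a customer-managed encryption key
bq mk \
  --dataset \
  --location=US \
  --default_kms_key=projects/KMS_PROJECT_ID/locations/us/keyRings/KEY_RING/cryptoKeys/KEY_NAME \
  PROJECT_ID:DATASET_NAME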

Method 2: Using the bq Command-Line Tool

Basic Dataset Creation

# Create a dataset with default settings
bq mk \
  --dataset \
  --location=US \
  PROJECT_ID:DATASET_NAME

Example:

bq mk \
  --dataset \
  --location=US \
  my-project:billing_data

Dataset with Table Expiration

# Create dataset with 365-day default table expiration
bq mk \
  --dataset \
  --location=US \
  --default_table_expiration=31536000 \
  PROJECT_ID:DATASET_NAME

Note: Table expiration is specified in seconds (31,536,000 seconds = 365 days)

Dataset with Description

# Create dataset with description
bq mk \
  --dataset \
  --location=US \
  --description="Billing export data from Cloud Billing" \
  PROJECT_ID:DATASET_NAME

Verify Dataset Creation

# List all datasets in a project
bq ls --project_id=PROJECT_ID

# Show dataset details
bq show --format=prettyjson PROJECT_ID:DATASET_NAME

Method 3: Using Terraform (Infrastructure as Code)

For repeatable infrastructure deployments:

resource "google_bigquery_dataset" "billing_dataset" {
  dataset_id                  = "billing_data"
  friendly_name               = "Billing Export Data"
  description                 = "BigQuery dataset for Cloud Billing exports"
  location                    = "US"
  default_table_expiration_ms = 31536000000  # 365 days in milliseconds

  access {
    role          = "OWNER"
    user_by_email = "[email protected]"
  }

  access {
    role          = "READER"
    special_group = "projectReaders"
  }

  labels = {
    environment = "production"
    team        = "finance"
  }
}

Apply with:

terraform init
terraform plan
terraform apply

Configuring Dataset Access Control

After creating your dataset, configure who can access it:

Using Console

  1. Navigate to Dataset Settings

    • Click on your dataset in the Explorer panel
    • Click the SHARING tab (or click the three-dot menu and select Share)
  2. Add Principals

    • Click ADD PRINCIPAL
    • Enter email addresses or group names
    • Assign roles:
      • BigQuery Data Viewer: Read-only access to tables
      • BigQuery Data Editor: Read and write access
      • BigQuery Admin: Full control over the dataset
  3. Save Changes

    • Click DONE
    • Changes take effect immediately

Using the bq Command-Line Tool

Dataset-level access entries are stored in the dataset's metadata, so the usual workflow is to export the metadata, edit the access list, and apply the change. The legacy READER and WRITER entries correspond to the BigQuery Data Viewer and Data Editor roles.

# Export the dataset's current metadata, including its "access" list
bq show --format=prettyjson PROJECT_ID:DATASET_NAME > dataset.json

# Edit dataset.json and add entries to the "access" array, for example:
#   {"role": "READER", "userByEmail": "[email protected]"}
#   {"role": "WRITER", "groupByEmail": "[email protected]"}

# Apply the updated access list
bq update --source dataset.json PROJECT_ID:DATASET_NAME

Best Practices

Naming Conventions

  1. Use descriptive names: billing_data, analytics_prod, customer_data_warehouse
  2. Include environment: billing_data_prod, billing_data_dev
  3. Avoid generic names: Don't use dataset1, temp, test in production
  4. Use underscores: Not hyphens or spaces (e.g., billing_data not billing-data)

Data Location Selection

  1. Billing exports: Use the US multi-region, which matches the default location for Cloud Billing exports
  2. Compliance requirements: Select region based on data residency laws (e.g., EU for GDPR)
  3. Performance: Choose location close to data sources and users
  4. Cost optimization: Storage and query pricing vary by location, so compare rates for the locations you are considering

Access Control

  1. Principle of least privilege: Grant minimum necessary permissions
  2. Use groups: Assign permissions to groups, not individual users
  3. Service accounts: Use dedicated service accounts for automated processes
  4. Audit regularly: Review dataset access quarterly
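
As an illustration of the first three points, the commands below sketch creating a dedicated service account and granting it read-only BigQuery access at the project level; the account name and project ID are placeholders.

# Create a dedicated service account for an automated pipeline
gcloud iam service-accounts create bq-reader \
  --display-name="BigQuery read-only pipeline"

# Grant it the BigQuery Data Viewer role on the project
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:bq-reader@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataViewer"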

Cost Management

  1. Set default table expiration: Automatically delete old data to reduce storage costs
  2. Use partitioned tables: Reduce query costs by scanning less data
  3. Monitor usage: Set up billing alerts for BigQuery storage and queries
  4. Archive old datasets: Export unused datasets to Cloud Storage
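
To illustrate the partitioning recommendation above, here is a minimal sketch of creating a daily-partitioned table with the bq tool; the dataset, table, and column names are placeholders.

# Create a table partitioned by day on a timestamp column
bq mk \
  --table \
  --time_partitioning_type=DAY \
  --time_partitioning_field=usage_start_time \
  PROJECT_ID:billing_data.daily_costs \
  usage_start_time:TIMESTAMP,service:STRING,cost:FLOAT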

Security

  1. Enable encryption: Use customer-managed keys (CMEK) for sensitive data
  2. Audit logging: Enable Data Access audit logs for compliance
  3. VPC Service Controls: Restrict data access to a service perimeter (requires a Google Cloud organization)
  4. Data masking: Use column-level security for sensitive fields

Common Dataset Configurations

For Billing Exports

bq mk \
  --dataset \
  --location=US \
  --description="Cloud Billing export data" \
  --default_table_expiration=63072000 \
  my-project:billing_data

For Analytics Workloads

bq mk \
  --dataset \
  --location=us-central1 \
  --description="Product analytics and user behavior data" \
  --default_table_expiration=7776000 \
  my-project:analytics_prod

For Data Warehousing

bq mk \
  --dataset \
  --location=US \
  --description="Enterprise data warehouse" \
  my-project:data_warehouse

Troubleshooting

Permission Denied Error

Problem: "Permission denied" when creating a dataset

Solution:

  • Verify you have Owner, Editor, or BigQuery Admin role
  • Check that the BigQuery API is enabled in your project
  • Ensure you're authenticated: gcloud auth login
  • Verify you're working in the correct project: gcloud config get-value project
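
If you are unsure which roles your account holds, a command like the following lists them; the email address is a placeholder.

# List the roles granted to a specific user on the project
gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --format="table(bindings.role)" \
  --filter="bindings.members:user:[email protected]"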

Dataset Already Exists

Problem: "Dataset already exists" error

Solution:

  • Dataset names must be unique within a project
  • List existing datasets: bq ls
  • Choose a different dataset name
  • Delete the existing dataset if it's no longer needed: bq rm -d PROJECT_ID:DATASET_NAME

Invalid Dataset ID

Problem: Error about invalid dataset name

Solution:

  • Use only lowercase letters, numbers, and underscores
  • Cannot start with a number
  • Maximum 1024 characters
  • Cannot use reserved words like table, project, dataset

Location Mismatch

Problem: "Cannot query table from different location" error

Solution:

  • Ensure your dataset location matches your billing export location
  • Billing exports default to US multi-region
  • Cannot change dataset location after creation
  • Create a new dataset in the correct location and reconfigure exports

Insufficient Quota

Problem: "Quota exceeded" when creating dataset

Solution:

  • Review BigQuery quotas in the Google Cloud Console under IAM & Admin > Quotas
  • Check the BigQuery quotas and limits documentation for per-project dataset limits
  • Delete unused datasets to free up quota
  • Request quota increase via Google Cloud Console

Next Steps

After creating your BigQuery dataset:

  • Enable billing exports: Configure Cloud Billing to export to your new dataset
  • Create tables: Define table schemas for your data
  • Set up access control: Grant appropriate permissions to team members
  • Configure monitoring: Set up alerts for storage and query costs
  • Create views: Build SQL views for common queries
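
As a starting point for the last item, a view can be created directly with the bq tool; the project, dataset, table, and query below are placeholders for illustration.

# Create a view that aggregates cost per service
bq mk \
  --use_legacy_sql=false \
  --view='SELECT service, SUM(cost) AS total_cost
          FROM `my-project.billing_data.daily_costs`
          GROUP BY service' \
  my-project:billing_data.cost_by_service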

Frequently Asked Questions

How do I set a default table expiration from the command line?

To set a default table expiration with the bq command-line tool, use the `--default_table_expiration` flag on your `bq mk` command. For instance, to set a 365-day expiration, run: `bq mk --dataset --location=US --default_table_expiration=31536000 PROJECT_ID:DATASET_NAME`. The expiration is specified in seconds (31,536,000 seconds = 365 days) and applies to all tables created within the dataset unless overridden individually. This is particularly useful for managing storage costs for historical data.
