How to Merge Historical and New Cost Export Data in Azure

Combine historical and ongoing exports for long-term trend analysis

20 min read · Updated January 2025

Azure Cost Management exports generate daily or monthly billing files, but historical data from before you set up exports isn't automatically included. This creates a gap when you need to perform year-over-year comparisons, trend analysis, or long-term forecasting. This guide shows you how to combine historical billing data with your ongoing cost exports to create a unified, continuous dataset for comprehensive financial analysis.

Overview

Merging historical and new cost data involves several challenges:

  • Different data schemas: Historical exports may use older formats than current exports
  • File location differences: Old data might be in different storage accounts or containers
  • Duplicate prevention: Overlapping date ranges need deduplication
  • Performance at scale: Processing years of billing data requires efficient queries
  • Ongoing synchronization: New exports must continuously merge with historical baseline

This guide covers four approaches:

  1. Azure Data Factory: Automated pipeline-based merging (recommended for production)
  2. Azure Synapse Analytics: SQL-based merging with serverless or dedicated pools
  3. PowerShell scripting: Manual or scheduled merging for smaller datasets
  4. Azure Databricks: Advanced transformations and large-scale processing

Prerequisites

Before you begin, ensure you have:

  • Azure subscription with existing cost exports configured
  • Historical billing data exported and stored in Azure Storage (CSV or Parquet format)
  • Azure PowerShell module (7.0.0+) OR Azure CLI (2.40.0+)
  • Storage account access with Storage Blob Data Contributor role
  • Azure Data Factory OR Azure Synapse Analytics workspace (depending on chosen method)
  • Power BI or analytics tool for validation (optional)
  • Understanding of Azure billing data schema (UsageDate, Cost, ResourceId, etc.)
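
A quick way to confirm the local tooling against the versions listed above (assumes the Az PowerShell module and Azure CLI were installed through their standard channels):

# Check installed Az PowerShell module and Azure CLI versions
(Get-InstalledModule -Name Az -ErrorAction SilentlyContinue).Version
az version --output json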

Understanding Cost Data Schema

Common Fields Across Versions

Both historical and current exports typically include:

  • UsageDateTime / Date - When the resource was consumed
  • Cost / PreTaxCost - Actual cost in billing currency
  • MeterCategory - Service category (e.g., Virtual Machines, Storage)
  • ResourceId - Full Azure Resource Manager ID
  • ResourceGroup - Resource group name
  • SubscriptionId - Subscription GUID
  • Tags - Resource tags as JSON

Schema Evolution Considerations

Schema changes between versions:

  • Field name changes: UsageDateTime → Date, PreTaxCost → Cost
  • New fields added: PricingModel, CostInUsd, ExchangeRate
  • Deprecated fields: Some older fields may no longer exist

Solution: Use schema mapping to normalize field names before merging.
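
As a concrete sketch of such a mapping (using PowerShell and a hypothetical historical-sample.csv input file; it mirrors the normalization logic used in Method 3 below), legacy columns are renamed while rows that already use the current names pass through unchanged:

# Normalize legacy column names onto the current schema
$rows = Import-Csv -Path '.\historical-sample.csv'   # hypothetical input file

$normalized = $rows | Select-Object `
    @{N='Date'; E={ if ($_.PSObject.Properties['UsageDateTime']) { $_.UsageDateTime } else { $_.Date } }},
    @{N='Cost'; E={ if ($_.PSObject.Properties['PreTaxCost']) { $_.PreTaxCost } else { $_.Cost } }},
    SubscriptionId, ResourceGroup, ResourceId, MeterCategory, Currency, Tags

$normalized | Export-Csv -Path '.\normalized-sample.csv' -NoTypeInformation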

Method 1: Azure Data Factory Pipeline (Recommended)

Azure Data Factory provides automated, scalable pipelines for merging historical and incremental data.

Step 1: Create Data Factory Workspace

# Create resource group
az group create \
  --name rg-cost-analytics \
  --location eastus

# Create Data Factory
az datafactory create \
  --resource-group rg-cost-analytics \
  --name adf-cost-merge \
  --location eastus

Step 2: Create Linked Services

# Create linked service for storage account (historical data)
az datafactory linked-service create \
  --resource-group rg-cost-analytics \
  --factory-name adf-cost-merge \
  --linked-service-name LS_HistoricalStorage \
  --properties '{
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=historicalcosts;AccountKey=..."
    }
  }'

# Create linked service for current exports storage
az datafactory linked-service create \
  --resource-group rg-cost-analytics \
  --factory-name adf-cost-merge \
  --linked-service-name LS_CurrentExports \
  --properties '{
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=currentcosts;AccountKey=..."
    }
  }'
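
Rather than pasting account keys inline, you can retrieve each connection string at deployment time and substitute it into the JSON above; for example (adjust the resource group if the storage account lives elsewhere):

# Retrieve the connection string for the historical storage account
az storage account show-connection-string \
  --name historicalcosts \
  --resource-group rg-cost-analytics \
  --query connectionString \
  --output tsv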

Step 3: Create Datasets

Historical Dataset:

{
  "name": "DS_HistoricalCosts",
  "properties": {
    "linkedServiceName": {
      "referenceName": "LS_HistoricalStorage",
      "type": "LinkedServiceReference"
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "historical-exports",
        "folderPath": "costs"
      },
      "columnDelimiter": ",",
      "rowDelimiter": "\n",
      "firstRowAsHeader": true
    },
    "schema": []
  }
}

Current Exports Dataset:

{
  "name": "DS_CurrentExports",
  "properties": {
    "linkedServiceName": {
      "referenceName": "LS_CurrentExports",
      "type": "LinkedServiceReference"
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "cost-exports",
        "folderPath": "daily"
      },
      "columnDelimiter": ",",
      "rowDelimiter": "\n",
      "firstRowAsHeader": true
    },
    "schema": []
  }
}
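
The merge pipeline in the next step also writes to a DS_MergedOutput dataset, which needs its own definition. A minimal example is shown below; the linked service, container, and folder are assumptions to adapt to wherever you want the merged files to land.

Merged Output Dataset (example):

{
  "name": "DS_MergedOutput",
  "properties": {
    "linkedServiceName": {
      "referenceName": "LS_CurrentExports",
      "type": "LinkedServiceReference"
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "merged-data",
        "folderPath": "costs"
      },
      "columnDelimiter": ",",
      "rowDelimiter": "\n",
      "firstRowAsHeader": true
    },
    "schema": []
  }
}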

Step 4: Create Merge Pipeline

{
  "name": "PL_MergeCostData",
  "properties": {
    "activities": [
      {
        "name": "CopyHistoricalData",
        "type": "Copy",
        "inputs": [
          {
            "referenceName": "DS_HistoricalCosts",
            "type": "DatasetReference"
          }
        ],
        "outputs": [
          {
            "referenceName": "DS_MergedOutput",
            "type": "DatasetReference"
          }
        ],
        "typeProperties": {
          "source": {
            "type": "DelimitedTextSource"
          },
          "sink": {
            "type": "DelimitedTextSink",
            "writeBehavior": "append"
          },
          "enableStaging": false,
          "translator": {
            "type": "TabularTranslator",
            "mappings": [
              {
                "source": { "name": "UsageDateTime" },
                "sink": { "name": "Date" }
              },
              {
                "source": { "name": "PreTaxCost" },
                "sink": { "name": "Cost" }
              }
            ]
          }
        }
      },
      {
        "name": "AppendCurrentData",
        "type": "Copy",
        "dependsOn": [
          {
            "activity": "CopyHistoricalData",
            "dependencyConditions": ["Succeeded"]
          }
        ],
        "inputs": [
          {
            "referenceName": "DS_CurrentExports",
            "type": "DatasetReference"
          }
        ],
        "outputs": [
          {
            "referenceName": "DS_MergedOutput",
            "type": "DatasetReference"
          }
        ],
        "typeProperties": {
          "source": {
            "type": "DelimitedTextSource"
          },
          "sink": {
            "type": "DelimitedTextSink",
            "writeBehavior": "append"
          }
        }
      }
    ]
  }
}

Step 5: Run and Schedule Pipeline

# Trigger pipeline manually
az datafactory pipeline create-run \
  --resource-group rg-cost-analytics \
  --factory-name adf-cost-merge \
  --pipeline-name PL_MergeCostData

# Create daily schedule trigger
az datafactory trigger create \
  --resource-group rg-cost-analytics \
  --factory-name adf-cost-merge \
  --name TR_DailyMerge \
  --properties '{
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2025-01-01T02:00:00Z"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "PL_MergeCostData",
          "type": "PipelineReference"
        }
      }
    ]
  }'

# Start the trigger
az datafactory trigger start \
  --resource-group rg-cost-analytics \
  --factory-name adf-cost-merge \
  --name TR_DailyMerge
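
After the trigger is running, confirm recent pipeline runs completed successfully. One way is with the Az.DataFactory PowerShell module (a monitoring sketch, separate from the pipeline itself):

# List pipeline runs from the last 24 hours and their status
Get-AzDataFactoryV2PipelineRun `
    -ResourceGroupName "rg-cost-analytics" `
    -DataFactoryName "adf-cost-merge" `
    -LastUpdatedAfter (Get-Date).AddDays(-1) `
    -LastUpdatedBefore (Get-Date) |
    Select-Object PipelineName, RunStart, RunEnd, Status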

Method 2: Azure Synapse Analytics

Step 1: Create Synapse Workspace

# Create Synapse workspace
az synapse workspace create \
  --name synapse-cost-analytics \
  --resource-group rg-cost-analytics \
  --storage-account historicalcosts \
  --file-system cost-data \
  --sql-admin-login-user sqladmin \
  --sql-admin-login-password "YourP@ssw0rd!" \
  --location eastus

Step 2: Create External Tables for Historical Data

-- Connect to Synapse serverless SQL pool

-- Create database
CREATE DATABASE CostAnalytics;
GO

USE CostAnalytics;
GO
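
-- The external data sources below reference a credential named StorageCredential,
-- which must exist before they are created. One option (assumption: the Synapse
-- workspace managed identity has Storage Blob Data Reader on both storage accounts):
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword123!>';
GO

CREATE DATABASE SCOPED CREDENTIAL StorageCredential
WITH IDENTITY = 'Managed Identity';
GO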

-- Create data source for historical costs
CREATE EXTERNAL DATA SOURCE HistoricalCosts
WITH (
    LOCATION = 'https://historicalcosts.blob.core.windows.net/historical-exports',
    CREDENTIAL = StorageCredential
);

-- Create file format
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        FIRST_ROW = 2,
        USE_TYPE_DEFAULT = FALSE
    )
);

-- Create external table for historical data
CREATE EXTERNAL TABLE HistoricalCostData
(
    UsageDateTime DATE,
    SubscriptionId VARCHAR(36),
    ResourceGroup VARCHAR(90),
    ResourceId VARCHAR(255),
    MeterCategory VARCHAR(50),
    PreTaxCost DECIMAL(18,4),
    Currency VARCHAR(3),
    Tags VARCHAR(MAX)
)
WITH (
    LOCATION = '/costs/*.csv',
    DATA_SOURCE = HistoricalCosts,
    FILE_FORMAT = CsvFormat
);

Step 3: Create External Tables for Current Exports

-- Create data source for current exports
CREATE EXTERNAL DATA SOURCE CurrentExports
WITH (
    LOCATION = 'https://currentcosts.blob.core.windows.net/cost-exports',
    CREDENTIAL = StorageCredential
);

-- Create external table for current exports
CREATE EXTERNAL TABLE CurrentCostData
(
    Date DATE,
    SubscriptionId VARCHAR(36),
    ResourceGroup VARCHAR(90),
    ResourceId VARCHAR(255),
    MeterCategory VARCHAR(50),
    Cost DECIMAL(18,4),
    Currency VARCHAR(3),
    Tags VARCHAR(MAX)
)
WITH (
    LOCATION = '/daily/*.csv',
    DATA_SOURCE = CurrentExports,
    FILE_FORMAT = CsvFormat
);

Step 4: Create Unified View with Schema Normalization

-- Create view that merges and normalizes both datasets
CREATE VIEW vw_MergedCostData AS
SELECT
    UsageDateTime AS Date,
    SubscriptionId,
    ResourceGroup,
    ResourceId,
    MeterCategory,
    PreTaxCost AS Cost,
    Currency,
    Tags,
    'Historical' AS DataSource
FROM HistoricalCostData
WHERE UsageDateTime < '2025-01-01'  -- Cutoff date for historical data

UNION ALL

SELECT
    Date,
    SubscriptionId,
    ResourceGroup,
    ResourceId,
    MeterCategory,
    Cost,
    Currency,
    Tags,
    'Current' AS DataSource
FROM CurrentCostData
WHERE Date >= '2025-01-01';  -- Start date for current exports
GO
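
With the view in place, downstream queries can span the full timeline without worrying about which export produced each row. For example, a simple monthly spend trend across both sources:

-- Monthly spend across the combined historical + current range
SELECT
    YEAR([Date]) AS BillingYear,
    MONTH([Date]) AS BillingMonth,
    SUM(Cost) AS TotalCost
FROM vw_MergedCostData
GROUP BY YEAR([Date]), MONTH([Date])
ORDER BY BillingYear, BillingMonth;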

Step 5: Materialize Merged Data (Optional)

Note: the CTAS statement below uses a HASH distribution and a clustered columnstore index, which require a dedicated SQL pool. On a serverless pool, use CETAS to write the merged result back to storage instead.

-- Create materialized table for better query performance
CREATE TABLE MergedCostData
WITH (
    DISTRIBUTION = HASH(SubscriptionId),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM vw_MergedCostData;

-- Create statistics for query optimization
CREATE STATISTICS stat_date ON MergedCostData(Date);
CREATE STATISTICS stat_subscription ON MergedCostData(SubscriptionId);
CREATE STATISTICS stat_resourcegroup ON MergedCostData(ResourceGroup);

Step 6: Schedule Refresh with Synapse Pipeline

-- Create stored procedure for incremental refresh
CREATE PROCEDURE sp_RefreshMergedCosts
AS
BEGIN
    -- Delete and reload last 3 days to handle late-arriving data
    DELETE FROM MergedCostData
    WHERE Date >= DATEADD(day, -3, GETDATE());

    -- Insert new/updated data
    INSERT INTO MergedCostData
    SELECT * FROM vw_MergedCostData
    WHERE Date >= DATEADD(day, -3, GETDATE());
END;
GO

Create a Synapse pipeline with a stored procedure activity and a daily schedule trigger to run sp_RefreshMergedCosts automatically.

Method 3: PowerShell Script

For smaller datasets or manual merging:

# Cost Data Merge Script
param(
    [Parameter(Mandatory=$true)]
    [string]$HistoricalStorageAccount,

    [Parameter(Mandatory=$true)]
    [string]$HistoricalContainer,

    [Parameter(Mandatory=$true)]
    [string]$CurrentStorageAccount,

    [Parameter(Mandatory=$true)]
    [string]$CurrentContainer,

    [Parameter(Mandatory=$true)]
    [string]$OutputStorageAccount,

    [Parameter(Mandatory=$true)]
    [string]$OutputContainer,

    [Parameter(Mandatory=$false)]
    [datetime]$CutoffDate = (Get-Date "2025-01-01")
)

# Install required modules (skip if already installed)
if (-not (Get-Module -ListAvailable -Name Az.Storage)) {
    Install-Module -Name Az.Storage -Force -AllowClobber
}

# Connect to Azure
Connect-AzAccount

# Function to normalize schema
function Normalize-CostData {
    param($Data, $IsHistorical)

    if ($IsHistorical) {
        # Map old field names to new schema
        $Data | Select-Object `
            @{N='Date';E={$_.UsageDateTime}},
            @{N='Cost';E={$_.PreTaxCost}},
            SubscriptionId,
            ResourceGroup,
            ResourceId,
            MeterCategory,
            Currency,
            Tags,
            @{N='DataSource';E={'Historical'}}
    } else {
        # Current schema, just add source flag
        $Data | Select-Object *,
            @{N='DataSource';E={'Current'}}
    }
}

# Get storage contexts
$historicalCtx = New-AzStorageContext `
    -StorageAccountName $HistoricalStorageAccount `
    -UseConnectedAccount

$currentCtx = New-AzStorageContext `
    -StorageAccountName $CurrentStorageAccount `
    -UseConnectedAccount

$outputCtx = New-AzStorageContext `
    -StorageAccountName $OutputStorageAccount `
    -UseConnectedAccount

Write-Host "Downloading historical cost data..." -ForegroundColor Cyan

# Download and process historical data
$historicalBlobs = Get-AzStorageBlob `
    -Context $historicalCtx `
    -Container $HistoricalContainer

$historicalData = @()
foreach ($blob in $historicalBlobs) {
    $tempFile = [System.IO.Path]::GetTempFileName()
    Get-AzStorageBlobContent `
        -Context $historicalCtx `
        -Container $HistoricalContainer `
        -Blob $blob.Name `
        -Destination $tempFile `
        -Force | Out-Null

    $csvData = Import-Csv -Path $tempFile
    $normalizedData = Normalize-CostData -Data $csvData -IsHistorical $true
    $historicalData += $normalizedData | Where-Object { [datetime]$_.Date -lt $CutoffDate }

    Remove-Item $tempFile
}

Write-Host "Downloaded $($historicalData.Count) historical records" -ForegroundColor Green

Write-Host "Downloading current cost data..." -ForegroundColor Cyan

# Download and process current data
$currentBlobs = Get-AzStorageBlob `
    -Context $currentCtx `
    -Container $CurrentContainer

$currentData = @()
foreach ($blob in $currentBlobs) {
    $tempFile = [System.IO.Path]::GetTempFileName()
    Get-AzStorageBlobContent `
        -Context $currentCtx `
        -Container $CurrentContainer `
        -Blob $blob.Name `
        -Destination $tempFile `
        -Force | Out-Null

    $csvData = Import-Csv -Path $tempFile
    $normalizedData = Normalize-CostData -Data $csvData -IsHistorical $false
    $currentData += $normalizedData | Where-Object { [datetime]$_.Date -ge $CutoffDate }

    Remove-Item $tempFile
}

Write-Host "Downloaded $($currentData.Count) current records" -ForegroundColor Green

Write-Host "Merging datasets and removing duplicates..." -ForegroundColor Cyan

# Merge and deduplicate
$mergedData = $historicalData + $currentData
$uniqueData = $mergedData | Group-Object -Property Date,SubscriptionId,ResourceId |
    ForEach-Object { $_.Group | Select-Object -First 1 }

Write-Host "Merged dataset contains $($uniqueData.Count) unique records" -ForegroundColor Green

# Export merged data
$outputFile = [System.IO.Path]::GetTempFileName()
$uniqueData | Export-Csv -Path $outputFile -NoTypeInformation

# Upload to output storage
$outputBlobName = "merged-costs-$(Get-Date -Format 'yyyyMMdd').csv"
Set-AzStorageBlobContent `
    -Context $outputCtx `
    -Container $OutputContainer `
    -File $outputFile `
    -Blob $outputBlobName `
    -Force

Write-Host "Merged data uploaded to: $outputBlobName" -ForegroundColor Green

# Cleanup
Remove-Item $outputFile

Run the script:

.\Merge-CostData.ps1 `
    -HistoricalStorageAccount "historicalcosts" `
    -HistoricalContainer "historical-exports" `
    -CurrentStorageAccount "currentcosts" `
    -CurrentContainer "cost-exports" `
    -OutputStorageAccount "mergedcosts" `
    -OutputContainer "merged-data" `
    -CutoffDate "2025-01-01"
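
After the run completes, a quick sanity check on the uploaded file helps catch obvious problems before you wire it into reporting. The sketch below assumes the output account and container names used in the example above and the date-stamped blob name the script generates:

# Download and spot-check the merged CSV
$ctx = New-AzStorageContext -StorageAccountName "mergedcosts" -UseConnectedAccount
$blobName = "merged-costs-$(Get-Date -Format 'yyyyMMdd').csv"
Get-AzStorageBlobContent -Context $ctx -Container "merged-data" `
    -Blob $blobName -Destination ".\$blobName" -Force | Out-Null

$merged = Import-Csv -Path ".\$blobName"
$dates  = $merged | ForEach-Object { [datetime]$_.Date } | Sort-Object
$total  = ($merged | ForEach-Object { [double]$_.Cost } | Measure-Object -Sum).Sum
Write-Host ("Rows: {0}  Range: {1:yyyy-MM-dd} to {2:yyyy-MM-dd}  Total cost: {3:N2}" -f `
    $merged.Count, $dates[0], $dates[-1], $total)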

Method 4: Azure Databricks

For advanced transformations and large-scale processing:

# Install libraries
%pip install azure-storage-blob

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, to_date
from pyspark.sql.types import *

# Initialize Spark session
spark = SparkSession.builder \
    .appName("CostDataMerge") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Configure storage access (a key is needed for every storage account you read from or write to)
account_keys = {
    "historicalcosts": "your-historical-storage-key",
    "currentcosts": "your-current-storage-key",
    "mergedcosts": "your-merged-storage-key",
}

for account, key in account_keys.items():
    spark.conf.set(f"fs.azure.account.key.{account}.blob.core.windows.net", key)

# Define schema for historical data
historical_schema = StructType([
    StructField("UsageDateTime", StringType(), True),
    StructField("SubscriptionId", StringType(), True),
    StructField("ResourceGroup", StringType(), True),
    StructField("ResourceId", StringType(), True),
    StructField("MeterCategory", StringType(), True),
    StructField("PreTaxCost", DoubleType(), True),
    StructField("Currency", StringType(), True),
    StructField("Tags", StringType(), True)
])

# Load historical data
historical_df = spark.read \
    .option("header", "true") \
    .schema(historical_schema) \
    .csv("wasbs://historical-exports@historicalcosts.blob.core.windows.net/costs/*.csv")

# Normalize historical data schema
historical_normalized = historical_df \
    .withColumn("Date", to_date(col("UsageDateTime"))) \
    .withColumnRenamed("PreTaxCost", "Cost") \
    .withColumn("DataSource", lit("Historical")) \
    .filter(col("Date") < "2025-01-01") \
    .select("Date", "SubscriptionId", "ResourceGroup", "ResourceId",
            "MeterCategory", "Cost", "Currency", "Tags", "DataSource")

# Define schema for current data
current_schema = StructType([
    StructField("Date", StringType(), True),
    StructField("SubscriptionId", StringType(), True),
    StructField("ResourceGroup", StringType(), True),
    StructField("ResourceId", StringType(), True),
    StructField("MeterCategory", StringType(), True),
    StructField("Cost", DoubleType(), True),
    StructField("Currency", StringType(), True),
    StructField("Tags", StringType(), True)
])

# Load current exports
current_df = spark.read \
    .option("header", "true") \
    .schema(current_schema) \
    .csv("wasbs://cost-exports@currentcosts.blob.core.windows.net/daily/*.csv")

# Normalize current data
current_normalized = current_df \
    .withColumn("Date", to_date(col("Date"))) \
    .withColumn("DataSource", lit("Current")) \
    .filter(col("Date") >= "2025-01-01") \
    .select("Date", "SubscriptionId", "ResourceGroup", "ResourceId",
            "MeterCategory", "Cost", "Currency", "Tags", "DataSource")

# Merge datasets
merged_df = historical_normalized.union(current_normalized)

# Remove duplicates on the logical key (Date, SubscriptionId, ResourceId)
deduplicated_df = merged_df \
    .dropDuplicates(["Date", "SubscriptionId", "ResourceId"])

# Show statistics
print(f"Historical records: {historical_normalized.count()}")
print(f"Current records: {current_normalized.count()}")
print(f"Merged records: {merged_df.count()}")
print(f"After deduplication: {deduplicated_df.count()}")

# Write merged data to Parquet (optimized format)
deduplicated_df.write \
    .mode("overwrite") \
    .partitionBy("Date") \
    .parquet("wasbs://merged-data@mergedcosts.blob.core.windows.net/costs/")

# Create Delta table for incremental updates (optional)
deduplicated_df.write \
    .format("delta") \
    .mode("overwrite") \
    .save("wasbs://merged-data@mergedcosts.blob.core.windows.net/costs-delta/")

print("Merge complete!")

Best Practices

1. Define Clear Cutoff Dates

-- Use exact timestamp to prevent overlaps
DECLARE @CutoffDate DATETIME = '2025-01-01 00:00:00';

SELECT * FROM HistoricalData WHERE Date < @CutoffDate
UNION ALL
SELECT * FROM CurrentData WHERE Date >= @CutoffDate;

2. Implement Deduplication Logic

# Python/PySpark deduplication example
deduplicated = merged_df.dropDuplicates([
    "Date",
    "SubscriptionId",
    "ResourceId",
    "MeterId"
])

3. Validate Merged Data

-- Check for gaps in date range
WITH DateRange AS (
    SELECT MIN(Date) AS StartDate, MAX(Date) AS EndDate
    FROM MergedCostData
),
Numbers AS (
    -- Any table with at least as many rows as days in the range works as a tally source
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
    FROM MergedCostData
),
AllDates AS (
    SELECT DATEADD(day, n.n, r.StartDate) AS Date
    FROM DateRange r
    JOIN Numbers n ON DATEADD(day, n.n, r.StartDate) <= r.EndDate
)
SELECT d.Date AS MissingDate
FROM AllDates d
LEFT JOIN MergedCostData m ON d.Date = m.Date
WHERE m.Date IS NULL
ORDER BY d.Date;

4. Monitor Merge Performance

# Track merge execution time
$startTime = Get-Date
# ... merge operations ...
$endTime = Get-Date
$duration = $endTime - $startTime
Write-Host "Merge completed in $($duration.TotalMinutes) minutes"

5. Use Partitioning for Large Datasets

-- Partition by month for better query performance
CREATE TABLE MergedCostData
WITH (
    DISTRIBUTION = HASH(SubscriptionId),
    PARTITION (Date RANGE RIGHT FOR VALUES (
        '2023-01-01', '2023-02-01', '2023-03-01', ..., '2025-12-01'
    ))
)
AS SELECT * FROM vw_MergedCostData;

6. Implement Incremental Refresh

# Only process new data since last merge
last_merge_date = get_last_merge_timestamp()
new_data = current_df.filter(col("Date") > last_merge_date)
# Append only new records
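
A minimal sketch of that pattern against the Delta table written in Method 4 (the path and the current_df DataFrame are assumptions carried over from that example):

# Incremental refresh sketch: append only rows newer than the current high-water mark
from pyspark.sql.functions import col, lit, max as spark_max

delta_path = "wasbs://merged-data@mergedcosts.blob.core.windows.net/costs-delta/"  # assumed location

# High-water mark = latest Date already present in the merged Delta table
last_merge_date = (spark.read.format("delta").load(delta_path)
                   .agg(spark_max("Date").alias("max_date"))
                   .collect()[0]["max_date"])

# current_df is the freshly loaded export DataFrame from Method 4
new_rows = current_df.filter(col("Date") > lit(last_merge_date))
new_rows.write.format("delta").mode("append").save(delta_path)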

Troubleshooting

Issue: Schema mismatch between historical and current data

Symptoms: Union/merge fails with column type errors

Resolution:

-- Explicitly cast columns to matching types
SELECT
    CAST(UsageDateTime AS DATE) AS Date,
    CAST(PreTaxCost AS DECIMAL(18,4)) AS Cost,
    ...
FROM HistoricalData
UNION ALL
SELECT
    CAST(Date AS DATE) AS Date,
    CAST(Cost AS DECIMAL(18,4)) AS Cost,
    ...
FROM CurrentData;

Issue: Duplicate records in merged dataset

Symptoms: Total cost sums don't match individual sources

Resolution:

-- Use ROW_NUMBER to keep only first occurrence
WITH Deduplicated AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY Date, SubscriptionId, ResourceId
            ORDER BY DataSource DESC  -- Prefer 'Historical' over 'Current'
        ) AS rn
    FROM MergedData
)
SELECT * FROM Deduplicated WHERE rn = 1;

Issue: Performance degradation with large datasets

Symptoms: Merge queries take hours to complete

Resolution:

# Use Parquet instead of CSV
spark.read.parquet("wasbs://.../costs.parquet")

# Enable adaptive query execution
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Repartition large datasets
df.repartition(100, "Date")

Issue: Missing data in date ranges

Symptoms: Gaps in merged dataset

Resolution:

# Verify source data completeness by extracting the date stamp from each blob name
$historical = Get-AzStorageBlob -Container "historical" -Context $historicalCtx |
    ForEach-Object { if ($_.Name -match "(\d{4}-\d{2}-\d{2})") { $Matches[1] } } |
    Sort-Object

$current = Get-AzStorageBlob -Container "current" -Context $currentCtx |
    ForEach-Object { if ($_.Name -match "(\d{4}-\d{2}-\d{2})") { $Matches[1] } } |
    Sort-Object

# Find gaps: check every day between the first and last export for a missing file
$allDates = $historical + $current | Sort-Object -Unique | ForEach-Object { [datetime]$_ }
for ($d = $allDates[0]; $d -le $allDates[-1]; $d = $d.AddDays(1)) {
    if ($allDates -notcontains $d) { Write-Host "Missing export for $($d.ToString('yyyy-MM-dd'))" }
}

Issue: Out of memory errors during merge

Symptoms: PowerShell/Python script crashes

Resolution:

# Process in batches
$batchSize = 10000
for ($i = 0; $i -lt $allData.Count; $i += $batchSize) {
    $batch = $allData[$i..[Math]::Min($i + $batchSize - 1, $allData.Count - 1)]
    # Process batch
    $batch | Export-Csv -Path "batch-$i.csv" -Append
}

Validation and Testing

Data Quality Checks

-- Validate merged data
-- 1. Check total cost matches sum of sources
SELECT
    'Historical' AS Source,
    COUNT(*) AS RecordCount,
    SUM(Cost) AS TotalCost
FROM HistoricalData
UNION ALL
SELECT
    'Current' AS Source,
    COUNT(*) AS RecordCount,
    SUM(Cost) AS TotalCost
FROM CurrentData
UNION ALL
SELECT
    'Merged' AS Source,
    COUNT(*) AS RecordCount,
    SUM(Cost) AS TotalCost
FROM MergedData;

-- 2. Check for duplicates
SELECT Date, SubscriptionId, ResourceId, COUNT(*) AS DuplicateCount
FROM MergedData
GROUP BY Date, SubscriptionId, ResourceId
HAVING COUNT(*) > 1;

-- 3. Validate date continuity
WITH DateGaps AS (
    SELECT
        Date,
        LEAD(Date) OVER (ORDER BY Date) AS NextDate,
        DATEDIFF(day, Date, LEAD(Date) OVER (ORDER BY Date)) AS DaysBetween
    FROM (SELECT DISTINCT Date FROM MergedData) AS Dates
)
SELECT * FROM DateGaps WHERE DaysBetween > 1;

-- 4. Check for null critical fields
SELECT
    COUNT(*) AS TotalRecords,
    SUM(CASE WHEN Date IS NULL THEN 1 ELSE 0 END) AS NullDates,
    SUM(CASE WHEN Cost IS NULL THEN 1 ELSE 0 END) AS NullCosts,
    SUM(CASE WHEN SubscriptionId IS NULL THEN 1 ELSE 0 END) AS NullSubscriptions
FROM MergedData;

Next Steps

Once data is merged:

  1. Create Power BI dashboards: Connect to merged dataset for year-over-year analysis
  2. Set up alerts: Monitor for anomalies using "How to Set Up Cost Alerts and Budgets in Azure"
  3. Automate refresh: Schedule daily merges to keep data current
  4. Optimize storage: Convert to Parquet or Delta format for better performance
  5. Secure data: Apply encryption and access controls per "How to Secure Cost Management Data in Azure Storage and Synapse"

Frequently Asked Questions

How do I handle schema mismatches between historical and current exports?

Use schema mapping to normalize field names before merging. For example, rename UsageDateTime to Date and PreTaxCost to Cost in your transformation scripts or Data Factory pipelines, and make sure all fields align with the latest schema. In Azure Data Factory, the TabularTranslator in a copy activity handles these mappings. Validate the merged dataset regularly to catch discrepancies early, especially after schema changes in the current exports.
