Analytix 360

Querying Row Counts Across Environments in Snowflake: Some Handy SQL for Testing

When working with complex data ecosystems, it's essential to ensure that your data is consistent across layers and environments. In many modern data platforms, particularly those built on Snowflake, you may have tiered environments like Bronze, Silver, and Gold. These environments often serve different purposes:

  • Bronze for raw data ingestion,
  • Silver for cleaned and enriched data,
  • Gold for final, ready-for-analysis datasets.

One useful testing query involves retrieving metadata about tables, specifically row counts, across these environments to verify consistency or identify discrepancies. In this post, I'll walk through a simple but powerful SQL query that helps you achieve that.

The Query Breakdown

Here’s a query that you can use to pull the catalog, schema, table name, and row count for tables across multiple environments:

USE ROLE ACCOUNTADMIN;

SELECT
    table_catalog,
    table_schema,
    table_name,
    row_count
FROM 
    prod_bronze.information_schema.tables
WHERE 
    table_schema <> 'INFORMATION_SCHEMA'

UNION ALL

SELECT
    table_catalog,
    table_schema,
    table_name,
    row_count
FROM 
    prod_silver.information_schema.tables
WHERE 
    table_schema <> 'INFORMATION_SCHEMA'

UNION ALL

SELECT
    table_catalog,
    table_schema,
    table_name,
    row_count
FROM 
    prod_gold.information_schema.tables
WHERE 
    table_schema <> 'INFORMATION_SCHEMA'

ORDER BY 
    table_catalog, 
    table_schema, 
    table_name;

Query Explanation:

  1. Use Role AccountAdmin:
    The script starts by switching to the ACCOUNTADMIN role, which provides full visibility across databases. This matters because we need to see the INFORMATION_SCHEMA of every environment; any role with sufficient privileges on all three databases will also work, since INFORMATION_SCHEMA only returns objects the current role can see.
  2. Union All Across Environments:
    We use UNION ALL to merge data from the PROD_BRONZE, PROD_SILVER, and PROD_GOLD environments. These are the different stages of data processing. The query looks into each environment's INFORMATION_SCHEMA.TABLES to retrieve the table metadata.
  3. Excluding INFORMATION_SCHEMA:
    The WHERE table_schema <> 'INFORMATION_SCHEMA' condition ensures that we only get tables that hold actual data, ignoring the metadata views. Snowflake's INFORMATION_SCHEMA is useful for querying metadata, but we don't want it in the results for this use case. Note that views in your own schemas still appear in TABLES with a NULL ROW_COUNT; add AND table_type = 'BASE TABLE' if you only want physical tables.
  4. Ordering the Results:
    Finally, the ORDER BY TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME clause sorts the output by catalog, schema, and table name, making it easier to navigate and interpret. The single ORDER BY applies to the entire UNION ALL result, not just the last SELECT.
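If you have more (or differently named) environments than the three shown here, you can generate the same UNION ALL pattern programmatically instead of copy-pasting each branch. A minimal Python sketch; the database names are just the examples from this post:

```python
def build_row_count_query(databases):
    """Build a UNION ALL query over each database's INFORMATION_SCHEMA.TABLES."""
    selects = [
        (
            "SELECT table_catalog, table_schema, table_name, row_count\n"
            f"FROM {db}.information_schema.tables\n"
            "WHERE table_schema <> 'INFORMATION_SCHEMA'"
        )
        for db in databases
    ]
    # One ORDER BY at the end sorts the combined result of all branches
    return (
        "\n\nUNION ALL\n\n".join(selects)
        + "\n\nORDER BY table_catalog, table_schema, table_name;"
    )

print(build_row_count_query(["prod_bronze", "prod_silver", "prod_gold"]))
```

You can then pass the generated string to whatever client you use to run SQL against Snowflake.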

Practical Use Cases

Here’s how you can use this query in real-world scenarios:

1. Testing Data Consistency:

When data is moved or transformed across environments (Bronze → Silver → Gold), it's critical to ensure that row counts match expected values at each stage. This query provides a quick check to identify any significant differences in row counts between environments.
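As a sketch of that check, the following Python snippet takes rows shaped like the query's output and flags tables whose counts disagree across environments. It assumes identical schema and table names in each catalog, and that an exact match is expected; pipelines that deduplicate or aggregate will legitimately differ, so treat mismatches as a prompt to investigate, not proof of a bug:

```python
from collections import defaultdict

def find_count_mismatches(rows):
    """Flag tables whose row counts differ across environments.

    rows: iterable of (table_catalog, table_schema, table_name, row_count),
    i.e. the shape returned by the query above.
    Returns {(schema, table): {catalog: row_count}} for disagreeing tables.
    """
    by_table = defaultdict(dict)
    for catalog, schema, table, count in rows:
        by_table[(schema, table)][catalog] = count
    return {
        key: counts
        for key, counts in by_table.items()
        if len(set(counts.values())) > 1
    }

rows = [
    ("PROD_BRONZE", "SALES", "ORDERS", 1000),
    ("PROD_SILVER", "SALES", "ORDERS", 1000),
    ("PROD_GOLD",   "SALES", "ORDERS", 998),  # two rows lost somewhere
]
print(find_count_mismatches(rows))
# -> {('SALES', 'ORDERS'): {'PROD_BRONZE': 1000, 'PROD_SILVER': 1000, 'PROD_GOLD': 998}}
```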

2. Monitoring Data Pipelines:

In a development or testing environment, running this query regularly helps confirm that the pipelines moving data between environments (whether built with Matillion, dbt, or native Snowflake tasks) are functioning as expected.
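To run the check on a schedule inside Snowflake itself, one option is a native task that snapshots the counts into a logging table. This is only a sketch; the database, schema, table, task, and warehouse names below are placeholders you'd replace with your own:

```sql
-- Hypothetical log table for periodic row-count snapshots
CREATE TABLE IF NOT EXISTS admin_db.audit.row_count_log (
    captured_at   TIMESTAMP_NTZ,
    table_catalog VARCHAR,
    table_schema  VARCHAR,
    table_name    VARCHAR,
    row_count     NUMBER
);

-- Daily snapshot of one environment; repeat (or UNION ALL) for the others
CREATE OR REPLACE TASK admin_db.audit.capture_bronze_counts
    WAREHOUSE = compute_wh
    SCHEDULE  = 'USING CRON 0 6 * * * UTC'
AS
    INSERT INTO admin_db.audit.row_count_log
    SELECT CURRENT_TIMESTAMP(), table_catalog, table_schema, table_name, row_count
    FROM prod_bronze.information_schema.tables
    WHERE table_schema <> 'INFORMATION_SCHEMA';

-- Tasks are created suspended; resume to start the schedule
ALTER TASK admin_db.audit.capture_bronze_counts RESUME;
```

Snapshots over time also let you chart growth per table, which a one-off query can't.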

3. Database Audits:

For auditing purposes, particularly when dealing with data governance policies, tracking row counts and table names across environments helps ensure compliance and provides a clear picture of data movement.

4. Debugging Pipeline Issues:

If a job fails or produces unexpected results, this query helps you debug by quickly checking whether the data in each environment looks the way it should, pinpointing which stage ran into problems.
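When triaging a failed job, a per-environment summary is often faster to scan than the full table list. One way to get it, under the same assumptions as the main query, is to aggregate by catalog:

```sql
SELECT
    table_catalog,
    COUNT(*)       AS table_count,
    SUM(row_count) AS total_rows
FROM
    prod_bronze.information_schema.tables
WHERE
    table_schema <> 'INFORMATION_SCHEMA'
GROUP BY
    table_catalog;
-- Repeat with UNION ALL for prod_silver and prod_gold, as in the main query
```

A sudden drop in total_rows for one environment narrows the search before you drill into individual tables.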

Conclusion

This query is a simple yet effective tool for tracking table metadata and row counts across environments in Snowflake. Whether you're testing, auditing, or debugging data pipelines, it provides a quick way to understand the shape and size of your data at each stage of processing. Feel free to adapt and expand this query to suit your needs—adding more environments or columns to get further insights into your data.

Happy querying!