Querying Row Counts Across Environments in Snowflake: Some Handy SQL for Testing
When working with complex data ecosystems, it's essential to ensure that your data across various layers and environments is consistent. In many modern data platforms, particularly those utilizing Snowflake, you may have tiered environments like Bronze, Silver, and Gold. These environments often serve different purposes:
- Bronze for raw data ingestion,
- Silver for cleaned and enriched data,
- Gold for final, ready-for-analysis datasets.
One useful testing query involves retrieving metadata about tables, specifically row counts, across these environments to verify consistency or identify discrepancies. In this post, I'll walk through a simple but powerful SQL query that helps you achieve that.
The Query Breakdown
Here’s a query that you can use to pull the catalog, schema, table name, and row count for tables across multiple environments:
SELECT
table_catalog,
table_schema,
table_name,
row_count
FROM
prod_bronze.information_schema.tables
WHERE
table_schema <> 'INFORMATION_SCHEMA'
UNION ALL
SELECT
table_catalog,
table_schema,
table_name,
row_count
FROM
prod_silver.information_schema.tables
WHERE
table_schema <> 'INFORMATION_SCHEMA'
UNION ALL
SELECT
table_catalog,
table_schema,
table_name,
row_count
FROM
prod_gold.information_schema.tables
WHERE
table_schema <> 'INFORMATION_SCHEMA'
ORDER BY
table_catalog,
table_schema,
table_name;
Query Explanation:
- Use Role AccountAdmin:
We start by switching to theACCOUNTADMIN
role, which provides full access across databases. This is important because we need visibility into theINFORMATION_SCHEMA
across multiple environments. - Union All Across Environments:
We useUNION ALL
to merge data from thePROD_BRONZE
,PROD_SILVER
, andPROD_GOLD
environments. These are the different stages of data processing. The query looks into each environment'sINFORMATION_SCHEMA.TABLES
to retrieve the table metadata. - Excluding INFORMATION_SCHEMA:
TheWHERE table_schema <> 'INFORMATION_SCHEMA'
condition ensures that we only get tables that hold actual data, ignoring metadata tables. Snowflake'sINFORMATION_SCHEMA
is useful for querying metadata, but we don't want it in the results for this particular use case. - Ordering the Results:
Finally, theORDER BY TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME
ensures that the output is neatly organized by catalog, schema, and table name, making it easier to navigate and interpret.
Practical Use Cases
Here’s how you can use this query in real-world scenarios:
1. Testing Data Consistency:
When data is moved or transformed across environments (Bronze → Silver → Gold), it's critical to ensure that row counts match expected values at each stage. This query provides a quick check to identify any significant differences in row counts between environments.
2. Monitoring Data Pipelines:
In a development or testing environment, running this query regularly helps ensure that pipelines between the various environments (like Matillion, dbt, or native Snowflake tasks) are functioning as expected.
3. Database Audits:
For auditing purposes, particularly when dealing with data governance policies, tracking row counts and table names across different environments helps ensure compliance and provide a clear lineage of data movements.
4. Debugging Pipeline Issues:
If a job fails or produces unexpected results, this query helps debug issues by allowing you to quickly check if the data across environments is what it should be, pinpointing which environment might have encountered problems.
Conclusion
This query is a simple yet effective tool for tracking table metadata and row counts across environments in Snowflake. Whether you're testing, auditing, or debugging data pipelines, it provides a quick way to understand the shape and size of your data at each stage of processing. Feel free to adapt and expand this query to suit your needs—adding more environments or columns to get further insights into your data.
Happy querying!