AWS Glue Data Catalog now automates the generation of statistics for new tables. These statistics are integrated with the cost-based optimizers (CBO) of Amazon Redshift and Amazon Athena, resulting in improved query performance and potential cost savings.
Query engines such as Amazon Redshift and Amazon Athena use table statistics to determine the most efficient way to execute a query. Previously, creating statistics for Apache Iceberg tables in AWS Glue Data Catalog required you to continuously monitor and update configurations for your tables. Now, AWS Glue Data Catalog lets you generate statistics automatically for new tables with a one-time catalog configuration. To get started, select the default catalog in the Lake Formation console and enable table statistics on the table optimization configuration tab. As new tables are created or existing tables are updated, statistics are generated for all columns from a sample of rows and are refreshed periodically. For Apache Iceberg tables, these statistics include the number of distinct values (NDVs). For other file formats such as Parquet, additional statistics are collected, such as the number of nulls, maximum and minimum values, and average length. Amazon Redshift and Amazon Athena use the updated statistics to apply optimizations such as optimal join ordering or cost-based aggregation pushdown. The AWS Glue Data Catalog console gives you visibility into the updated statistics and the statistics generation runs.
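To illustrate why NDV statistics matter for join ordering, here is a minimal sketch of the textbook equi-join cardinality estimate, rows(L) * rows(R) / max(NDV(L), NDV(R)), applied to two candidate plans. It is not the actual optimizer logic of Amazon Redshift or Amazon Athena. The table names, row counts, and NDVs are invented for illustration, and the sketch assumes join-key NDVs carry over unchanged into intermediate results.

```python
def estimate_join_rows(left_rows, right_rows, left_ndv, right_ndv):
    """Textbook equi-join estimate: |L| * |R| / max(NDV_L, NDV_R)."""
    return left_rows * right_rows // max(left_ndv, right_ndv)


# Hypothetical catalog statistics (all values are made up for this sketch).
orders = {"rows": 1_000_000, "ndv": {"customer_id": 50_000}}
customers = {"rows": 50_000, "ndv": {"customer_id": 50_000, "region_id": 20}}
regions = {"rows": 20, "ndv": {"region_id": 20}}

# Plan A: (orders JOIN customers ON customer_id) JOIN regions ON region_id
a_step1 = estimate_join_rows(
    orders["rows"], customers["rows"],
    orders["ndv"]["customer_id"], customers["ndv"]["customer_id"],
)
a_step2 = estimate_join_rows(
    a_step1, regions["rows"],
    customers["ndv"]["region_id"], regions["ndv"]["region_id"],
)
# Cost model assumption: sum of estimated rows produced by each join step.
plan_a_cost = a_step1 + a_step2

# Plan B: (customers JOIN regions ON region_id) JOIN orders ON customer_id
b_step1 = estimate_join_rows(
    customers["rows"], regions["rows"],
    customers["ndv"]["region_id"], regions["ndv"]["region_id"],
)
b_step2 = estimate_join_rows(
    b_step1, orders["rows"],
    customers["ndv"]["customer_id"], orders["ndv"]["customer_id"],
)
plan_b_cost = b_step1 + b_step2

# Pick the plan with the smaller estimated intermediate work.
best_plan = "A" if plan_a_cost <= plan_b_cost else "B"
print(f"plan A cost={plan_a_cost}, plan B cost={plan_b_cost}, best={best_plan}")
```

Under these made-up numbers, joining the two smaller tables first produces a much smaller intermediate result. Without NDVs, an optimizer would have no basis for telling the two plans apart, which is why stale or missing statistics can lead to poor join orders.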
Automated statistics generation for AWS Glue Data Catalog is generally available in the following AWS Regions: US East (N. Virginia, Ohio), US West (N. California, Oregon), Europe (Ireland), and Asia Pacific (Tokyo). Read the blog post and visit the AWS Glue Data Catalog documentation to learn more.