AWS Lake Formation vs AWS Glue – What are the differences?
Last updated: October 2022
As you start building your analytics stack in AWS, there are several AWS technologies to understand. In this article we’ll discuss two key technologies:
- AWS Lake Formation for security and governance; and
- AWS Glue, a data catalog and ETL service.
While both of these services are typically used to build, manage, and operationalize AWS data lakes, they fulfil completely different roles. AWS Lake Formation is built around AWS Glue, and both services share the same AWS Glue Data Catalog; however, Lake Formation provides a wider breadth of governance and data management functionality, whereas Glue is focused on ETL and data processing.
What is AWS Lake Formation?
AWS Lake Formation makes it easier for you to build, secure, and manage data lakes. It provides a means to address some of the challenges around unstructured data lake storage – including security, access control, governance, and performance.
How it works: AWS Lake Formation gives you a central console where you can discover data sources, set up transformation jobs to move data to an Amazon Simple Storage Service (S3) data lake, remove duplicates and match records, catalog data for access by analytics tools, configure data access and security policies, and audit and control access from AWS analytics and ML services.
For AWS users who want to add governance to their data lake, AWS Lake Formation makes it possible to set up a secure data lake in a matter of days.
To provide better query performance when using services such as Athena or Presto, Lake Formation creates Glue workflows that integrate source tables, extract the data, and load it into an Amazon S3 data lake.
When should you use AWS Lake Formation?
At its core, Lake Formation is built to simplify the process of moving your data to a data lake, cataloging the data, and making it available for querying. Typical scenarios where this comes into play include:
- Build data lakes quickly – this means days, not months. You can move, store, update and catalog your data faster, plus automatically organize and optimize your data.
- Add authorization to your data lake – You can centrally define and enforce security, governance, and auditing policies.
- Make data easy to discover and share – Catalog all of your company’s data assets and easily share datasets between consumers.
To understand how this works in practice, check out our article on using Redshift Spectrum in Lake Formation.
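As a rough illustration of the centralized access control described above, here is a minimal sketch using boto3, the AWS SDK for Python. It builds a request for the Lake Formation grant_permissions call that gives a principal SELECT access to a table, optionally limited to specific columns. The role ARN, database, table, and column names are hypothetical placeholders, and the helper function names are our own rather than part of any AWS API.

```python
def build_select_grant(principal_arn, database, table, columns=None):
    """Build keyword arguments for Lake Formation's GrantPermissions API."""
    if columns:
        # Column-level grant: only the listed columns become queryable.
        resource = {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        }
    else:
        # Table-level grant: every column in the table is covered.
        resource = {"Table": {"DatabaseName": database, "Name": table}}
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": resource,
        "Permissions": ["SELECT"],
    }


def apply_grant(request):
    """Send the grant to Lake Formation (needs AWS credentials and a region)."""
    import boto3  # imported here so the builder above works without AWS access

    return boto3.client("lakeformation").grant_permissions(**request)


# Hypothetical principal and table names, for illustration only.
request = build_select_grant(
    "arn:aws:iam::123456789012:role/AnalystRole",
    database="sales_db",
    table="orders",
    columns=["order_id", "order_date"],
)
print(request["Resource"])
```

Keeping the request builder separate from the call that contacts AWS makes the policy easy to review before it is applied; apply_grant(request) would then submit it.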
What is AWS Glue?
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and join data for analytics, machine learning, and application development. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog which discovers and catalogs metadata about your data stores or data lake. Using the AWS Glue Data Catalog, users can easily find and access data.
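To give a sense of what finding data through the Data Catalog can look like programmatically, the sketch below uses the boto3 Glue client's get_tables paginator and reduces each page to a table name, storage location, and column list. The database name, bucket, and sample page are hypothetical, and the helper names are ours; the sample page only mimics the parts of the response this sketch reads.

```python
def summarize_tables(pages):
    """Flatten GetTables response pages into (name, location, columns) tuples."""
    summary = []
    for page in pages:
        for table in page.get("TableList", []):
            storage = table.get("StorageDescriptor", {})
            columns = [col["Name"] for col in storage.get("Columns", [])]
            summary.append((table["Name"], storage.get("Location"), columns))
    return summary


def fetch_table_pages(database):
    """Return an iterator over GetTables pages (needs AWS credentials)."""
    import boto3  # imported lazily so summarize_tables works offline

    paginator = boto3.client("glue").get_paginator("get_tables")
    return paginator.paginate(DatabaseName=database)


# Hand-written sample page with hypothetical names, standing in for
# fetch_table_pages("sales_db") when no AWS account is available.
sample_pages = [
    {
        "TableList": [
            {
                "Name": "orders",
                "StorageDescriptor": {
                    "Location": "s3://example-bucket/orders/",
                    "Columns": [
                        {"Name": "order_id", "Type": "bigint"},
                        {"Name": "order_date", "Type": "date"},
                    ],
                },
            }
        ]
    }
]
print(summarize_tables(sample_pages))
```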
Glue ETL can be used to run managed Apache Spark jobs in order to prepare the data for analytics, perform transformations, compact data, and convert it into columnar formats such as Apache Parquet.
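As one hedged example, a Glue ETL job that already exists in your account can be started from Python with the boto3 start_job_run call. The job name, S3 paths, and the two job arguments below are hypothetical; the job's own Spark script would have to read them (for instance with getResolvedOptions from awsglue.utils) and perform the actual Parquet conversion.

```python
def build_job_run_request(job_name, source_path, target_path):
    """Build keyword arguments for the Glue StartJobRun API."""
    return {
        "JobName": job_name,
        # Job arguments are passed as strings; the "--" prefix is how Glue
        # expects custom parameter names to be written.
        "Arguments": {
            "--source_path": source_path,
            "--target_path": target_path,
        },
    }


def start_conversion(request):
    """Start the job run and return its ID (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without AWS access

    response = boto3.client("glue").start_job_run(**request)
    return response["JobRunId"]


# Hypothetical job name and S3 locations, for illustration only.
request = build_job_run_request(
    "convert-orders-to-parquet",
    source_path="s3://example-bucket/raw/orders/",
    target_path="s3://example-bucket/curated/orders/",
)
print(request)
```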
Read more: What’s the difference between Athena and Glue?
When should you use AWS Glue?
To make data in your data lake accessible, some type of data catalog is essential. Glue is often the default option as it’s well-integrated into the broader AWS ecosystem, although you could consider open-source alternatives such as Apache Iceberg. Glue ETL is one option to process data, where alternatives might include running your own Spark cluster on Amazon EMR or using Databricks.
Typical scenarios where you might use Glue include:
- Create a unified data catalog to find data across multiple data stores – View the Data Catalog to quickly search and discover the datasets that you own, and maintain the relevant metadata in one central repository.
- Data Catalog for data lake analytics with S3 – Organize, cleanse, validate, and format data for storage in a data warehouse or data lake.
- Build ETL pipelines to ingest data into your S3 data lake.
The data workflows initiated from an AWS Lake Formation blueprint are executed as AWS Glue jobs. You can view and manage these workflows in either the Lake Formation console or the AWS Glue console.
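Besides the two consoles, such a workflow can also be inspected through the Glue API. The sketch below assumes a workflow named "orders-import" (a hypothetical name) and uses the boto3 get_workflow call with IncludeGraph=True to list the Glue jobs it contains. The sample dictionary mimics only the fields this sketch reads, and the helper names are ours.

```python
def job_names(workflow):
    """Return the sorted names of the job nodes in a Glue workflow graph."""
    nodes = workflow.get("Graph", {}).get("Nodes", [])
    return sorted(node["Name"] for node in nodes if node.get("Type") == "JOB")


def fetch_workflow(name):
    """Fetch a workflow and its graph from Glue (needs AWS credentials)."""
    import boto3  # imported lazily so job_names can be used offline

    response = boto3.client("glue").get_workflow(Name=name, IncludeGraph=True)
    return response["Workflow"]


# Hand-written sample with hypothetical node names, standing in for
# fetch_workflow("orders-import") when no AWS account is available.
sample_workflow = {
    "Name": "orders-import",
    "Graph": {
        "Nodes": [
            {"Type": "TRIGGER", "Name": "orders-import-start"},
            {"Type": "CRAWLER", "Name": "orders-import-discover"},
            {"Type": "JOB", "Name": "orders-import-load"},
        ]
    },
}
print(job_names(sample_workflow))
```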
AWS Lake Formation vs AWS Glue: A Summary
AWS Lake Formation simplifies security and governance on the data lake, whereas AWS Glue simplifies metadata management and data discovery for data lake analytics. Both services are used as data lake building blocks, and they are complementary: Glue provides the basic functionality needed to enable analytics, including data cataloging and ETL; Lake Formation offers a simplified way to manage your data lake, including the underlying Glue jobs.
Check out our community roundtable where we discuss how you can build a simple data lake with the new stack: Presto + Apache Hudi + AWS Glue and S3 = The PHAS3 stack