Lake-Formation-architecture

How to Build a Data Lake Using Lake Formation on AWS

Introduction

Briefly, AWS lake formation helps users to build, manage and secure their data lakes in a very short amount of time, meaning days instead of months as is common with a traditional data lake approach. AWS data lake formation builds on and works with the capabilities found in AWS Glue.

How it Works

Your root user can’t be your administrator for your data lake, so the first thing you want to do is create a new user that has full admin rights. Go to IAM and create that user and give them AdministratorAccess capability. Next, create an S3 bucket and any data directories you are going to use if you don’t already have something configured. Do that in the S3 segment of AWS as you would normally. If you already have an S3 location setup, you can skip that step. In either case, we then need to register that data lake location in Lake Formation. The Lake Formation menu looks like this:

Menu - AWS Lake Formation

Now with your Lake Formation registered data sources, you can create a database from those sources in Lake Formation, and from there, create your Glue Crawlers. The crawler will take that database that you created, and go into the S3 bucket, read the directory structure and files to create your tables and fields within the database. Once you’ve run your Crawler, you’ll see the tables and fields reflected under “Tables”. The crawler creates a meta-data catalog that provides the descriptions of the underlying data that is then presented to other tools to access, such as AWS Quicksight and Ahana Presto. Amazon provides this diagram:

AWS Lake Formation architecture

To summarize thus far, we’ve 

  • Created an admin user
  • Created an S3 bucket
  • Created three directories in the S3 bucket
  • Registered the S3 bucket as a data lake location

Benefits

Having your data repositories registered and then created as a database in Lake Formation provides a number of advantages in terms of centralization of work. Fundamentally, the role of Lake Formation is to control access to data that you register. A combination of IAM roles and “Data lake permissions” is how you control this on a more macro level. Amazon shows the flow this way:

data lake permissions flow

Where the major advantages lie however, are with the “LF-Tags” and “LF-tag permissions”. This is where your granular security can be applied in a way that will greatly simplify your life. Leveraging Lake Formation we have two ways to assign and manage permissions to our catalog resources. There is “Named” based access and “Tag” based access.

manage permissions in catalog

Named-based access is what most people are familiar with. You select the principal, which can be an AWS user or group of users, and assign it access to a specific database or table. The Tag-based access control method uses Lake Formation tags, called “LF Tags”. These are attributes that are assigned to the data catalog resources, such as databases, tables, and even columns, to principals in our AWS account to manage authorizations to these resources. This is especially helpful with environments that are growing and/or changing rapidly where policy management can be onerous. Tags are essentially Key/Value stores that define these permissions:

  • Tags can be up to 128 characters long
  • Values can be up to 256 characters long
  • Up to 15 values per tag
  • Up to 50 LF-Tags per resource

Use Cases

If we wanted to control access to an employee table for example, such that HR could see everything, everyone in the company could see the names, titles, and departments of employees, and the outside world could only see job titles, we could set that up as:

  • Key = Employees
  • Values = HR, corp, public

Using this simplified view as an example:

lake formation use case flow_example

We have resources “employees” and “sales”, each with multiple tables, with multiple named rows. In a conventional security model, you would give the HR group full access to the employees resource, but all of the corp group would only have access to the “details” table. What if you needed to give access to position.title and payroll.date to the corp group? We would simply add the corp group LF Tag to those fields in addition to the details table, and now they can read those specific fields out of the other two tables, in addition to everything they can read in the details table. The corp group LF Tag permissions would look like this:

  • employees.details
  • employees.position.title
  • employees.payroll.date

If we were to control by named resources, it would require that each named person would have to be specifically allocated access to those databases and tables, and often there is no ability to control by column, so that part wouldn’t even be possible at a data level.

Summary

AWS Lake Formation really simplifies the process whereby you set up and manage your data lake infrastructure. Where it really shines is in the granular security that can be applied through the use of LF Tags. An AWS Lake Formation tutorial that really gets into the nitty-gritty can be found online from AWS or any number of third parties on YouTube. The open-source data lake has many advantages over a data warehouse and Lake Formation can help establish best practices and simplify getting started dramatically.