I use RDS Postgres databases and need some complex queries done which tend to slow down my databases for everyone else on the system. What do I need to consider as I add a data lake for the analytics?
Many medium-sized companies start out using one of the six flavors of Amazon RDS: Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server. As they grow they can have numerous instances of RDS with numerous databases for each customer, whether internal or external. When one customer tries a large analytic query, that workload can cause problems with the RDS cluster, perhaps making it to drop other workloads, fail, or slow down for others. As the needs for processing huge amounts of data increases, so does the need to take your analytics to the next level.
In addition to your operational databases, the idea is to have a much more open analytics stack where you have the ability to run different kinds of processing on the same data. A modern analytics stack lets your organization have much more insights without impacting your operational side. And doing that with open data formats is another key consideration.
There’s a couple options for evolving your analytics stack. One would be to use a cloud data warehouse like Amazon Redshift or Snowflake. Another would be to use open formats in a data lake with a modern SQL query engine. In the first case, there are some advantages of having the highest performance possible on your data but it comes at a certain cost as well as an amount of lock-in, as you cannot easily get at the data in proprietary formats. Considering the data lake with query engine option, we believe that Presto is one of the best choices because of its performance, scalability, and flexibility to connect to S3-based data lakes and federate other data sources as well.
So our recommendation would be to run Presto on top of data stored in an open Parquet or ORC format in S3. Doing it this way, you can put other engines on top of the data as needed, do you’re not going to face a lot of rework in the future should you decide to change something.
From OLTP to Data Lake Analytics
The high level concept is to have an initial one-time bulk migration of your data in OLTP databases to get a copy moved in to S3. After that, as your operational databases will continue to generate or change data, you’ll need establish a pipeline, a stream, or a Change Data Capture (CDC) process in place to get those into S3. BTW, not often will you want data going back from S3 into your relational databases.
While there are different ways to pipe data into S3, one AWS recommended approach is to use the AWS Database Migration Service a.k.a. DMS(much in the same way you may have used it when you migrated off-prem). With AWS Database Migration Service, you can get the first one-time bulk load and then continuously replicate your data with high availability and stream data to Amazon S3. AWS DMS would run in the background and handle all of the data changes for you. You can pick the instance and what period of time you’d like it to run, for example, you may want it to run hourly partitions or daily partitions. That’ll depend on how fast your data is changing and what your requirements are for analytics as well.
Next you’ll want to install Presto on top, and for that you can build a Presto cluster yourself, or simply use Ahana Cloud for Presto to create pre-configured clusters in about 30 minutes.
It’s also worth noting that after you’ve ingested the data into S3 based on what you think the most optimized format or folder structure would be, you may find out that you you need it different. In that case, not to worry, you can use Presto itself to do data lake transformations as well, using SQL you can do a CTAS, Create Table As Select:
The CREATE TABLE AS SELECT (CTAS) statement is one of the most important SQL features available. CTAS is a parallel operation that creates a new table based on the output of a SELECT statement. CTAS is the simplest and fastest way to create and insert data into a table with a single command.
By adding a modern analytics stack to your operational databases, you evolve your analytics capabilities and deliver more insights for better decisions. We suggest moving to an open data lake analytics reference architecture with Presto. This allow a meta analysis of all your data, giving a look at broader trends across databases and other data sources.