The Fundamental Problems with Amazon Redshift
In the last article we discussed the difference between the data warehouse and Redshift Spectrum. To continue on this topic, let’s look at the problems with Amazon Redshift and some of the alternatives that data teams can explore further.
Amazon Redshift made it easy for anyone to implement data warehouse use cases in the cloud. However, it is unable to provide the same benefits as newer, more advanced cloud data warehouses. When it was a relatively new technology, everyone was going through a learning curve.
Here are some of the fundamental problems with Amazon Redshift:
AWS Redshift’s Cost
Amazon Redshift is a traditional MPP platform where compute is closely integrated with storage. In theory, the advantage of the cloud is that compute and storage are completely independent of each other and storage is virtually unlimited. With Redshift, however, if you want more storage you have to purchase more compute power. As data volumes increase, the cost of storage and compute in the warehouse becomes challenging. AWS products, particularly Redshift and Spectrum, come at a premium cost. This is especially true if you use Spectrum outside of AWS Redshift. The result is one of the most expensive cloud data warehouse solutions.
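To make the coupling concrete, here is a minimal sketch of how cluster size behaves when storage and compute are bought together per node. The function name and the per-node storage figure are hypothetical and chosen only for illustration; they are not actual Redshift node specifications.

```python
import math


def nodes_required(data_tb: float, compute_nodes_needed: int,
                   storage_tb_per_node: float) -> int:
    """Cluster size when each node bundles a fixed amount of storage with compute.

    All figures are hypothetical; real node types and sizes differ.
    """
    if storage_tb_per_node <= 0:
        raise ValueError("storage_tb_per_node must be positive")
    # Nodes needed just to hold the data.
    nodes_for_storage = math.ceil(data_tb / storage_tb_per_node)
    # The cluster must satisfy both requirements, so the larger one wins.
    return max(nodes_for_storage, compute_nodes_needed)


if __name__ == "__main__":
    # The query workload stays constant at 4 nodes, but data grows from 4 TB to 40 TB.
    for data_tb in (4, 10, 40):
        print(data_tb, "TB ->", nodes_required(data_tb, 4, 2.0), "nodes")
```

In this sketch the workload never needs more than 4 nodes of compute, yet the node count keeps rising with data volume, which is the cost pattern described above.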
Vendor lock-in with Redshift
Data warehouse vendors, like AWS, make it difficult to use your data outside of their services. Data would need to be pulled out of the warehouse and duplicated, further driving up compute costs.
Proprietary data formats
Data architects, data engineers, and analysts are required to use the data format supported by the data warehouse. No flexible or open data formats are available.
No Staging Area in Redshift
It is expensive to host data with Amazon, so duplication of data has to be avoided at all costs. In traditional RDBMS systems, we tend to have landing, staging, and warehouse layers in the same database. But for Amazon’s data warehouse, the landing and staging layers have to be on S3. Only the data on which reports and analytics will be built should be loaded into Redshift. This should happen on an as-needed basis rather than by keeping the entire dataset in the warehouse.
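As a rough sketch of that pattern, the helper below builds a Redshift COPY statement that loads one staged S3 prefix into a reporting table. The table name, bucket, and IAM role ARN are hypothetical placeholders, the Parquet format is an assumption about how the staged files are stored, and the inputs are assumed to be trusted values rather than user input.

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a COPY statement for loading staged Parquet files from S3.

    Inputs are interpolated directly, so they must come from trusted
    configuration, not from end users.
    """
    if not s3_prefix.startswith("s3://"):
        raise ValueError("s3_prefix must be an s3:// URI")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


if __name__ == "__main__":
    # Hypothetical names: only the slice needed for reporting is loaded.
    print(build_copy_statement(
        "reporting.daily_sales",
        "s3://example-staging-bucket/sales/2024-01/",
        "arn:aws:iam::123456789012:role/example-redshift-load",
    ))
```

The generated statement would then be executed through whatever SQL client the team already uses; the rest of the staged data stays on S3.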
No Index support in Amazon Redshift
This warehouse does not support indexes like other data warehouse systems, so it performs best when you select only the columns that you absolutely need to query. Because Amazon’s data warehouse uses columnar storage spread across the nodes of a cluster, a construct called the distribution key needs to be used. It is simply the column whose values determine how data is distributed across the different nodes of the cluster.
Performance issues that require ongoing maintenance (VACUUM and ANALYZE, sort keys, compression, distribution styles, etc.)
Tasks like VACUUM and ANALYZE need to be run regularly, and they are expensive and time-consuming. There is no single frequency that suits every workload, so a quick cost-benefit analysis is needed before deciding how often to run them.
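The idea of a distribution key can be illustrated with a small simulation. This is a sketch only: it uses CRC32 as a stand-in hash, which is an assumption and not the hash Redshift uses internally, and the function names are hypothetical.

```python
import zlib
from collections import Counter
from typing import Iterable, List


def assign_node(key_value: str, num_nodes: int) -> int:
    """Map a distribution-key value to a node index with a stable hash."""
    return zlib.crc32(key_value.encode("utf-8")) % num_nodes


def rows_per_node(key_values: Iterable[str], num_nodes: int) -> List[int]:
    """Count how many rows land on each node for the given key values."""
    counts = Counter(assign_node(v, num_nodes) for v in key_values)
    return [counts.get(n, 0) for n in range(num_nodes)]


if __name__ == "__main__":
    # A high-cardinality key (for example, an order id) spreads rows out.
    print(rows_per_node((f"order-{i}" for i in range(1000)), 4))
    # A key with a single repeated value puts every row on one node (skew).
    print(rows_per_node(["US"] * 1000, 4))
```

Rows with the same key value always land on the same node, which is why a low-cardinality column makes a poor distribution key: a few nodes end up doing most of the work.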
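One way to approach that cost-benefit analysis is to run maintenance only on tables that appear to need it. The sketch below assumes you have already fetched per-table rows with the `table`, `unsorted`, and `stats_off` columns (as exposed by the SVV_TABLE_INFO system view); the threshold values are hypothetical starting points, not recommendations.

```python
from typing import Iterable, List, Mapping, Optional


def maintenance_statements(
    tables: Iterable[Mapping[str, Optional[object]]],
    unsorted_threshold: float = 20.0,   # hypothetical: percent of unsorted rows
    stats_off_threshold: float = 10.0,  # hypothetical: staleness of statistics
) -> List[str]:
    """Return VACUUM/ANALYZE statements only for tables above the thresholds."""
    statements: List[str] = []
    for row in tables:
        name = row["table"]
        # Missing (None) values are treated as 0, so no maintenance is triggered.
        unsorted = float(row.get("unsorted") or 0.0)
        stats_off = float(row.get("stats_off") or 0.0)
        if unsorted > unsorted_threshold:
            statements.append(f"VACUUM {name};")
        if stats_off > stats_off_threshold:
            statements.append(f"ANALYZE {name};")
    return statements


if __name__ == "__main__":
    sample = [
        {"table": "sales", "unsorted": 35.0, "stats_off": 2.0},
        {"table": "events", "unsorted": 1.0, "stats_off": 50.0},
        {"table": "dim_date", "unsorted": None, "stats_off": 0.0},
    ]
    for statement in maintenance_statements(sample):
        print(statement)
```

Tuning the two thresholds is where the trade-off lives: lower values keep tables healthier but spend more cluster time on maintenance.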
Disk space capacity planning
Control over disk space is a must with Amazon Redshift, especially when you’re dealing with analytical workloads. There is a high chance of oversubscribing the system, and reduced disk space not only degrades query performance but also makes the cluster cost prohibitive. Having a cluster filled above 75% isn’t good for performance.
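A simple check against the 75% guideline might look like the sketch below. The function names and the sample figures are hypothetical; the used and total capacity values are assumed to come from whatever monitoring you already have in place.

```python
def fill_fraction(used_gb: float, capacity_gb: float) -> float:
    """Fraction of cluster disk capacity currently in use."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return used_gb / capacity_gb


def headroom_gb(used_gb: float, capacity_gb: float,
                threshold: float = 0.75) -> float:
    """GB that can still be added before crossing the threshold.

    A negative result means the cluster is already above the threshold.
    """
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return capacity_gb * threshold - used_gb


if __name__ == "__main__":
    used, capacity = 600.0, 1000.0  # hypothetical figures
    print(f"fill: {fill_fraction(used, capacity):.0%}")
    print(f"headroom before 75%: {headroom_gb(used, capacity):.0f} GB")
```

Tracking the headroom figure over time gives an early signal for when to resize the cluster or move cold data out to S3.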
Concurrent query limitation
Above 10 concurrent queries, you start seeing issues. Concurrency scaling may mitigate queue times during bursts of queries. However, simply enabling concurrency scaling didn’t fix all of our concurrency problems. The limited impact is likely due to the constraints on the types of queries that can use concurrency scaling. For example, we have a lot of tables with interleaved sort keys, and much of our workload is writes.
These were some of the fundamental problems voiced by users that you need to keep in mind while using or exploring Amazon Redshift. If you are searching for more information about the challenges or benefits of AWS products, check out the next article in this series, about AWS and query limitations.
Comparing AWS Redshift?
See how the alternatives rank
AWS’ data warehouse is a completely managed cloud service with the ability to scale on-demand. However, the pricing is not simple, since it tries to accommodate different use cases and customers.
At its heart, it is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.