The Fundamental Problems with Amazon Redshift
In the last article we discussed the difference between Redshift and Redshift Spectrum. In this article, let’s look at the problems with Amazon Redshift and some of the available alternatives.
Amazon Redshift made it easy for anyone to implement data warehouse use cases in the cloud. However, it is unable to provide the same benefits as newer, more advanced cloud data warehouses. When Redshift was a relatively new technology, everyone was going through a learning curve.
Here are some of the fundamental problems with Amazon Redshift:
AWS Redshift’s Cost
Amazon Redshift is a traditional MPP platform where compute is tightly coupled with storage. The advantage of the cloud is that, in theory, compute and storage are completely independent of each other and storage is virtually unlimited. With Redshift, however, if you want more storage you have to purchase more compute power. As data volumes increase, the combined cost of storage and compute in the warehouse becomes hard to manage. AWS Redshift and Redshift Spectrum also come at a premium, especially if you use Spectrum outside of AWS Redshift. This makes Amazon Redshift one of the most expensive cloud data warehouse solutions.
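To make the coupling concrete, here is a minimal sketch of the arithmetic, assuming storage can only be bought as whole nodes. The per-node storage and hourly price are hypothetical round numbers, not actual AWS pricing, and the function names are made up for illustration.

```python
import math


def nodes_needed(data_tb: float, storage_per_node_tb: float) -> int:
    """Nodes required just to hold the data when storage is bundled with compute."""
    return max(1, math.ceil(data_tb / storage_per_node_tb))


def monthly_cost(data_tb: float, storage_per_node_tb: float,
                 price_per_node_hour: float, hours_per_month: int = 730) -> float:
    """Monthly bill when every extra node (and its compute) is paid for in full."""
    nodes = nodes_needed(data_tb, storage_per_node_tb)
    return nodes * price_per_node_hour * hours_per_month


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
STORAGE_PER_NODE_TB = 2.0
PRICE_PER_NODE_HOUR = 1.0

for data_tb in (4, 8, 16):
    nodes = nodes_needed(data_tb, STORAGE_PER_NODE_TB)
    cost = monthly_cost(data_tb, STORAGE_PER_NODE_TB, PRICE_PER_NODE_HOUR)
    print(f"{data_tb} TB -> {nodes} nodes, {cost:.0f} per month")
```

Under these assumptions the bill grows in step with data volume, even if the query workload (and so the compute actually needed) stays flat.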
Vendor lock-in with Redshift
Data warehouse services like AWS Redshift make it difficult to use your data outside of the service. Data would need to be pulled out of the warehouse and duplicated, further driving up compute costs.
Proprietary data formats
Data architects, data engineers, and analysts are required to use the data format supported by the data warehouse. No flexible or open data formats are available.
No Staging Area in Redshift
It is expensive to host data in Amazon Redshift, so duplication of data has to be avoided at all costs. In traditional RDBMS systems, we tend to have landing, staging, and warehouse layers in the same database. With Amazon Redshift, the landing and staging layers have to live on S3. Only the data on which reports and analytics will be built should be loaded into Redshift, on an as-needed basis; you can’t keep the entire dataset in Redshift.
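As a rough illustration of this pattern, the sketch below builds a Redshift COPY statement that loads one curated dataset from an S3 staging prefix. The table name, bucket path, and IAM role ARN are hypothetical placeholders, and the data is assumed to be staged as Parquet; the helper only produces the SQL text and does not connect to a cluster.

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Return a COPY statement that loads Parquet files from S3 into a table.

    Illustrative only: values are interpolated directly, so inputs must be
    trusted (no user-supplied strings).
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS PARQUET;"
    )


# Hypothetical names: only the reporting-ready dataset is loaded,
# while landing and staging data stay on S3.
statement = build_copy_statement(
    table="reporting.daily_sales",
    s3_prefix="s3://example-bucket/curated/daily_sales/",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-load",
)
print(statement)
```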
No Index support in Amazon Redshift
Unlike many other data warehouse systems, Redshift doesn’t support indexes. Because it uses columnar storage, it performs best when you select only the columns you absolutely need. Instead of indexes, a construct called a distribution key needs to be used, which is simply a column whose values determine how rows are distributed across the nodes of the Redshift cluster.
Performance issues that require ongoing maintenance: VACUUM and ANALYZE, sort keys, compression, distribution styles, etc.
Tasks like VACUUM and ANALYZE need to be run regularly, and they are expensive and time-consuming. There’s no single frequency that suits every workload, so a quick cost-benefit analysis is required before deciding how often to run them.
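The idea behind a distribution key can be sketched in a few lines: hash the key column and use the result to pick a node. This is a simplified model, not Redshift's actual implementation; `zlib.crc32` is a stand-in hash, and the row fields and node count are made up for illustration.

```python
import zlib
from collections import defaultdict


def assign_node(key_value, num_nodes: int) -> int:
    """Map a distribution-key value to a node using a stable stand-in hash."""
    return zlib.crc32(str(key_value).encode("utf-8")) % num_nodes


def distribute(rows, dist_key: str, num_nodes: int) -> dict:
    """Group rows by the node their distribution-key value hashes to."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[assign_node(row[dist_key], num_nodes)].append(row)
    return dict(nodes)


# Hypothetical order rows: many distinct customers, a single status value.
rows = [{"order_id": i, "customer_id": i % 50, "status": "shipped"}
        for i in range(1000)]

by_customer = distribute(rows, "customer_id", num_nodes=4)
by_status = distribute(rows, "status", num_nodes=4)

print({node: len(r) for node, r in sorted(by_customer.items())})
print({node: len(r) for node, r in sorted(by_status.items())})
```

Rows that share a key value always land on the same node, which helps joins on that key. The second call shows the flip side: a low-cardinality key such as `status` sends every row to one node, leaving the others idle.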
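One way to frame that cost-benefit decision is to run maintenance only on tables that have drifted past a threshold, rather than on a fixed schedule for everything. The sketch below assumes you already have per-table figures for unsorted rows and statistics staleness (for example from the `unsorted` and `stats_off` columns of the SVV_TABLE_INFO system view); the thresholds and table names are hypothetical and would need tuning for a real workload.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    name: str
    unsorted_pct: float   # share of rows not in sort-key order, 0-100
    stats_off_pct: float  # staleness of planner statistics, 0-100


def maintenance_plan(tables, vacuum_threshold: float = 20.0,
                     analyze_threshold: float = 10.0) -> list:
    """Return the maintenance statements worth running, skipping healthy tables."""
    plan = []
    for t in tables:
        if t.unsorted_pct >= vacuum_threshold:
            plan.append(f"VACUUM {t.name};")
        if t.stats_off_pct >= analyze_threshold:
            plan.append(f"ANALYZE {t.name};")
    return plan


# Hypothetical snapshot of three tables.
snapshot = [
    TableStats("sales", unsorted_pct=35.0, stats_off_pct=5.0),
    TableStats("events", unsorted_pct=2.0, stats_off_pct=40.0),
    TableStats("customers", unsorted_pct=1.0, stats_off_pct=0.0),
]

for statement in maintenance_plan(snapshot):
    print(statement)
```

Raising the thresholds trades some query performance for fewer expensive maintenance runs, and lowering them does the opposite.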
Disk space capacity planning
Control over disk space is a must with Amazon Redshift, especially when you’re dealing with analytical workloads. There is a high chance that you will oversubscribe the system, and reduced disk space not only degrades query performance but also makes the cluster cost-prohibitive. Having a cluster filled above 75% isn’t good for performance.
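A simple capacity check against the 75% mark might look like the sketch below. The used and total figures are hypothetical inputs that you would take from your own monitoring, and the helper names are made up for illustration.

```python
def disk_usage_pct(used_gb: float, capacity_gb: float) -> float:
    """Percentage of cluster disk capacity currently in use."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return 100.0 * used_gb / capacity_gb


def headroom_gb(used_gb: float, capacity_gb: float,
                threshold_pct: float = 75.0) -> float:
    """GB that can still be added before crossing the threshold.

    A negative result means the cluster is already over the threshold.
    """
    return capacity_gb * threshold_pct / 100.0 - used_gb


# Hypothetical cluster: 1000 GB of capacity, 600 GB used.
used, capacity = 600.0, 1000.0
print(f"usage: {disk_usage_pct(used, capacity):.1f}%")
print(f"headroom before 75%: {headroom_gb(used, capacity):.1f} GB")
```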
Concurrent query limitation
Above 10 concurrent queries, you start seeing issues. Concurrency scaling may mitigate queue times during bursts of queries. However, simply enabling concurrency scaling didn’t fix all of our concurrency problems. The limited impact is likely due to the constraints on the types of queries that can use concurrency scaling. For example, we have a lot of tables with interleaved sort keys, and much of our workload is writes.
These are some of the fundamental problems to keep in mind while using Amazon Redshift. For more information about AWS Redshift query limitations, check out this article.
Comparing AWS Redshift?
See how the alternatives rank
AWS Redshift is a fully managed cloud data warehouse service with the ability to scale on demand. However, the pricing is not simple, since it tries to accommodate different use cases and customers.
At its heart, Redshift is an Amazon petabyte-scale data warehouse product that is based on PostgreSQL version 8.0.2.