Advice: Want to test out Presto outside of AWS and EMR? Learn how you can run Presto as the SQL engine for your Open Data Lakehouse with the forever-free Ahana Community Edition.
Why am I getting a Presto EMR S3 timeout error?
If you are using Presto on AWS EMR and want to understand this error better, a good place to start your investigation is the S3 Select Pushdown feature. It pushes projection operations (e.g. SELECT) and predicate operations (e.g. WHERE) down to S3. Pushdown makes queries noticeably faster because they retrieve only the required data from S3, so there is less to slow down the SQL query. It also reduces the amount of data transferred between EMR Presto and S3.
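To make the data-transfer argument concrete, here is a toy sketch in Python. It does not call S3 or Presto at all: the in-memory CSV string, the column names, and both function names are hypothetical stand-ins, and "transferred bytes" is simply the size of whatever would have to leave the storage side.

```python
import csv
import io

# Hypothetical in-memory "object" standing in for a CSV file stored in S3.
OBJECT_CSV = (
    "id,region,amount\n"
    "1,us-east-1,10\n"
    "2,eu-west-1,25\n"
    "3,us-east-1,40\n"
    "4,ap-south-1,5\n"
)


def scan_without_pushdown(raw):
    """Transfer the whole object, then project and filter on the engine side."""
    transferred = len(raw.encode("utf-8"))  # every byte leaves storage
    rows = csv.DictReader(io.StringIO(raw))
    result = [r["amount"] for r in rows if r["region"] == "us-east-1"]
    return result, transferred


def scan_with_pushdown(raw):
    """Project and filter on the storage side; transfer only matching values."""
    rows = csv.DictReader(io.StringIO(raw))
    selected = [r["amount"] for r in rows if r["region"] == "us-east-1"]
    payload = "\n".join(selected) + "\n"  # only this payload leaves storage
    transferred = len(payload.encode("utf-8"))
    return selected, transferred


full_result, full_bytes = scan_without_pushdown(OBJECT_CSV)
pushed_result, pushed_bytes = scan_with_pushdown(OBJECT_CSV)
print("same rows:", full_result == pushed_result)
print("bytes without pushdown:", full_bytes)
print("bytes with pushdown:", pushed_bytes)
```

Both scans return the same rows, but the pushdown variant moves only the selected column for the matching rows, which is the effect described above.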
If you are using pushdown with EMR Presto and seeing a timeout error, there are a few possible reasons. Because Presto uses EMRFS as its file system, there is a maximum number of client connections to S3 through EMRFS for Presto (500). When you use S3 Select Pushdown, you bypass EMRFS when accessing S3 for predicate operations, so the value of hive.s3select-pushdown.max-connections determines the maximum number of client connections that worker nodes may open for those operations. Requests that aren’t pushed down still go through EMRFS and remain subject to the EMRFS connection limit.
At this point you might get an error that says “timeout waiting for connection from pool”. This means the connection pool has been exhausted, and the fix is to raise both connection limits described above: the one for pushed-down requests and the one for requests that go through EMRFS. Once both are increased, the timeout should stop occurring.
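As a rough illustration, the Python sketch below builds an EMR configuration list that raises both limits. Several things here are assumptions rather than details from this article: the EMRFS property name fs.s3.maxConnections and the classifications presto-connector-hive and emrfs-site are taken from AWS EMR documentation, the helper name build_emr_configurations is hypothetical, and the value 1000 is only a placeholder that you would need to tune for your own cluster.

```python
import json


def build_emr_configurations(pushdown_max_connections, emrfs_max_connections):
    """Return an EMR configuration list that raises both S3 connection limits.

    Assumption: classification and property names follow AWS EMR documentation;
    verify them against the EMR release you are running.
    """
    return [
        {
            # Limit for requests that S3 Select Pushdown sends directly to S3.
            "Classification": "presto-connector-hive",
            "Properties": {
                "hive.s3select-pushdown.enabled": "true",
                "hive.s3select-pushdown.max-connections": str(
                    pushdown_max_connections
                ),
            },
        },
        {
            # Limit for requests that still go through EMRFS (assumed name).
            "Classification": "emrfs-site",
            "Properties": {
                "fs.s3.maxConnections": str(emrfs_max_connections),
            },
        },
    ]


# Placeholder values, not recommendations.
configurations = build_emr_configurations(1000, 1000)
print(json.dumps(configurations, indent=2))
```

The printed JSON could be supplied as the cluster's software configuration when it is created. Raising only one of the two limits may leave the other pool as the bottleneck, which is why the sketch sets both.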
Errors like these are common with Presto on EMR, as reported by EMR users. EMR is a complex and resource-intensive option, so there is a lot you have to understand and be prepared for when it comes to the specific configuration and tuning parameters for Hadoop. Because of this complexity, many companies have switched from EMR Presto to Ahana Cloud, a managed service for Presto on AWS, to save time and reduce operational work. This path is much easier for the user and takes less effort to get up and running. Ahana Cloud is a non-Hadoop deployment in a fully managed environment. Users see up to 23x performance with Ahana’s built-in caching, and they have also reported a 5.5x increase in price performance. They are achieving their goals with the same or better performance at a fraction of the cost.
Additional EMR Presto Resources
Check out some of the differences between Presto EMR and Ahana Cloud. If you’re using EMR Presto today, Ahana Cloud might help with some of those pain points. Additionally, Ahana uses pay-as-you-go pricing, and it’s easy to get started if you’re already an EMR user.
Want to take a deeper dive into Presto? Learn what PrestoDB is, how it started, and the benefits users report.
AWS Lake Formation helps users build, manage, and secure their data lakes in days instead of the months that are common with a traditional data lake approach.
In this article we’ll look at the contextual requirements of a data warehouse, namely its five components. We’ll break down the pros and cons and compare it to a data lake.