Why am I getting a Presto EMR S3 timeout error?
If you’re using AWS EMR Presto, you can use the S3 Select Pushdown feature to push down projection operations (for example, SELECT) and predicate operations (for example, WHERE) to S3. Pushdown makes queries much faster because they retrieve only the required data from S3, which also reduces the amount of data transferred between EMR Presto and S3.
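To make the effect concrete, here is a toy Python model of pushdown. It does not call Presto or S3; the sample rows, column names, and both helper functions are invented for illustration. It only compares how many values leave "storage" when filtering and projection happen at the engine versus at the storage side.

```python
# Toy model only: plain Python lists stand in for an S3 object.
ROWS = [
    {"id": 1, "region": "us", "amount": 10},
    {"id": 2, "region": "eu", "amount": 20},
    {"id": 3, "region": "us", "amount": 30},
    {"id": 4, "region": "ap", "amount": 40},
]


def scan_without_pushdown(rows, columns, predicate):
    """Ship every row and column to the engine, then filter and project there."""
    transferred = [dict(r) for r in rows]
    result = [{c: r[c] for c in columns} for r in transferred if predicate(r)]
    values_transferred = sum(len(r) for r in transferred)
    return result, values_transferred


def scan_with_pushdown(rows, columns, predicate):
    """Filter (WHERE) and project (SELECT) at the storage side, ship only the result."""
    transferred = [{c: r[c] for c in columns} for r in rows if predicate(r)]
    values_transferred = sum(len(r) for r in transferred)
    return transferred, values_transferred


def is_us(row):
    """Hypothetical predicate, equivalent to WHERE region = 'us'."""
    return row["region"] == "us"
```

Calling both functions with columns ["amount"] and the is_us predicate returns the same result rows, but the pushdown version moves far fewer values out of storage.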
If you’re using pushdown for EMR Presto and seeing a timeout error, connection limits are a likely cause. Because Presto uses EMRFS as its file system, the maximum allowable number of client connections to S3 through EMRFS for Presto is 500. When using S3 Select Pushdown, you bypass EMRFS when you access S3 for predicate operations, so the value of hive.s3select-pushdown.max-connections determines the maximum number of client connections allowed for those operations from worker nodes. Requests that aren’t pushed down still go through EMRFS and are subject to its connection limit.
When either connection limit is exhausted, you might get an error that says “timeout waiting for connection from pool”. Increasing both of the values above should help solve the problem.
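As a rough sketch of what raising the pushdown limit could look like, the Python snippet below builds an EMR-style configuration object for the hive.s3select-pushdown.max-connections property named above. The "presto-connector-hive" classification name, the function name, and the value 1000 are assumptions for illustration, so check them against your EMR release before using them. The sketch covers only the pushdown setting.

```python
import json


def build_pushdown_config(max_connections):
    """Build an EMR-style configuration list that sets the pushdown connection limit.

    The classification name below is an assumption, not taken from the article.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be a positive integer")
    return [
        {
            "Classification": "presto-connector-hive",  # assumed classification name
            "Properties": {
                # EMR configuration property values are passed as strings.
                "hive.s3select-pushdown.max-connections": str(max_connections),
            },
        }
    ]


def to_json(config):
    """Serialize the configuration so it can be saved to a file for cluster setup."""
    return json.dumps(config, indent=2)
```

For example, to_json(build_pushdown_config(1000)) produces JSON you could supply as cluster configuration when launching or reconfiguring the cluster. The right value depends on your workload, so 1000 here is only a placeholder.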
Errors like these are common with Presto EMR. EMR is complex and resource-intensive, and there’s a lot you have to understand when it comes to the specific config and tuning parameters for Hadoop. Many companies have switched from EMR Presto to Ahana Cloud, a managed service for Presto on AWS that is much easier to use. Ahana Cloud is a non-Hadoop deployment in a fully managed environment. Users see up to 23x performance with Ahana’s built-in caching.
Check out some of the differences between Presto EMR and Ahana Cloud. If you’re using EMR Presto today, Ahana Cloud might help with some of those pain points. Additionally, Ahana offers pay-as-you-go pricing, and it’s easy to get started if you’re already an EMR user.