M+E Daily

AWS: EMR Studio Enables Companies to Achieve Data Analytics at Petabyte Scale

EMR Studio is an integrated development environment (IDE) that makes it easy for data engineers and data scientists to develop, visualise and debug data engineering and data science applications written in PySpark, Python, R and Scala, using fully managed Jupyter notebooks, according to Kalyan Janaki, senior big data analytics consultant at Amazon Web Services (AWS).

“Amazon EMR is a way to run big data frameworks inside AWS,” he said on 22 August, during the AWS Online Tech Talks webinar “Enable Interactive Data Analytics at Petabyte Scale With EMR Studio.”

“Essentially, with a single click of a button, you can spin up a cluster and run it in AWS with these different frameworks installed on it,” he told viewers, noting: “This cluster can be one node, 10 nodes [or] thousands of nodes and it can scale up and scale down dynamically in response to your workloads.”

One important thing to keep in mind about EMR is that “we keep up to date with the latest open source frameworks generally within 30 to 60 days,” he said, adding: “With EMR runtime of Apache Spark or runtime of Presto you get performance benefits that help your workload sometimes run 2.5 times faster as compared to the open source versions. You can also get reduction in the cost by using EC2 Spot Instances and, of course, there is per-second billing so you can spin up the cluster and perform your workloads and spin down the cluster so you get charged [only] for the amount of resource used.”
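The per-second billing point can be illustrated with a little arithmetic. The sketch below is not an AWS pricing tool: the function names `cluster_cost` and `hourly_rounded_cost` are hypothetical, the hourly rate is a made-up placeholder, and the one-minute minimum charge is an assumption that should be checked against current EMR pricing.

```python
import math


def cluster_cost(node_count: int, runtime_seconds: int,
                 hourly_rate_per_node: float, minimum_seconds: int = 60) -> float:
    """Estimate the cost of a cluster that is billed per second.

    hourly_rate_per_node is a placeholder figure, not a real AWS price.
    minimum_seconds models an assumed minimum charge per run.
    """
    if node_count < 0 or runtime_seconds < 0:
        raise ValueError("node_count and runtime_seconds must be non-negative")
    if runtime_seconds == 0:
        return 0.0
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return node_count * billed_seconds / 3600 * hourly_rate_per_node


def hourly_rounded_cost(node_count: int, runtime_seconds: int,
                        hourly_rate_per_node: float) -> float:
    """Same workload if every started hour were billed in full (for comparison)."""
    billed_hours = math.ceil(runtime_seconds / 3600)
    return node_count * billed_hours * hourly_rate_per_node


# A 10-node cluster that runs for 30 minutes at a hypothetical 0.25 per node-hour.
per_second = cluster_cost(10, 1800, 0.25)
per_hour = hourly_rounded_cost(10, 1800, 0.25)
print(per_second, per_hour)
```

Under these assumed numbers the half-hour job costs half as much with per-second billing as it would if each started hour were charged in full, which is the saving from spinning the cluster down as soon as the work finishes.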

EMR Studio “provides fully managed scaling and, without any configurations from the administrators, the cluster can scale up and scale down based upon your Spark and other Apache workloads,” he explained.
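For readers curious what managed scaling looks like in practice: EMR decides when to resize the cluster, within lower and upper capacity bounds supplied as a policy. The sketch below builds the kind of policy document that the boto3 EMR client's `put_managed_scaling_policy` call accepts. The helper name `managed_scaling_policy`, the capacity limits and the cluster ID in the comment are illustrative assumptions, not values from the talk.

```python
VALID_UNIT_TYPES = {"Instances", "InstanceFleetUnits", "VCPU"}


def managed_scaling_policy(min_units: int, max_units: int,
                           unit_type: str = "Instances") -> dict:
    """Build a managed scaling policy with lower and upper capacity bounds."""
    if unit_type not in VALID_UNIT_TYPES:
        raise ValueError(f"unsupported unit type: {unit_type}")
    if not 1 <= min_units <= max_units:
        raise ValueError("need 1 <= min_units <= max_units")
    return {
        "ComputeLimits": {
            "UnitType": unit_type,
            "MinimumCapacityUnits": min_units,
            "MaximumCapacityUnits": max_units,
        }
    }


# Hypothetical bounds: never fewer than 2 instances, never more than 50.
policy = managed_scaling_policy(2, 50)

# With AWS credentials configured, the policy could be attached like this
# (the cluster ID is a placeholder):
#   import boto3
#   boto3.client("emr").put_managed_scaling_policy(
#       ClusterId="j-XXXXXXXXXXXXX", ManagedScalingPolicy=policy)
```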

EMR also allows users to choose different types of Amazon Elastic Compute Cloud (EC2) instances, he said, adding that users can choose compute-optimised instances for their compute-intensive jobs and “then you can choose memory-optimised instances for Spark jobs and also you have an option to choose GPU instances” for machine learning (ML) workloads.

He went on to discuss different EMR deployment options and noted that Amazon previously announced it was running Kubernetes on AWS. “Now you can run Spark jobs using EMR Runtime directly on top” of Amazon Elastic Kubernetes Service (EKS), he said.

Last year, he noted, AWS announced EMR Serverless, where users can run Spark and Hive jobs on a “completely serverless EMR cluster where you don’t have to manage any hardware provisioning and EMR takes care of that.”

His advice for viewers: “If your organisation needs the ability to allocate the cost per cluster, then EMR on EC2 would be the best option, but if you are looking into cost per application, then EMR on EKS” would be best and, “if you are looking into cost per job, then EMR Serverless might be [a] suitable option.”
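That rule of thumb amounts to a three-way lookup from cost-allocation granularity to deployment option. The sketch below only restates the advice as quoted; the names `CostGranularity` and `suggest_deployment` are hypothetical, and a real decision would weigh more than cost allocation.

```python
from enum import Enum


class CostGranularity(Enum):
    """The level at which an organisation wants to allocate EMR costs."""
    PER_CLUSTER = "cluster"
    PER_APPLICATION = "application"
    PER_JOB = "job"


# Mapping taken from the speaker's advice; nothing here queries AWS.
DEPLOYMENT_FOR = {
    CostGranularity.PER_CLUSTER: "EMR on EC2",
    CostGranularity.PER_APPLICATION: "EMR on EKS",
    CostGranularity.PER_JOB: "EMR Serverless",
}


def suggest_deployment(granularity: CostGranularity) -> str:
    """Return the deployment option suggested for the given cost granularity."""
    return DEPLOYMENT_FOR[granularity]


for level in CostGranularity:
    print(f"cost per {level.value}: {suggest_deployment(level)}")
```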

Regardless, he went on to say: “All the deployment options get the same data plane bits. That means they all get the same distribution and they all get the same benefits of EMR optimised runtimes.”

He went on to provide a virtual demo of EMR Studio.