RStudio is an integrated development environment (IDE) for R, a language and environment for statistical computing and graphics. As a data scientist, you may integrate R and Spark (a big data processing framework) to analyze large datasets. You can use an R package called sparklyr to offload filtering and aggregation of large datasets from your R script to Spark and use R’s native strength to further analyze and visualize the results from Spark.
An R script running in RStudio uses sparklyr to submit Spark jobs to the cluster. Typically, an R script (along with sparklyr) runs in an RStudio environment that is installed on a machine that’s separate from the cluster of machines (in Amazon EMR) that runs Spark. To enable sparklyr to submit Spark jobs, you need to establish network connectivity between the RStudio machine and the cluster running Spark. One way to do that is to run RStudio on an edge node, which is a machine that is part of the cluster’s private network and runs client applications like RStudio. Edge nodes let you run client applications separately from the nodes that run the core Hadoop services. Edge nodes also offer convenient access to local Spark and Hive shells.
However, edge nodes are not easy to deploy. They must have the same versions of Hadoop, Spark, Java, and other tools as the Hadoop cluster, and require the same Hadoop configuration as nodes in the cluster.
This post demonstrates an automated way to create an edge node with RStudio installed using AWS Systems Manager.
Deploying an edge node for an EMR cluster
One method to deploy an edge node involves creating an Amazon EC2 AMI directly from the EMR master node. For more information, see Launch an edge node for Amazon EMR to run RStudio. This post offers an SSM automation document that simplifies on-demand edge node deployment. Systems Manager gives you visibility and control of your AWS infrastructure, and Systems Manager Automation lets you safely automate common and repetitive tasks, like creating edge nodes on demand.
This post walks you through installing the SSM document and using it to create an edge node. For more information about the code, see the GitHub repo.
Creating the automation document
First, you use Terraform to create the automation document. You can download Terraform from the Terraform website. Alternatively, AWS CloudFormation works equally well.
After you install Terraform, go to the directory where you cloned the repo and edit the file vars.tf. For more information, see the GitHub repo. This file defines several input parameters, and the comments in the file should be self-explanatory. You can provide default values in vars.tf or override using one of the other supported techniques. For more information, see Input Variables on the Terraform website.
Next, enter the following code:
The code runs a Terraform plan to create the document. Your environment should already be configured to access your AWS account with privileges to do the following:
- Create SSM documents. For more information, see Create Non-Admin IAM Users and Groups for Systems Manager.
- Create IAM roles and policies. For more information, see Actions, Resources, and Condition Keys for Identity And Access Management.
- Upload an object to an Amazon S3 bucket. For more information, see Specifying Permissions in a Policy.
To make updates to the Terraform plan going forward, use Terraform’s shared state feature. For more information, see Remote State on the Terraform website.
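The usual Terraform workflow is init, plan, then apply. As a rough sketch only (not the commands from the repo), the following Python helper builds those three standard CLI invocations and can run them in the cloned directory. The variable name region and the plan file name tfplan are hypothetical; use the variable names actually defined in vars.tf.

```python
import shlex
import subprocess
from typing import Dict, List, Optional


def terraform_commands(variables: Optional[Dict[str, str]] = None,
                       plan_file: str = "tfplan") -> List[List[str]]:
    """Build the init/plan/apply command lines as argument lists.

    `variables` overrides defaults from vars.tf via repeated -var flags.
    The names passed here are assumptions; they must match vars.tf.
    """
    var_args: List[str] = []
    for name, value in sorted((variables or {}).items()):
        var_args += ["-var", f"{name}={value}"]
    return [
        ["terraform", "init", "-input=false"],
        ["terraform", "plan", "-input=false", f"-out={plan_file}", *var_args],
        # Applying a saved plan file runs exactly what was planned.
        ["terraform", "apply", "-input=false", plan_file],
    ]


def run_terraform(workdir: str,
                  variables: Optional[Dict[str, str]] = None) -> None:
    """Run the commands in order inside `workdir`, stopping on the first failure."""
    for cmd in terraform_commands(variables):
        print("+", shlex.join(cmd))
        subprocess.run(cmd, cwd=workdir, check=True)


# Example (hypothetical path and variable name):
# run_terraform("/path/to/cloned/repo", {"region": "us-east-1"})
```

Saving the plan to a file and applying that file keeps the apply step from picking up changes made after you reviewed the plan.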
The Terraform plan loads your automation document from a local template file and registers it with Systems Manager. See the following code:
The rest of the Terraform plan does the following:
- Uploads an Ansible template for the SSM document to use
- Sets up IAM roles and policies that let Systems Manager and a new edge node assume the correct privileges
What’s in the automation document?
The automation document has three main steps. First, it creates and launches a new AMI from the existing EMR master node. See the following code:
Next, it updates the SSM agent and runs an Ansible playbook to install RStudio. You can examine the Ansible playbook in GitHub; it installs RStudio and dependencies and handles some initial configuration. See the following code:
Finally, it adds an Amazon CloudWatch alarm to trigger EC2 instance recovery if the edge node fails. See the following code:
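As an illustration only (this is not the alarm definition from the automation document), an EC2 recovery alarm typically watches the StatusCheckFailed_System metric and names the EC2 recover action as its alarm action. The Python sketch below builds the arguments for such an alarm; the alarm name, period, evaluation periods, and threshold are assumed values, not taken from the post.

```python
from typing import Any, Dict


def recovery_alarm_params(instance_id: str, region: str) -> Dict[str, Any]:
    """Build keyword arguments for a CloudWatch alarm that recovers an instance.

    The timing and threshold values are illustrative assumptions.
    """
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,                # seconds per evaluation window
        "EvaluationPeriods": 2,      # consecutive failing windows before alarming
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in action ARN that asks EC2 to recover the instance.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


# With boto3 installed and credentials configured, the dictionary could be
# passed to CloudWatch like this:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **recovery_alarm_params("i-0123456789abcdef0", "us-east-1"))
```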
Using the automation document
To start using the automation document, complete the following steps:
- On the Systems Manager console, choose Automation.
- Choose Execute automation.
- On the Owned by me tab, choose the document create_edge_node.
- Choose Next.
On the next page, you need to fill in three pieces of information. You may want to get some advice from your cloud operations team, or whoever manages your EMR clusters. For instructions on creating a cluster with the latest EMR version and Spark, see Launch Your Sample Amazon EMR Cluster.
- In the Input parameters section, provide the following information:
- For MasterNodeId, enter the EC2 instance ID of the master node of the EMR cluster you want to connect to. In most cases, your operations team can provide this information, but you can also find the instance ID by going to the Hardware tab of your EMR cluster and drilling into the master node group. Your EMR cluster must have Spark installed because you want to use sparklyr with RStudio. The following screenshot shows where to find your EC2 instance ID on the Hardware tab.
- For SubnetId, enter the subnet that the edge node should live in. Your operations team should provide this information, or you can see it on the Summary tab of the EMR cluster. The edge node must live in the same VPC as the cluster. It does not need to be in a public subnet because you connect via Session Manager. The following screenshot shows where to find your subnet ID on the Summary tab of your cluster.
- For QuickIdentifier, enter a user-friendly name to help you remember this edge node; for example, Edge Node with RStudio.
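The same three inputs can also be supplied programmatically. The sketch below, which is an alternative to the console steps and not part of the original walkthrough, builds the request for the Systems Manager StartAutomationExecution API. The document name and parameter names come from the steps above; the ID prefix checks are a simple sanity check added here as an assumption.

```python
from typing import Any, Dict


def automation_request(master_node_id: str, subnet_id: str,
                       quick_identifier: str,
                       document_name: str = "create_edge_node") -> Dict[str, Any]:
    """Build keyword arguments for an SSM automation execution.

    Automation parameters are passed as lists of strings, even for
    single values.
    """
    if not master_node_id.startswith("i-"):
        raise ValueError("MasterNodeId should be an EC2 instance ID (i-...)")
    if not subnet_id.startswith("subnet-"):
        raise ValueError("SubnetId should be a subnet ID (subnet-...)")
    return {
        "DocumentName": document_name,
        "Parameters": {
            "MasterNodeId": [master_node_id],
            "SubnetId": [subnet_id],
            "QuickIdentifier": [quick_identifier],
        },
    }


# With boto3 installed and credentials configured:
# import boto3
# ssm = boto3.client("ssm")
# response = ssm.start_automation_execution(
#     **automation_request("i-0123456789abcdef0", "subnet-0abc1234",
#                          "Edge Node with RStudio"))
# print(response["AutomationExecutionId"])
```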
When the execution is finished, you will see the completed steps, as in the following screenshot.
If you choose the last step in the list (step 8), you see the DNS name and EC2 instance ID for your new edge node. See the following screenshot.
You can now connect to this edge node by using another feature of Systems Manager: Session Manager. Session Manager lets you open an SSH tunnel for port forwarding without having to use SSH keys or expose the SSH port to the internet. For instructions on opening a port forwarding session, see Starting a Session (Port Forwarding). You need the Session Manager plugin installed locally. See the following code:
For more information, see Install the Session Manager Plugin for the AWS CLI.
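As a sketch of what a port forwarding session looks like from the command line, the helper below assembles an aws ssm start-session call that uses the AWS-StartPortForwardingSession document to forward RStudio's port 8787 to the same port on your machine. The instance ID shown is a placeholder; use the edge node ID reported by the automation execution. It assumes the AWS CLI and the Session Manager plugin are installed locally.

```python
import json
import shlex
from typing import List


def port_forward_command(instance_id: str,
                         remote_port: int = 8787,
                         local_port: int = 8787) -> List[str]:
    """Build the AWS CLI command that opens a port forwarding session."""
    # The session document expects each parameter as a list of strings.
    params = {
        "portNumber": [str(remote_port)],
        "localPortNumber": [str(local_port)],
    }
    return [
        "aws", "ssm", "start-session",
        "--target", instance_id,
        "--document-name", "AWS-StartPortForwardingSession",
        "--parameters", json.dumps(params),
    ]


# Print a copy-and-paste version for a placeholder instance ID.
print(shlex.join(port_forward_command("i-0123456789abcdef0")))
```

The session stays open in the foreground; while it runs, traffic to the local port is forwarded to the edge node.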
You can now access RStudio at http://localhost:8787. See the following screenshot.
You can also access the node directly and use the local Hive and Spark shells through the Session Manager console.
This post sets up the SSM document to create single-user edge nodes. The default user name to log in to RStudio is ruser. You must set the password by changing the password for the ruser account directly in the operating system, because RStudio uses PAM authentication by default. For more information, see What is my username on my RStudio Server? To change the password, open another Session Manager session and enter the following code:
You should keep any valuable files like R scripts in a GitHub repo and store any output data in an S3 bucket for long-term persistence.
Configuring security
The Terraform plan sets up three important IAM roles:
- A role that Systems Manager assumes when running the automation document. This role needs to perform actions in Amazon EC2 (such as creating new AMIs), CloudWatch, and Systems Manager.
- An EC2 instance profile for the edge nodes. The profile has the permissions necessary for the SSM agent to run and for the edge node to perform tasks typical of an EMR node.
- A role for CloudWatch to perform instance recovery.
In your environment, you may want to review the IAM roles and policies and tighten their scope based on tags or other conditions.
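One way to tighten scope by tag is an IAM condition on the ec2:ResourceTag condition key. The sketch below builds such a policy document in Python; it is not one of the policies created by the Terraform plan. The tag key Purpose, the tag value edge-node, and the chosen actions are hypothetical examples.

```python
import json
from typing import Any, Dict, Iterable


def tag_scoped_policy(actions: Iterable[str],
                      tag_key: str, tag_value: str) -> Dict[str, Any]:
    """Build an IAM policy allowing `actions` only on EC2 resources
    that carry the given tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": "*",
                # Only resources tagged tag_key=tag_value match.
                "Condition": {
                    "StringEquals": {f"ec2:ResourceTag/{tag_key}": tag_value}
                },
            }
        ],
    }


policy = tag_scoped_policy(
    ["ec2:StopInstances", "ec2:RebootInstances", "ec2:TerminateInstances"],
    tag_key="Purpose",        # hypothetical tag key
    tag_value="edge-node",    # hypothetical tag value
)
print(json.dumps(policy, indent=2))
```

Not every EC2 action supports resource-level conditions, so check each action against the IAM documentation before relying on a tag condition.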
Conclusion
This post described an automated way to deploy an EMR edge node with RStudio using an SSM document. EMR edge nodes with RStudio give you a familiar working environment with access to large datasets via Spark and sparklyr. For information about deploying a new edge node and installing the necessary Hadoop libraries with an AWS CloudFormation template, see Launch an edge node for Amazon EMR to run RStudio.
About the Author
Randy DeFauw is a principal solutions architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on database projects, helping them improve the value of their solutions when using AWS.