How to test Amazon EMR Serverless jobs locally using Docker?
Amazon EMR Serverless offers a managed, serverless platform for big data processing. In this article, we will see the advantages of this service over classical EMR clusters and Glue jobs, why you would need to run an EMR Serverless job locally, and finally, how to do it.
You will find a working demo of this article in this GitHub repository.
Why use EMR Serverless?
When you want to execute Spark jobs in AWS using a service that is at least somewhat managed, you do not have many choices.
Historically, you start with EMR, the grandpa. But you have to manage the underlying infrastructure yourself, environment management is hard to industrialize (bootstrap actions, Python versions, ad-hoc dependencies for jobs, etc.), and log management is an overly complicated mess (though it is still one of my favourite technical interview questions to ask on this subject).
If you do not want to maintain EC2 instances for your use case and want to go serverless, the go-to solution is Glue jobs. Log management is way simpler than EMR’s, which is a great advantage. But you quickly realize that as your data context scales, Glue jobs can turn out to be expensive (as you cannot scale memory and vCPUs separately), and environment management is better, but still clunky.
For example, you cannot install C-compiled libraries on your system and package them in a zip to be included in your job, as you can for Lambda functions. You have to specify them using the --additional-python-modules parameter (which works, sort of, but makes it difficult to keep your job’s dependencies coherent).
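To make the contrast concrete, here is roughly what that looks like when creating a Glue job from the AWS CLI. This is only an illustrative sketch: the job name, role ARN, script location, and package versions are placeholders, not values from a real project.
# hypothetical Glue job: extra Python packages can only be declared through the
# --additional-python-modules default argument, installed with pip when the job starts
aws glue create-job \
  --name my-glue-job \
  --role arn:aws:iam::123456789012:role/MyGlueJobRole \
  --command '{"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/job.py", "PythonVersion": "3"}' \
  --default-arguments '{"--additional-python-modules": "pandas==2.1.4,pyarrow==14.0.1"}'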
EMR Serverless presents the advantage of being fully serverless (who could have guessed from the name?) while offering more granular resource scaling than Glue jobs. It also presents a huge advantage: you can dockerize your job. This is a game changer, as you can now build your job execution environment coherently while controlling precisely what is installed and in which version. I could also mention the interface of EMR Serverless Studio, which is an absolute pleasure to use, both in the build phase and the run phase.
Why Run EMR Serverless Locally?
When you build Spark jobs to be deployed in a cloud environment, especially a serverless one, you often realize that, if not done well, it can be a painful process (this is also true for non-Spark applications, though).
When you build your code, you have two choices to test it.
You can test it locally in a Jupyter Notebook (even using Spark, as a previous article of ours addressed). Still, you have no guarantee that your local compute environment matches the AWS one, so you expose yourself to the risk of having to start over because of a dependency mismatch, for example.
You can test it remotely by deploying the code to AWS and running the job with the code to be tested. This has the advantage of removing the risk of dependency mismatch, as you test your code directly in an environment identical to the production one. But then arises the problem of the cold start (which is significant for EMR Serverless: at least 3 or 4 minutes to start the application and get a job running) and, more importantly, the problem of cost. Indeed, leaving an EMR Serverless application constantly running while you iterate over your code can bring bad surprises at the end of the month to your FinOps team (and if you don’t have one, a nasty surprise to your CFO at the end of the year).
Local execution using Docker to the rescue:
- Consistent environment: you have hard guarantees that your local development environment matches your AWS compute environment.
- Faster iteration: test configurations, code, and dependencies without deploying to AWS and without waiting for the EMR Serverless cold start.
- Reduced costs: avoid the cost of deploying incomplete jobs to the cloud and iterating on often costly AWS resources.
It is important to note that the point of executing your PySpark code locally is not to test your processing job at scale: you would probably lack sufficient resources to do so (and if you don’t, you may not need Spark for this task). The point is to test your code on a data subset and, once you have ensured that it is functional, deploy and test it at scale in your AWS environment.
How do you run an EMR Serverless job locally using Docker?
Prerequisites
- Docker Installed: Download and install Docker from docker.com (tested with 27.4.0).
- EMR Serverless Application: A configured EMR Serverless application ready for deployment, with the relevant IAM role attached (tested with EMR 7.1.0); a minimal creation command is sketched below.
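If you do not have an application yet, creating a minimal one with the AWS CLI could look like the following sketch (the application name is a placeholder; keep the release label and architecture consistent with the Docker base image used later):
# minimal EMR Serverless Spark application, matching the EMR release and architecture used below
aws emr-serverless create-application \
  --name my-local-test-application \
  --type SPARK \
  --release-label emr-7.1.0 \
  --architecture X86_64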
Step 1: Define the Application Code
For this example, let’s assume you’re using a PySpark job to process a dataset. Save the following Python script as main.py:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# test a custom dependency that ships a C-compiled library
import pandas


def main():
    spark_session = (
        SparkSession.builder.enableHiveSupport()
        # use the Hive catalog implementation to work with the AWS Glue Data Catalog
        .config("spark.sql.catalogImplementation", "hive")
        .getOrCreate()
    )

    # create a dummy Spark dataframe
    data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
    columns = ["Name", "Age"]
    df = spark_session.createDataFrame(data, columns)

    database_name = "test_database"
    spark_session.sql(f"CREATE DATABASE IF NOT EXISTS {database_name}")

    s3_bucket_name = "CHANGE_BUCKET_NAME"

    # write the dataframe to AWS (S3 + Glue Data Catalog)
    df.write.mode("overwrite").option(
        "path", f"s3a://{s3_bucket_name}/{database_name}/my_table/"
    ).saveAsTable(f"{database_name}.my_table")

    # query the table we just wrote
    spark_session.sql(f"SELECT * FROM {database_name}.my_table").show()


# script entrypoint
if __name__ == "__main__":
    main()
Step 2: Create a Custom Dockerfile
As described in the AWS documentation, we use the AWS EMR Serverless Docker image as the base for our custom image. This Dockerfile is not to be considered optimized; it is only used as a showcase example.
# syntax=docker/dockerfile:1
FROM --platform=linux/amd64 public.ecr.aws/emr-serverless/spark/emr-7.1.0:20240823
# https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/application-custom-image.html
USER root
# MODIFICATIONS GO HERE
# install python 3.10 as the default python version in the base image is a bit outdated
RUN yum install -y gcc openssl-devel bzip2-devel sqlite-devel libffi-devel tar gzip wget make zlib-static && \
yum clean all && \
wget https://www.python.org/ftp/python/3.10.15/Python-3.10.15.tgz && \
tar xzf Python-3.10.15.tgz && cd Python-3.10.15 && \
./configure --enable-optimizations --enable-loadable-sqlite-extensions && \
make altinstall && \
ln -sf /usr/local/bin/python3.10 /usr/bin/python3 && \
ln -sf /usr/bin/pip3 /usr/bin/pip
# set a custom work path in the system
# If you don't, you will get a Java AccessDeniedException when the Spark executor instances
# try to copy the Iceberg jar file to the /usr/app/src directory
ENV HOME="/usr/app/src"
RUN mkdir -p $HOME && chown -R hadoop:hadoop $HOME
WORKDIR $HOME
ENV PYTHONPATH="${PYTHONPATH}:$HOME"
# setup jupyter notebook for local execution
RUN pip install jupyter notebook
# The following line creates files required by the base Docker image, which are normally mounted
# by the EMR Serverless service when executing in AWS. We create empty files because we want to
# run this image locally, and these scripts do not need to do anything in that context
RUN mkdir -p /var/loggingConfiguration/spark/ && touch /var/loggingConfiguration/spark/run-fluentd-spark.sh && touch /var/loggingConfiguration/spark/run-adot-collector.sh
# install custom dependencies with c compiled library
RUN pip install pandas
# end of modifications
# EMRS will run the image as hadoop
USER hadoop:hadoop
Step 3: Build and Run the Docker Image
1. Build the Image: Run the following command to build the Docker image:
docker buildx build . -t emr_serverless_local_image:1 --provenance=false
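Optionally, you can run a quick sanity check on the freshly built image. This is only a sketch and assumes the base image simply executes the command you pass to it, as the pyspark command in the next step does:
# check that the custom Python version and the pandas dependency made it into the image
docker run --rm emr_serverless_local_image:1 python3 -c "import sys, pandas; print(sys.version, pandas.__version__)"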
2. Run the Container: Execute the container to test your PySpark job
mkdir logs # create the directory that will contain the Spark logs
# AWS authentication using profiles configured in your credentials file does not work in the
# EMR Serverless image; you can only use environment variables to authenticate to AWS
export CREDENTIALS=$(aws configure export-credentials)
docker run \
-e AWS_ACCESS_KEY_ID=$(echo $CREDENTIALS | jq -r '.Credentials.AccessKeyId') \
-e AWS_SECRET_ACCESS_KEY=$(echo $CREDENTIALS | jq -r '.Credentials.SecretAccessKey') \
-e AWS_SESSION_TOKEN=$(echo $CREDENTIALS | jq -r '.Credentials.SessionToken // ""') \
-e AWS_REGION="eu-west-1" \
-e AWS_DEFAULT_REGION="eu-west-1" \
--mount type=bind,source=$(pwd)/logs,target=/var/log/spark/user/ \
-v $(pwd)/main.py:/usr/app/src/main.py:rw \
-p 8888:8888 \
-e PYSPARK_DRIVER_PYTHON=jupyter \
-e PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip="0.0.0.0" --no-browser' \
emr_serverless_local_image:1 \
pyspark --master local \
--conf spark.hadoop.fs.s3a.endpoint=s3.eu-west-1.amazonaws.com \
--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
You should see debug output in your terminal, and the process should hang (if it exits instead, you will find the error, or at least more context, in the $HOME/logs/stderr file).
Now open the $HOME/logs/stderr file: you will find the URL and the token allowing you to connect to the Jupyter notebook instance running in your Docker container (you are looking for something like “http://127.0.0.1:8888/tree?token=[TOKEN]”). Copy and paste this link into your browser, and create a new notebook.
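If you prefer not to scan the file by eye, you can extract the URL directly; this assumes the stderr file lands under the logs/ directory mounted above:
# pull the Jupyter URL (including the token) out of the Spark driver logs
grep -rho 'http://127.0.0.1:8888/[^ ]*' logs/ | head -n 1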
3. Test your code: Test your main.py directly from the Jupyter notebook instance
Note that in the docker run command, I mounted a main.py file into your Docker container’s $HOME directory. This lets you change your main.py code either inside or outside your Docker container while keeping both versions synchronized, so you can iterate without having to rebuild and re-run your Docker container at each code change.
Now you can create a cell and run your main function from the main.py file:
from main import main
main()
From now on, you can modify your code either inside or outside your Docker container and iterate locally, which will be faster and cheaper than testing every change in AWS. Keep in mind that Python caches imported modules, so re-run the import (or restart the notebook kernel) to pick up your changes.
Step 4: Deploy your Docker image to ECR and link it to EMR Serverless
Once you have validated your code, you can push the Docker image to your ECR repository.
# login to the ECR
aws ecr get-login-password --region ${aws_region_name} | docker login -u AWS ${aws_account_id}.dkr.ecr.${aws_region_name}.amazonaws.com --password-stdin
# tag the Docker image correctly
docker image tag emr_serverless_local_image:1 ${aws_account_id}.dkr.ecr.${aws_region_name}.amazonaws.com/${ecr_name}:${image_tag}
# push the Docker image to ECR
docker push ${aws_account_id}.dkr.ecr.${aws_region_name}.amazonaws.com/${ecr_name}:${image_tag}
You may then set up your EMR Serverless application to use the Docker image stored in your ECR repository (see this documentation for a more detailed how-to on this part). Do not forget to check that the EMR version of your EMR Serverless application matches that of the base Docker image used in your Dockerfile (in my example, 7.1.0) and that the architecture matches as well (here, x86_64). Also, set a resource policy on your ECR repository to allow EMR Serverless to pull the image (see below).
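As a rough sketch of this part with the AWS CLI (the ${application_id} variable is a placeholder for your application ID, and the policy statement below reflects my understanding of the permissions EMR Serverless needs to pull an image; check the linked documentation for the authoritative version):
# allow the EMR Serverless service to pull the image from your ECR repository
aws ecr set-repository-policy \
  --repository-name ${ecr_name} \
  --policy-text '{
    "Version": "2012-10-17",
    "Statement": [{
      "Sid": "EmrServerlessCustomImageSupport",
      "Effect": "Allow",
      "Principal": {"Service": "emr-serverless.amazonaws.com"},
      "Action": ["ecr:BatchGetImage", "ecr:DescribeImages", "ecr:GetDownloadUrlForLayer"]
    }]
  }'
# point the EMR Serverless application to the custom image
# (the application may need to be stopped before its image configuration can be updated)
aws emr-serverless update-application \
  --application-id ${application_id} \
  --image-configuration "{\"imageUri\": \"${aws_account_id}.dkr.ecr.${aws_region_name}.amazonaws.com/${ecr_name}:${image_tag}\"}"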
Then you need to upload your main.py file to an S3 bucket:
aws s3 cp main.py s3://${aws_s3_bucket_name}/main.py
Finally, submit a new job to your EMR Serverless application (see this documentation for a more precise how-to on this part); do not forget to specify the S3 key of your main.py file as the job’s entry point, as in the sketch below.
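A minimal submission with the AWS CLI might look like the following sketch; ${application_id} and ${job_execution_role_arn} are placeholders for your application ID and job execution role ARN:
# submit the PySpark script uploaded to S3 as a job run on the EMR Serverless application
aws emr-serverless start-job-run \
  --application-id ${application_id} \
  --execution-role-arn ${job_execution_role_arn} \
  --job-driver "{\"sparkSubmit\": {\"entryPoint\": \"s3://${aws_s3_bucket_name}/main.py\"}}"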
Conclusion
Testing EMR Serverless jobs locally with Docker provides a reliable, cost-effective way to develop and debug your applications. By replicating the environment locally, you can ensure smoother deployments and faster iterations, saving time and resources. With the steps outlined, you’re ready to bridge local development and cloud execution seamlessly.
If you want to dive deeper into the code, you will find a working demo in this GitHub repository.