Apache Spark is pivotal in the big data ecosystem, processing vast amounts of data efficiently and saving significant resources and time. Kubernetes plays an increasingly crucial role in managing these big data applications. AWS EKS (Elastic Kubernetes Service) simplifies the deployment of such technologies, although integrating Spark applications within Kubernetes, especially under AWS, can present challenges due to complexities in accessing AWS resources.
The AWS Java SDK supports various authentication methods, with IRSA (IAM Role for Service Account) being the most prevalent. For those interested in learning more about IRSA, a detailed explanation is available here. Traditionally, utilising Spark with IRSA has been a cumbersome process, requiring the generation of WebIdentityToken, setting up access keys, and more. Fortunately, AWS EKS Pod Identity introduces a streamlined approach, enabling straightforward access to the necessary AWS resources without the need for complex authentication configurations in Spark. This article will guide you through deploying the latest Spark Application using AWS EKS Pod Identity in the AWS EKS environment, simplifying the previously intricate process.
Deploying Spark applications in our Kubernetes environment with AWS EKS Pod Identity involves a straightforward process:
Before diving into Spark on AWS EKS, ensure you've followed the AWS EKS Pod Identity setup steps outlined earlier. Our task will involve reading data from a raw_data folder, converting it to Parquet format, and then uploading it to a parquet_data folder, all using a sample CSV file for demonstration.
Our first step is to establish a private S3 bucket. This bucket will be configured so that our pods can both read from and write to it. For those new to AWS, I recommend reviewing the EKS Pod Identity Blog Post for a detailed walkthrough on creating an AWS S3 Bucket. Within this bucket, we'll set up two folders named raw_data and parquet_data. Then, we'll upload our sample CSV file into the raw_data folder, setting the stage for our data processing tasks.
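If you prefer the command line, the bucket and the sample file can also be prepared with the AWS CLI. This is a minimal sketch, assuming a placeholder bucket name, the eu-west-2 region used later in this post, and the sample_submission.csv file referenced in the Spark script:

# Create a private bucket
aws s3 mb s3://<bucket-name> --region eu-west-2

# Block all public access so the bucket stays private
aws s3api put-public-access-block \
  --bucket <bucket-name> \
  --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

# Upload the sample CSV; the raw_data/ prefix acts as the folder
aws s3 cp sample_submission.csv s3://<bucket-name>/raw_data/sample_submission.csv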
If you have already completed the Pod Identity configuration steps, you are in the right place. We will now create an IAM role that can read data from the raw_data folder and write data to the parquet_data folder.
You can find an example policy for this role below:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Resource": [ "Arn:aws:s3:::", "arn:aws:s3:::/raw_data/*" ], "Action": [ "s3:ListBucket", "s3:GetBucketLocation", "s3:GetObject" ] }, { "Effect": "Allow", "Resource": [ "arn:aws:s3:::/parquet_data/*" ], "Action": [ "s3:*" ] } ] }
After that, we need to associate this role with our service account through an EKS Pod Identity association, and attach the AmazonEC2ContainerRegistryReadOnly policy to our node group role so the nodes can pull images from AWS ECR (Elastic Container Registry). We will discuss AWS ECR later.
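If you followed the earlier Pod Identity post, the role and its association with the service account may already exist; otherwise, here is a minimal AWS CLI sketch. The cluster name my-eks-cluster, the role name spark-s3-role, and the policy file names are assumptions for illustration; the pod-identity-assoc-sa service account is created later in this post:

# Trust policy that lets EKS Pod Identity assume the role
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "pods.eks.amazonaws.com" },
      "Action": ["sts:AssumeRole", "sts:TagSession"]
    }
  ]
}
EOF

# Create the role and attach the S3 policy shown above (saved as s3-policy.json)
aws iam create-role --role-name spark-s3-role --assume-role-policy-document file://trust-policy.json
aws iam put-role-policy --role-name spark-s3-role --policy-name spark-s3-access --policy-document file://s3-policy.json

# Associate the role with the service account used by the Spark driver and executors
aws eks create-pod-identity-association \
  --cluster-name my-eks-cluster \
  --namespace spark-operator \
  --service-account pod-identity-assoc-sa \
  --role-arn arn:aws:iam::<account-id>:role/spark-s3-role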
Google Cloud Platform donated the spark-k8s-operator project to Kubeflow. You can find the Kubeflow README.md at https://github.com/kubeflow/spark-operator
Install with:
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update
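The operator itself can then be installed from the chart; a minimal sketch, assuming the release name spark-operator and a dedicated spark-operator namespace:

helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --create-namespace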
When you deploy the spark-k8s-operator via Helm, you might notice that the example manifests use Spark v3.1.1, and it is not immediately obvious how to run a newer release. We will create a Docker image with Spark v3.5.0, which is the latest version at the time of writing.
First, we need to create our Python file, which reads data from the raw_data folder and writes it back as Parquet, as discussed. Let's create a Python file named "spark_s3.py":
from pyspark.sql import SparkSession

# Configure Spark session
spark = SparkSession.builder \
    .appName("S3 Pod Identity Example") \
    .getOrCreate()

# Set the S3 bucket and key where data is located
source_bucket = ""
source_key = "raw_data/sample_submission.csv"  # Replace with your S3 path

# Set the S3 bucket and key where data will be uploaded
destination_bucket = ""
destination_key = "parquet_data/"  # Replace with your desired destination S3 path

# Read data from S3
df = spark.read.csv(f"s3a://{source_bucket}/{source_key}", header=True)

# Perform necessary transformations or processing on df if needed

# Write data back to S3
df.write.parquet(f"s3a://{destination_bucket}/{destination_key}")

# Stop Spark session
spark.stop()
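If you want to sanity-check the script outside the cluster first, you can run it with spark-submit and the same S3A connector packages used later in this article. Note that this relies on your local AWS credentials rather than Pod Identity, and assumes Spark is installed locally:

spark-submit \
  --packages org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.641 \
  spark_s3.py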
And create a Dockerfile:
FROM spark:python3

WORKDIR /app

COPY ./spark_s3.py .
To build the Dockerfile and upload it to AWS ECR, you can follow the steps below:
-> aws ecr get-login-password --region eu-west-2 | docker login --username AWS --password-stdin <account-id>.dkr.ecr.eu-west-2.amazonaws.com
-> docker build -t mustafa-pyspark .
-> docker tag mustafa-pyspark:latest <account-id>.dkr.ecr.eu-west-2.amazonaws.com/mustafa-pyspark:latest
-> docker push <account-id>.dkr.ecr.eu-west-2.amazonaws.com/mustafa-pyspark:latest
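Note that the push will fail if the mustafa-pyspark repository does not exist in ECR yet; it can be created beforehand with the AWS CLI:

aws ecr create-repository --repository-name mustafa-pyspark --region eu-west-2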
Before creating a Spark Application, we must create a service account and a cluster role for our application. You can find the example YAMLs below.
To create a service account:
kubectl create serviceaccount pod-identity-assoc-sa -n spark-operator
Let's create a "permission.yaml" file to create a cluster role;
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: full-access-role
rules:
  - apiGroups: [""]
    resources: ["*"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: full-access-binding
subjects:
  - kind: ServiceAccount
    name: pod-identity-assoc-sa
    namespace: spark-operator
roleRef:
  kind: ClusterRole
  name: full-access-role
  apiGroup: rbac.authorization.k8s.io
After creating the file, execute:
kubectl apply -f permission.yaml
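To confirm that the service account and RBAC objects are in place, a quick check using the names defined above:

kubectl get serviceaccount pod-identity-assoc-sa -n spark-operator
kubectl get clusterrole full-access-role
kubectl get clusterrolebinding full-access-binding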
Below is a sample YAML configuration for a Spark Application. It's important to note the compatibility requirements: Apache Spark v3.5.0 works with Hadoop-aws v3.3.4, which in turn is compatible with aws-java-sdk versions from v1.12.262. To leverage the AWS EKS Pod Identity feature, we'll ensure the use of the supported aws-java-sdk version, which is v1.12.641, in our configuration.
EKS Pod Identity support requires aws-java-sdk version 1.12.596 or later, which is why we use v1.12.641 here.
Example_pyspark.yaml:
apiVersion: "sparkoperator.k8s.io/v1beta2" kind: SparkApplication metadata: name: pyspark-pi namespace: spark-operator spec: type: Python pythonVersion: "3" mode: cluster image: ".dkr.ecr.eu-west-2.amazonaws.com/mustafa-pyspark:latest" imagePullPolicy: Always mainApplicationFile: local:///app/spark_s3.py sparkVersion: "3.5.0" restartPolicy: type: OnFailure onFailureRetries: 3 onFailureRetryInterval: 10 onSubmissionFailureRetries: 5 onSubmissionFailureRetryInterval: 20 hadoopConf: fs.s3a.path.style.access: 'true' fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem fs.s3a.endpoint: s3.eu-west-2.amazonaws.com driver: cores: 1 coreLimit: "1200m" memory: "512m" labels: version: 3.5.0 serviceAccount: pod-identity-assoc-sa executor: cores: 1 instances: 1 memory: "512m" labels: version: 3.5.0 serviceAccount: pod-identity-assoc-sa deps: jars: - https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar - https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.641/aws-java-sdk-bundle-1.12.641.jar
Lastly, execute:
kubectl apply -f "Example_pyspark.yaml".
After that, the spark-k8s-operator will create a pod for the Spark driver, and the driver will then launch the Spark executor pod automatically.
You can list the pods using the "kubectl get pods -n spark-operator" command. To see the logs inside a pod (driver or executor), execute "kubectl logs <pod_name> -n spark-operator".
You will see the pod status below:
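Beyond the pods, the operator also tracks the overall job state on the SparkApplication resource, which you can inspect as well:

kubectl get sparkapplications -n spark-operator
kubectl describe sparkapplication pyspark-pi -n spark-operator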
After that, we can check the parquet_data folder inside the S3 bucket.
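You can do this in the S3 console, or with a quick listing from the CLI (placeholder bucket name again):

aws s3 ls s3://<bucket-name>/parquet_data/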
Working with Spark in a Kubernetes setting can be tough, especially when dealing with complex setups like IRSA for data processing in the big data field. These traditional methods demand a lot of complex setup work. However, the introduction of AWS EKS Pod Identity has changed the game. It simplifies how we connect our pods to AWS services, allowing us to set up Spark on AWS through a few easy steps. This development makes managing Spark in Kubernetes much simpler.