This project is built with Python CDK v2. It includes a few Spark examples that create external Hive tables on top of a sample dataset stored in S3. The jobs run with EMR on EKS.
The infrastructure deployment includes the following:
- A new S3 bucket to store sample data and job code
- An EKS cluster in a new VPC across 2 AZs
- An RDS Aurora database (MySQL engine) in the same VPC
- A small EMR on EC2 cluster in the same VPC
  - 1 master & 1 core node (m5.xlarge)
  - use the master node to query the remote hive metastore database
- An EMR virtual cluster in the same VPC
  - registered to the `emr` namespace in EKS
  - EMR on EKS configuration is done
  - connects to RDS and initializes the metastore schema via `schematool`
- A standalone Hive metastore service (HMS) in EKS
  - Helm chart `hive-metastore-chart` is provided
  - runs in the same `emr` namespace
  - a thrift server is provided for client connections
  - doesn't initialize/upgrade metastore schemas via `schematool`
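After deployment, a quick sanity check might look like the following (a sketch; the `emr` namespace and the RUNNING-state filter assume this stack's defaults):
```bash
# HMS and Spark pods run in the emr namespace
kubectl get pods -n emr
# list the EMR virtual clusters registered to EKS
aws emr-containers list-virtual-clusters \
  --query 'virtualClusters[?state==`RUNNING`].[id,name]' --output table
```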
This README walks through the following examples:
1. Connect remote Hive metastore via JDBC
2. Connect Hive via EMR on EC2
3. Connect Hive via EMR on EKS
4. Connect Hive via HMS sidecar
5. Hudi with HMS sidecar
6. Hudi with Glue catalog
7. Run Hive SQL with EMR on EKS
- Job source code: `deployment/app_code/job/`
- HMS sidecar pod template: `deployment/app_code/job/sidecar_hms_pod_template.yaml`
- Standalone hive-metastore Docker image: follow the README instructions to build your own. Don't forget to update your sidecar pod template or helm chart values file with your own ECR URL.
The provisioning takes about 30 minutes to complete. There are two ways to deploy it:
NOTE: The HMS helm chart requires k8s >= 1.23, i.e. the EKS version must be 1.23+.
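To verify the requirement, you can check the control-plane version (the cluster name is a placeholder; replace it with yours):
```bash
aws eks describe-cluster --name <YOUR_EKS_CLUSTER_NAME> --query cluster.version --output text
```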
Install the following tools:
- AWS CLI. Configure it by running `aws configure`.
- kubectl & jq

You can use AWS CloudShell, which includes all the necessary software, for a quick start.
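A quick way to confirm the tools are available before you start:
```bash
aws --version
kubectl version --client
jq --version
```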
| Region | Launch Template |
| ------ | --------------- |
| US East (N. Virginia) | ![]() |
- To launch the solution in a different AWS Region, check out the customization section below, or use the CDK deployment option.

You can customize the solution, for example, to deploy it to a different AWS Region:
```bash
export BUCKET_NAME_PREFIX=<my-bucket-name> # bucket where customized code will reside
export AWS_REGION=<your-region>
export SOLUTION_NAME=hive-emr-on-eks
export VERSION=v2.0.0 # version number for the customized code

./deployment/build-s3-dist.sh $BUCKET_NAME_PREFIX $SOLUTION_NAME $VERSION

# OPTIONAL: create the bucket where customized code will reside
aws s3 mb s3://$BUCKET_NAME_PREFIX-$AWS_REGION --region $AWS_REGION

# Upload deployment assets to the S3 bucket
aws s3 cp ./deployment/global-s3-assets/ s3://$BUCKET_NAME_PREFIX-$AWS_REGION/$SOLUTION_NAME/$VERSION/ --recursive --acl bucket-owner-full-control
aws s3 cp ./deployment/regional-s3-assets/ s3://$BUCKET_NAME_PREFIX-$AWS_REGION/$SOLUTION_NAME/$VERSION/ --recursive --acl bucket-owner-full-control

echo -e "\nIn web browser, paste the URL to launch the CFN template: https://console.aws.amazon.com/cloudformation/home?region=$AWS_REGION#/stacks/quickcreate?stackName=HiveEMRonEKS&templateURL=https://$BUCKET_NAME_PREFIX-$AWS_REGION.s3.amazonaws.com/$SOLUTION_NAME/$VERSION/HiveEMRonEKS.template\n"
```
Alternatively, deploy the infrastructure via CDK. It requires the following tools to be pre-installed as one-off tasks:
- Python 3.6+
- Node.js 10.3.0+
- CDK toolkit
- Run `cdk bootstrap` after the `pip install` step, as shown below.
```bash
python3 -m venv .env
source .env/bin/activate
pip install -r requirements.txt
# one-off per account/region: bootstrap the CDK environment, then deploy
cdk bootstrap
cdk deploy
```
Make sure AWS CLI, kubectl and jq are installed.
One-off setup:
- Set environment variables in `.bash_profile` and connect to the EKS cluster:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/post-deployment.sh | bash
source ~/.bash_profile
```
You can use Cloud9 or CloudShell if you don't want to install anything on your computer or change your bash_profile.
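As a sanity check, the script should have exported the variables used by the job-submission examples below (names taken from those examples):
```bash
echo "S3BUCKET=$S3BUCKET"
echo "VIRTUAL_CLUSTER_ID=$VIRTUAL_CLUSTER_ID"
echo "EMR_ROLE_ARN=$EMR_ROLE_ARN"
```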
- [OPTIONAL] Build the HMS Docker image and, if needed, replace the hive-metastore image name in `hive-metastore-chart/values.yaml` with your own:
```bash
cd docker
export DOCKERHUB_USERNAME=<your_dockerhub_name_OR_ECR_URL>
docker build -t $DOCKERHUB_USERNAME/hive-metastore:3.0.0 .
docker push $DOCKERHUB_USERNAME/hive-metastore:3.0.0
```
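If you push to Amazon ECR instead of Docker Hub, authenticate Docker first (a sketch; the account ID is a placeholder and the ECR repository must already exist):
```bash
aws ecr get-login-password --region $AWS_REGION | \
  docker login --username AWS --password-stdin <ACCOUNT_ID>.dkr.ecr.$AWS_REGION.amazonaws.com
```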
- Copy sample data to your S3 bucket. NOTE: amazon-reviews-pds is no longer a public dataset. Either skip this step, copy your own review data, or use another public dataset you know of.
```bash
aws s3 cp s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Toys_Games/ s3://$S3BUCKET/app_code/data/toy --recursive
```
hivejdbc.py:
```python
import sys
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config("spark.sql.warehouse.dir", sys.argv[1]+"/warehouse/") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()
spark.sql("CREATE DATABASE IF NOT EXISTS `demo`")
spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS `demo`.`amazonreview`( `marketplace` string, `customer_id` string, `review_id` string, `product_id` string, `product_title` string, `star_rating` integer, `helpful_votes` integer, `total_votes` integer, `insight` string, `review_headline` string, `review_body` string, `review_date` timestamp, `year` integer) STORED AS PARQUET LOCATION '"+sys.argv[1]+"/app_code/data/toy/'")
spark.sql("SELECT count(*) FROM demo.amazonreview").show()
spark.stop()
```
Run the script:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/job/submit-job-via-jdbc.sh | bash
```
OR
```bash
aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name spark-hive-via-jdbc \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://'$S3BUCKET'/app_code/job/hivejdbc.py",
    "entryPointArguments":["s3://'$S3BUCKET'"],
    "sparkSubmitParameters": "--conf spark.jars.packages=mysql:mysql-connector-java:8.0.28 --conf spark.driver.cores=1 --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=2"}}' \
--configuration-overrides '{
  "applicationConfiguration": [
    {
      "classification": "spark-defaults",
      "properties": {
        "spark.dynamicAllocation.enabled":"false",
        "spark.hadoop.javax.jdo.option.ConnectionDriverName": "com.mysql.cj.jdbc.Driver",
        "spark.hadoop.javax.jdo.option.ConnectionUserName": "'$USER_NAME'",
        "spark.hadoop.javax.jdo.option.ConnectionPassword": "'$PASSWORD'",
        "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:mysql://'$HOST_NAME':3306/'$DB_NAME'?createDatabaseIfNotExist=true"
      }
    }
  ],
  "monitoringConfiguration": {
    "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'
```
hivethrift_emr.py:
```python
from os import environ
import sys
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config("spark.sql.warehouse.dir", sys.argv[1]+"/warehouse/") \
    .config("hive.metastore.uris", "thrift://"+sys.argv[2]+":9083") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()
spark.sql("CREATE DATABASE IF NOT EXISTS `demo`")
spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS `demo`.`amazonreview2`( `marketplace` string, `customer_id` string, `review_id` string, `product_id` string, `product_title` string, `star_rating` integer, `helpful_votes` integer, `total_votes` integer, `insight` string, `review_headline` string, `review_body` string, `review_date` timestamp, `year` integer) STORED AS PARQUET LOCATION '"+sys.argv[1]+"/app_code/data/toy/'")
spark.sql("SELECT count(*) FROM demo.amazonreview2").show()
spark.stop()
```
Run the script:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/job/submit-job-via-thrift_emr.sh | bash
```
OR
```bash
#!/bin/bash
export STACK_NAME=HiveEMRonEKS
# look up the private DNS name of the EMR on EC2 master node
export EMR_MASTER_DNS_NAME=$(aws ec2 describe-instances --filter Name=tag:project,Values=HiveEMRonEKS Name=tag:aws:elasticmapreduce:instance-group-role,Values=MASTER --query Reservations[].Instances[].PrivateDnsName --output text | xargs)

aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name spark-hive-via-thrift \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://'$S3BUCKET'/app_code/job/hivethrift_emr.py",
    "entryPointArguments":["s3://'$S3BUCKET'","'$EMR_MASTER_DNS_NAME'"],
    "sparkSubmitParameters": "--conf spark.driver.cores=1 --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=2"}}' \
--configuration-overrides '{
  "monitoringConfiguration": {
    "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'
```
hivethrift_eks.py:
```python
from os import environ
import sys
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config("spark.sql.warehouse.dir", sys.argv[1]+"/warehouse/") \
    .config("hive.metastore.uris", "thrift://"+environ['HIVE_METASTORE_SERVICE_HOST']+":9083") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()
spark.sql("CREATE DATABASE IF NOT EXISTS `demo`")
spark.sql("DROP TABLE IF EXISTS demo.amazonreview3")
spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS `demo`.`amazonreview3`( `marketplace` string, `customer_id` string, `review_id` string, `product_id` string, `product_title` string, `star_rating` integer, `helpful_votes` integer, `total_votes` integer, `insight` string, `review_headline` string, `review_body` string, `review_date` timestamp, `year` integer) STORED AS PARQUET LOCATION '"+sys.argv[1]+"/app_code/data/toy/'")
spark.stop()
```
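The `HIVE_METASTORE_SERVICE_HOST` environment variable is injected by Kubernetes for the `hive-metastore` Service running in the same namespace as the job, so the script needs no hard-coded endpoint. To confirm the Service exists (the service and namespace names assume this project's helm chart defaults):
```bash
kubectl get svc hive-metastore -n emr
```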
Run the script:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/job/submit-job-via-thrift_eks.sh | bash
```
OR
```bash
#!/bin/bash
aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name spark-hive-via-thrift \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://'$S3BUCKET'/app_code/job/hivethrift_eks.py",
    "entryPointArguments":["s3://'$S3BUCKET'"],
    "sparkSubmitParameters": "--conf spark.driver.cores=1 --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=2"}}' \
--configuration-overrides '{
  "monitoringConfiguration": {
    "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'
```
NOTE: This repo's CFN/CDK template installs the following by default.
Kubernetes External Secrets controller - it fetches the hive metastore DB credentials from AWS Secrets Manager, which is a recommended best practice. Alternatively, without installing the controller, simply modify the HMS sidecar pod template with hard-coded DB credentials.
```bash
# does it exist?
kubectl get pod -n kube-system
```
If the controller doesn't exist in your EKS cluster, replace the variable placeholders YOUR_REGION and YOUR_IAM_ROLE_ARN_TO_GET_SECRETS_FROM_SM in the command, then run the installation. Refer to the IAM permissions used by CDK to create your IAM role.
```bash
helm repo add external-secrets https://external-secrets.github.io/kubernetes-external-secrets/
helm install external-secret external-secrets/kubernetes-external-secrets -n kube-system --set AWS_REGION=YOUR_REGION --set securityContext.fsGroup=65534 --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"='YOUR_IAM_ROLE_ARN_TO_GET_SECRETS_FROM_SM' --debug
```
When using the HMS image as a sidecar in EMR on EKS, you can either:
- **Use Built-in Templates**: Build the image with templates included (`BUILD_ENV=templates`), which contains:
  - pre-configured `metastore-site.xml` and `core-site.xml` templates
  - a self-termination script to handle the sidecar lifecycle issue
- **Use ConfigMaps**: Create two ConfigMaps in EKS that point to your custom `metastore-site.xml` and `core-site.xml` templates. This approach gives you more flexibility in configuration management.
Two sidecar ConfigMaps should be created in EKS to configure the standalone HMS. The sidecar termination script is copied from the EMR documentation to work around the well-known sidecar lifecycle issue in Kubernetes.
Check if the ConfigMaps exist:
```bash
kubectl get configmap sidecar-hms-conf-templates sidecar-terminate-script -n emr
```
If they don't exist and you want to use custom configurations, create them:
```bash
# get remote metastore RDS secret name
secret_name=$(aws secretsmanager list-secrets --query 'SecretList[?starts_with(Name,`RDSAuroraSecret`) == `true`].Name' --output text)
# download the config and apply to EKS
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/source/app_resources/hive-metastore-config.yaml | sed 's/{SECRET_MANAGER_NAME}/'$secret_name'/g' | kubectl apply -f -
```
Note: If you're using the HMS image built with templates (`BUILD_ENV=templates`), you don't need to create these ConfigMaps unless you want to override the built-in configurations.
Two HMS sidecar pod templates are available for different deployment scenarios:
- **Template with ConfigMap** (`sidecar_hms_pod_template.yaml`):
  - uses external ConfigMaps for HMS configurations
  - suitable when you want to manage configurations separately
  - requires the ConfigMaps `sidecar-hms-conf-templates` and `sidecar-terminate-script` to be present in the EKS cluster
- **Standalone Template** (`sidecar_hms_pod_template_standalone.yaml`):
  - uses built-in templates from the Docker image
  - recommended when using the image built with `BUILD_ENV=templates`
  - no external ConfigMaps required
  - simpler deployment, as all configurations are packaged in the image

Both templates need to be uploaded to an S3 bucket that your Spark job can access. Choose the appropriate template based on your configuration management preferences.
Note: Both templates use the Kubernetes secret `rds-hms-secret` for database credentials. You can alternatively hardcode these values by uncommenting and updating the corresponding environment variables in the template.
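If you skip the external-secrets controller, one alternative is to create the secret by hand. This is a sketch; the key names are assumptions and should be verified against the environment variable mappings in the pod template:
```bash
kubectl create secret generic rds-hms-secret -n emr \
  --from-literal=username=$USER_NAME \
  --from-literal=password=$PASSWORD
```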
sidecar_hivethrift_eks.py:
```python
import sys
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config("spark.sql.warehouse.dir", sys.argv[1]+"/warehouse/") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()
spark.sql("CREATE DATABASE IF NOT EXISTS `demo`")
spark.sql("DROP TABLE IF EXISTS demo.amazonreview4")
spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS `demo`.`amazonreview4`( `marketplace` string, `customer_id` string, `review_id` string, `product_id` string, `product_title` string, `star_rating` integer, `helpful_votes` integer, `total_votes` integer, `insight` string, `review_headline` string, `review_body` string, `review_date` timestamp, `year` integer) STORED AS PARQUET LOCATION '"+sys.argv[1]+"/app_code/data/toy/'")

# read semicolon-delimited queries from a .sql file on S3 and run them one by one
sql_scripts = spark.read.text(sys.argv[1]+"/app_code/job/set-of-hive-queries.sql").collect()
cmd_str = ' '.join([x[0] for x in sql_scripts]).split(';')
for query in cmd_str:
    if query != "":
        spark.sql(query).show()
spark.stop()
```
Assign the sidecar pod template to the Spark driver, then run the script:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/job/sidecar_submit-job-via-thrift_eks.sh | bash
```
OR
```bash
#!/bin/bash
# test HMS sidecar on EKS
aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name sidecar-hms \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://'$S3BUCKET'/app_code/job/sidecar_hivethrift_eks.py",
    "entryPointArguments":["s3://'$S3BUCKET'"],
    "sparkSubmitParameters": "--conf spark.driver.cores=1 --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=2"}}' \
--configuration-overrides '{
  "applicationConfiguration": [
    {
      "classification": "spark-defaults",
      "properties": {
        "spark.kubernetes.driver.podTemplateFile": "s3://'$S3BUCKET'/app_code/job/sidecar_hms_pod_template.yaml",
        "spark.hive.metastore.uris": "thrift://localhost:9083"
      }
    }
  ],
  "monitoringConfiguration": {
    "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'
```
- Sample job - HudiEMRonEKS.py
- Job submission script - sidecar_submit-hudi-hms.sh. The sidecar HMS container inside your Spark driver provides the connection to the remote hive metastore DB in RDS.
Note: the latest hudi-spark3-bundle jar is needed to support the HMS hive sync mode. The jar is included from EMR 6.5 onwards.
Run the submission script:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/job/sidecar_submit-hudi-hms.sh | bash
```
OR
```bash
aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name hudi-test1 \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://'$S3BUCKET'/app_code/job/HudiEMRonEKS.py",
    "entryPointArguments":["s3://'$S3BUCKET'"],
    "sparkSubmitParameters": "--jars https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3-bundle_2.12/0.9.0/hudi-spark3-bundle_2.12-0.9.0.jar --conf spark.executor.cores=1 --conf spark.executor.instances=2"}}' \
--configuration-overrides '{
  "applicationConfiguration": [
    {
      "classification": "spark-defaults",
      "properties": {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.sql.hive.convertMetastoreParquet": "false",
        "spark.hive.metastore.uris": "thrift://localhost:9083",
        "spark.kubernetes.driver.podTemplateFile": "s3://'$S3BUCKET'/app_code/job/sidecar_hms_pod_template.yaml"
      }}
  ],
  "monitoringConfiguration": {
    "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'
```
Note: make sure the database **default** exists in your Glue catalog.
- Same Hudi job - HudiEMRonEKS.py
- Job submission with Glue catalog - submit-hudi-glue.sh
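Before submitting, you can confirm the **default** database exists in your Glue catalog:
```bash
# fails with EntityNotFoundException if the database is missing
aws glue get-database --name default
```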
Run the submission script:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/job/submit-hudi-glue.sh | bash
```
OR
```bash
aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name hudi-test1 \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://'$S3BUCKET'/app_code/job/HudiEMRonEKS.py",
    "entryPointArguments":["s3://'$S3BUCKET'"],
    "sparkSubmitParameters": "--jars https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3-bundle_2.12/0.9.0/hudi-spark3-bundle_2.12-0.9.0.jar --conf spark.executor.cores=1 --conf spark.executor.instances=2"}}' \
--configuration-overrides '{
  "applicationConfiguration": [
    {
      "classification": "spark-defaults",
      "properties": {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.sql.hive.convertMetastoreParquet": "false",
        "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
      }}
  ],
  "monitoringConfiguration": {
    "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'
```
We can run a Hive SQL script with multiple statements using the Spark execution engine. From EMR 6.7, EMR on EKS supports running Spark SQL with a .sql file as the entrypoint script in the StartJobRun API. Make sure your AWS CLI version is 2.7.31+ or 1.25.70+.
See the full version of the sample Hive SQL script. Code snippet:
```sql
DROP DATABASE IF EXISTS hiveonspark CASCADE;
CREATE DATABASE hiveonspark;
USE hiveonspark;
--create hive managed table
CREATE TABLE IF NOT EXISTS testtable (`key` INT, `value` STRING) USING hive;
LOAD DATA LOCAL INPATH '/usr/lib/spark/examples/src/main/resources/kv1.txt' OVERWRITE INTO TABLE testtable;
SELECT * FROM testtable WHERE key=238;
```
Run the submission script:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/job/submit-sparksql.sh | bash
```
OR run the following:
```bash
aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name sparksql-test \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.8.0-latest \
--job-driver '{
  "sparkSqlJobDriver": {
    "entryPoint": "s3://'$S3BUCKET'/app_code/job/set-of-hive-queries.sql",
    "sparkSqlParameters": "-hivevar S3Bucket='$S3BUCKET' -hivevar Key_ID=238"}}' \
--configuration-overrides '{
  "applicationConfiguration": [
    {
      "classification": "spark-defaults",
      "properties": {
        "spark.sql.warehouse.dir": "s3://'$S3BUCKET'/warehouse/",
        "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
      }
    }
  ],
  "monitoringConfiguration": {
    "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'
```
In the spark-defaults config, we use the Glue catalog as the hive metastore for a serverless design, so the table can be queried in Athena.
Alternatively, we can replace that config with the standalone HMS setting `"spark.hive.metastore.uris": "thrift://hive-metastore:9083"`, which runs as a k8s pod in the namespace `emr` and points to the remote RDS hive metastore database created by this project.
NOTE: to directly submit Hive scripts to EMR on EKS, replace the following 2 attributes in the job submission script:
- change `sparkSubmitJobDriver` to `sparkSqlJobDriver`
- change `sparkSubmitParameters` to `sparkSqlParameters`
```bash
kubectl get po -n emr
kubectl logs -n emr -c spark-kubernetes-driver <YOUR-DRIVER-POD-NAME>
```
You will see the count result ("Total records on S3") in the driver log:
```
+--------+
|count(1)|
+--------+
| 4981601|
+--------+
```
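You can also check job states from the CLI, for example:
```bash
# show the name and state of recent job runs
aws emr-containers list-job-runs --virtual-cluster-id $VIRTUAL_CLUSTER_ID \
  --query 'jobRuns[].[name,state]' --output table
```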
- Hive metastore login info:
```bash
echo -e "\n host: $HOST_NAME\n DB: $DB_NAME\n password: $PASSWORD\n username: $USER_NAME\n"
```
- Find the EMR master node EC2 instance:
```bash
aws ec2 describe-instances --filter Name=tag:project,Values=$stack_name Name=tag:aws:elasticmapreduce:instance-group-role,Values=MASTER --query Reservations[].Instances[].InstanceId
```
- Go to the EC2 console and connect to the instance via Session Manager without an SSH key.
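Alternatively, if the Session Manager plugin is installed with your AWS CLI, you can connect from the terminal (the instance ID placeholder comes from the previous command):
```bash
aws ssm start-session --target <INSTANCE_ID>
```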
- Check the remote hive metastore in the MySQL DB:
```bash
mysql -u admin -P 3306 -p -h <YOUR_HOST_NAME>
Enter password: <YOUR_PASSWORD>

# Query in the metastore
MySQL[(none)]> Use HiveEMRonEKS;
MySQL[HiveEMRonEKS]> select * from DBS;
MySQL[HiveEMRonEKS]> select * from TBLS;
```
- Query Hive tables:
```bash
sudo su
hive
hive> use demo;
hive> select count(*) from amazonreview2;
Launching Job 1 out of 1
........
OK
4981601
Time taken: 23.742 seconds, Fetched: 1 row(s)
```
Job logs are also archived in S3 under:
```
s3://$S3BUCKET/elasticmapreduce/emr-containers/$VIRTUAL_CLUSTER_ID/jobs/<YOUR_JOB_ID>/containers/spark-<YOUR-JOB-ID>-driver/
```
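For example, to browse the archived logs of recent jobs:
```bash
aws s3 ls s3://$S3BUCKET/elasticmapreduce/emr-containers/$VIRTUAL_CLUSTER_ID/jobs/ --recursive | head
```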
Useful commands:
- `kubectl get pod -n emr`: list running Spark jobs
- `kubectl delete pod --all -n emr`: delete all Spark jobs
- `kubectl logs -n emr -c spark-kubernetes-driver YOUR-DRIVER-POD-NAME`: view job logs in real time
- `kubectl get node --label-columns=eks.amazonaws.com/capacityType,topology.kubernetes.io/zone`: check EKS compute capacity types and AZ distribution
Run the clean-up script with:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/delete_all.sh | bash
```
Go to the CloudFormation console and manually delete the remaining resources if needed.







