This project is built with Python CDK v2. It includes a few Spark examples that create external Hive tables on top of a sample dataset stored in S3. The jobs run with EMR on EKS.
The infrastructure deployment includes the following:
- A new S3 bucket to store sample data and job code
- An EKS cluster in a new VPC across 2 AZs
- An RDS Aurora database (MySQL engine) in the same VPC
- A small EMR on EC2 cluster in the same VPC
  - 1 master & 1 core node (m5.xlarge)
  - use the master node to query the remote hive metastore database
- An EMR virtual cluster in the same VPC
  - registered to the `emr` namespace in EKS
  - EMR on EKS configuration is done
  - connects to RDS and initializes the metastore schema via `schematool`
- A standalone Hive metastore service (HMS) in EKS
  - Helm chart `hive-metastore-chart` is provided
  - runs in the same `emr` namespace
  - a thrift server is provided for client connections
  - doesn't initialize/upgrade metastore schemas via `schematool`
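After deployment, a quick sanity check might look like the following (a sketch; the `emr` namespace and the RUNNING-state filter assume this stack's defaults):
```bash
# HMS and Spark pods run in the emr namespace
kubectl get pods -n emr
# list the EMR virtual clusters registered to EKS
aws emr-containers list-virtual-clusters \
  --query 'virtualClusters[?state==`RUNNING`].[id,name]' --output table
```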
This README walks through the following examples:
1. Connect remote Hive metastore via JDBC
2. Connect Hive via EMR on EC2
3. Connect Hive via EMR on EKS
4. Connect Hive via HMS sidecar
5. Hudi with HMS sidecar
6. Hudi with Glue catalog
7. Run Hive SQL with EMR on EKS
- Job source code: `deployment/app_code/job/`
- HMS sidecar pod template: `deployment/app_code/job/sidecar_hms_pod_template.yaml`
- Standalone hive-metastore Docker image: follow the README instructions to build your own. Don't forget to update your sidecar pod template or helm chart values file with your own ECR URL.
The provisioning takes about 30 minutes to complete. There are two ways to deploy it:
NOTE: The HMS helm chart requires k8s >= 1.23, i.e. the EKS version must be 1.23+.
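To verify the requirement, you can check the control-plane version (the cluster name is a placeholder; replace it with yours):
```bash
aws eks describe-cluster --name <YOUR_EKS_CLUSTER_NAME> --query cluster.version --output text
```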
Install the following tools:
- AWS CLI. Configure it by running `aws configure`.
- kubectl & jq

You can use AWS CloudShell, which includes all the necessary software, for a quick start.
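A quick way to confirm the tools are available before you start:
```bash
aws --version
kubectl version --client
jq --version
```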
| Region | Launch Template |
| ------ | --------------- |
| US East (N. Virginia) | ![]() |
- To launch the solution in a different AWS Region, check out the customization section below, or use the CDK deployment option.

You can customize the solution, for example, to deploy it to a different AWS Region:
```bash
export BUCKET_NAME_PREFIX=<my-bucket-name> # bucket where customized code will reside
export AWS_REGION=<your-region>
export SOLUTION_NAME=hive-emr-on-eks
export VERSION=v2.0.0 # version number for the customized code

./deployment/build-s3-dist.sh $BUCKET_NAME_PREFIX $SOLUTION_NAME $VERSION

# OPTIONAL: create the bucket where customized code will reside
aws s3 mb s3://$BUCKET_NAME_PREFIX-$AWS_REGION --region $AWS_REGION

# Upload deployment assets to the S3 bucket
aws s3 cp ./deployment/global-s3-assets/ s3://$BUCKET_NAME_PREFIX-$AWS_REGION/$SOLUTION_NAME/$VERSION/ --recursive --acl bucket-owner-full-control
aws s3 cp ./deployment/regional-s3-assets/ s3://$BUCKET_NAME_PREFIX-$AWS_REGION/$SOLUTION_NAME/$VERSION/ --recursive --acl bucket-owner-full-control

echo -e "\nIn web browser, paste the URL to launch the CFN template: https://console.aws.amazon.com/cloudformation/home?region=$AWS_REGION#/stacks/quickcreate?stackName=HiveEMRonEKS&templateURL=https://$BUCKET_NAME_PREFIX-$AWS_REGION.s3.amazonaws.com/$SOLUTION_NAME/$VERSION/HiveEMRonEKS.template\n"
```
Alternatively, deploy the infrastructure via CDK. It requires the following tools to be pre-installed as one-off tasks:
- Python 3.6+
- Node.js 10.3.0+
- CDK toolkit
- Run `cdk bootstrap` after the `pip install` step, as shown below.
```bash
python3 -m venv .env
source .env/bin/activate
pip install -r requirements.txt
# one-off per account/region: bootstrap the CDK environment, then deploy
cdk bootstrap
cdk deploy
```
Make sure AWS CLI, kubectl and jq are installed.
One-off setup:
- Set environment variables in `.bash_profile` and connect to the EKS cluster:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/post-deployment.sh | bash
source ~/.bash_profile
```
You can use Cloud9 or CloudShell if you don't want to install anything on your computer or change your bash_profile.
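As a sanity check, the script should have exported the variables used by the job-submission examples below (names taken from those examples):
```bash
echo "S3BUCKET=$S3BUCKET"
echo "VIRTUAL_CLUSTER_ID=$VIRTUAL_CLUSTER_ID"
echo "EMR_ROLE_ARN=$EMR_ROLE_ARN"
```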
- [OPTIONAL] Build the HMS Docker image and, if needed, replace the hive-metastore image name in `hive-metastore-chart/values.yaml` with your own:
```bash
cd docker
export DOCKERHUB_USERNAME=<your_dockerhub_name_OR_ECR_URL>
docker build -t $DOCKERHUB_USERNAME/hive-metastore:3.0.0 .
docker push $DOCKERHUB_USERNAME/hive-metastore:3.0.0
```
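If you push to Amazon ECR instead of Docker Hub, authenticate Docker first (a sketch; the account ID is a placeholder and the ECR repository must already exist):
```bash
aws ecr get-login-password --region $AWS_REGION | \
  docker login --username AWS --password-stdin <ACCOUNT_ID>.dkr.ecr.$AWS_REGION.amazonaws.com
```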
- Copy sample data to your S3 bucket. NOTE: amazon-reviews-pds is no longer a public dataset. Either skip this step, copy your own review data, or use another public dataset you know of.
```bash
aws s3 cp s3://aws-bigdata-blog/generated_synthetic_reviews/data/product_category=Toys_Games/ s3://$S3BUCKET/app_code/data/toy --recursive
```
hivejdbc.py:
```python
import sys
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config("spark.sql.warehouse.dir", sys.argv[1]+"/warehouse/") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()
spark.sql("CREATE DATABASE IF NOT EXISTS `demo`")
spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS `demo`.`amazonreview`( `marketplace` string, `customer_id` string, `review_id` string, `product_id` string, `product_title` string, `star_rating` integer, `helpful_votes` integer, `total_votes` integer, `insight` string, `review_headline` string, `review_body` string, `review_date` timestamp, `year` integer) STORED AS PARQUET LOCATION '"+sys.argv[1]+"/app_code/data/toy/'")
spark.sql("SELECT count(*) FROM demo.amazonreview").show()
spark.stop()
```
Run the script:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/job/submit-job-via-jdbc.sh | bash
```
OR
```bash
aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name spark-hive-via-jdbc \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://'$S3BUCKET'/app_code/job/hivejdbc.py",
    "entryPointArguments":["s3://'$S3BUCKET'"],
    "sparkSubmitParameters": "--conf spark.jars.packages=mysql:mysql-connector-java:8.0.28 --conf spark.driver.cores=1 --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=2"}}' \
--configuration-overrides '{
  "applicationConfiguration": [
    {
      "classification": "spark-defaults",
      "properties": {
        "spark.dynamicAllocation.enabled":"false",
        "spark.hadoop.javax.jdo.option.ConnectionDriverName": "com.mysql.cj.jdbc.Driver",
        "spark.hadoop.javax.jdo.option.ConnectionUserName": "'$USER_NAME'",
        "spark.hadoop.javax.jdo.option.ConnectionPassword": "'$PASSWORD'",
        "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:mysql://'$HOST_NAME':3306/'$DB_NAME'?createDatabaseIfNotExist=true"
      }
    }
  ],
  "monitoringConfiguration": {
    "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'
```
hivethrift_emr.py:
```python
from os import environ
import sys
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config("spark.sql.warehouse.dir", sys.argv[1]+"/warehouse/") \
    .config("hive.metastore.uris", "thrift://"+sys.argv[2]+":9083") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()
spark.sql("CREATE DATABASE IF NOT EXISTS `demo`")
spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS `demo`.`amazonreview2`( `marketplace` string, `customer_id` string, `review_id` string, `product_id` string, `product_title` string, `star_rating` integer, `helpful_votes` integer, `total_votes` integer, `insight` string, `review_headline` string, `review_body` string, `review_date` timestamp, `year` integer) STORED AS PARQUET LOCATION '"+sys.argv[1]+"/app_code/data/toy/'")
spark.sql("SELECT count(*) FROM demo.amazonreview2").show()
spark.stop()
```
Run the script:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/job/submit-job-via-thrift_emr.sh | bash
```
OR
```bash
#!/bin/bash
export STACK_NAME=HiveEMRonEKS
# look up the private DNS name of the EMR on EC2 master node
export EMR_MASTER_DNS_NAME=$(aws ec2 describe-instances --filter Name=tag:project,Values=HiveEMRonEKS Name=tag:aws:elasticmapreduce:instance-group-role,Values=MASTER --query Reservations[].Instances[].PrivateDnsName --output text | xargs)

aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name spark-hive-via-thrift \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://'$S3BUCKET'/app_code/job/hivethrift_emr.py",
    "entryPointArguments":["s3://'$S3BUCKET'","'$EMR_MASTER_DNS_NAME'"],
    "sparkSubmitParameters": "--conf spark.driver.cores=1 --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=2"}}' \
--configuration-overrides '{
  "monitoringConfiguration": {
    "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'
```
hivethrift_eks.py:
```python
from os import environ
import sys
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config("spark.sql.warehouse.dir", sys.argv[1]+"/warehouse/") \
    .config("hive.metastore.uris", "thrift://"+environ['HIVE_METASTORE_SERVICE_HOST']+":9083") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()
spark.sql("CREATE DATABASE IF NOT EXISTS `demo`")
spark.sql("DROP TABLE IF EXISTS demo.amazonreview3")
spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS `demo`.`amazonreview3`( `marketplace` string, `customer_id` string, `review_id` string, `product_id` string, `product_title` string, `star_rating` integer, `helpful_votes` integer, `total_votes` integer, `insight` string, `review_headline` string, `review_body` string, `review_date` timestamp, `year` integer) STORED AS PARQUET LOCATION '"+sys.argv[1]+"/app_code/data/toy/'")
spark.stop()
```
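The `HIVE_METASTORE_SERVICE_HOST` environment variable is injected by Kubernetes for the `hive-metastore` Service running in the same namespace as the job, so the script needs no hard-coded endpoint. To confirm the Service exists (the service and namespace names assume this project's helm chart defaults):
```bash
kubectl get svc hive-metastore -n emr
```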
Run the script:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/job/submit-job-via-thrift_eks.sh | bash
```
OR
```bash
#!/bin/bash
aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name spark-hive-via-thrift \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://'$S3BUCKET'/app_code/job/hivethrift_eks.py",
    "entryPointArguments":["s3://'$S3BUCKET'"],
    "sparkSubmitParameters": "--conf spark.driver.cores=1 --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=2"}}' \
--configuration-overrides '{
  "monitoringConfiguration": {
    "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'
```
NOTE: This repo's CFN/CDK template installs the following by default.
Kubernetes External Secrets controller - it fetches the hive metastore DB credentials from AWS Secrets Manager, which is a recommended best practice. Alternatively, without installing the controller, simply modify the HMS sidecar pod template with hard-coded DB credentials.
```bash
# does it exist?
kubectl get pod -n kube-system
```
If the controller doesn't exist in your EKS cluster, replace the variable placeholders YOUR_REGION and YOUR_IAM_ROLE_ARN_TO_GET_SECRETS_FROM_SM in the command, then run the installation. Refer to the IAM permissions used by CDK to create your IAM role.
```bash
helm repo add external-secrets https://external-secrets.github.io/kubernetes-external-secrets/
helm install external-secret external-secrets/kubernetes-external-secrets -n kube-system --set AWS_REGION=YOUR_REGION --set securityContext.fsGroup=65534 --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"='YOUR_IAM_ROLE_ARN_TO_GET_SECRETS_FROM_SM' --debug
```
When using the HMS image as a sidecar in EMR on EKS, you can either:
- **Use Built-in Templates**: Build the image with templates included (`BUILD_ENV=templates`), which contains:
  - pre-configured `metastore-site.xml` and `core-site.xml` templates
  - a self-termination script to handle the sidecar lifecycle issue
- **Use ConfigMaps**: Create two ConfigMaps in EKS that point to your custom `metastore-site.xml` and `core-site.xml` templates. This approach gives you more flexibility in configuration management.
Two sidecar ConfigMaps should be created in EKS to configure the standalone HMS. The sidecar termination script is copied from the EMR documentation to work around the well-known sidecar lifecycle issue in Kubernetes.
Check if the ConfigMaps exist:
```bash
kubectl get configmap sidecar-hms-conf-templates sidecar-terminate-script -n emr
```
If they don't exist and you want to use custom configurations, create them:
```bash
# get remote metastore RDS secret name
secret_name=$(aws secretsmanager list-secrets --query 'SecretList[?starts_with(Name,`RDSAuroraSecret`) == `true`].Name' --output text)
# download the config and apply to EKS
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/source/app_resources/hive-metastore-config.yaml | sed 's/{SECRET_MANAGER_NAME}/'$secret_name'/g' | kubectl apply -f -
```
Note: If you're using the HMS image built with templates (`BUILD_ENV=templates`), you don't need to create these ConfigMaps unless you want to override the built-in configurations.
Two HMS sidecar pod templates are available for different deployment scenarios:
- **Template with ConfigMap** (`sidecar_hms_pod_template.yaml`):
  - uses external ConfigMaps for HMS configurations
  - suitable when you want to manage configurations separately
  - requires the ConfigMaps `sidecar-hms-conf-templates` and `sidecar-terminate-script` to be present in the EKS cluster
- **Standalone Template** (`sidecar_hms_pod_template_standalone.yaml`):
  - uses built-in templates from the Docker image
  - recommended when using the image built with `BUILD_ENV=templates`
  - no external ConfigMaps required
  - simpler deployment, as all configurations are packaged in the image

Both templates need to be uploaded to an S3 bucket that your Spark job can access. Choose the appropriate template based on your configuration management preferences.
Note: Both templates use the Kubernetes secret `rds-hms-secret` for database credentials. You can alternatively hardcode these values by uncommenting and updating the corresponding environment variables in the template.
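If you skip the external-secrets controller, one alternative is to create the secret by hand. This is a sketch; the key names are assumptions and should be verified against the environment variable mappings in the pod template:
```bash
kubectl create secret generic rds-hms-secret -n emr \
  --from-literal=username=$USER_NAME \
  --from-literal=password=$PASSWORD
```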
sidecar_hivethrift_eks.py:
```python
import sys
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config("spark.sql.warehouse.dir", sys.argv[1]+"/warehouse/") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()
spark.sql("CREATE DATABASE IF NOT EXISTS `demo`")
spark.sql("DROP TABLE IF EXISTS demo.amazonreview4")
spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS `demo`.`amazonreview4`( `marketplace` string, `customer_id` string, `review_id` string, `product_id` string, `product_title` string, `star_rating` integer, `helpful_votes` integer, `total_votes` integer, `insight` string, `review_headline` string, `review_body` string, `review_date` timestamp, `year` integer) STORED AS PARQUET LOCATION '"+sys.argv[1]+"/app_code/data/toy/'")

# read semicolon-delimited queries from a .sql file on S3 and run them one by one
sql_scripts = spark.read.text(sys.argv[1]+"/app_code/job/set-of-hive-queries.sql").collect()
cmd_str = ' '.join([x[0] for x in sql_scripts]).split(';')
for query in cmd_str:
    if query != "":
        spark.sql(query).show()
spark.stop()
```
Assign the sidecar pod template to the Spark driver, then run the script:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/job/sidecar_submit-job-via-thrift_eks.sh | bash
```
OR
```bash
#!/bin/bash
# test HMS sidecar on EKS
aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name sidecar-hms \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://'$S3BUCKET'/app_code/job/sidecar_hivethrift_eks.py",
    "entryPointArguments":["s3://'$S3BUCKET'"],
    "sparkSubmitParameters": "--conf spark.driver.cores=1 --conf spark.executor.memory=4G --conf spark.driver.memory=1G --conf spark.executor.cores=2"}}' \
--configuration-overrides '{
  "applicationConfiguration": [
    {
      "classification": "spark-defaults",
      "properties": {
        "spark.kubernetes.driver.podTemplateFile": "s3://'$S3BUCKET'/app_code/job/sidecar_hms_pod_template.yaml",
        "spark.hive.metastore.uris": "thrift://localhost:9083"
      }
    }
  ],
  "monitoringConfiguration": {
    "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'
```
- Sample job - HudiEMRonEKS.py
- Job submission script - sidecar_submit-hudi-hms.sh. The sidecar HMS container inside your Spark driver provides the connection to the remote hive metastore DB in RDS.
Note: the latest hudi-spark3-bundle jar is needed to support the HMS hive sync mode. The jar is included from EMR 6.5 onwards.
Run the submission script:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/job/sidecar_submit-hudi-hms.sh | bash
```
OR
```bash
aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name hudi-test1 \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://'$S3BUCKET'/app_code/job/HudiEMRonEKS.py",
    "entryPointArguments":["s3://'$S3BUCKET'"],
    "sparkSubmitParameters": "--jars https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3-bundle_2.12/0.9.0/hudi-spark3-bundle_2.12-0.9.0.jar --conf spark.executor.cores=1 --conf spark.executor.instances=2"}}' \
--configuration-overrides '{
  "applicationConfiguration": [
    {
      "classification": "spark-defaults",
      "properties": {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.sql.hive.convertMetastoreParquet": "false",
        "spark.hive.metastore.uris": "thrift://localhost:9083",
        "spark.kubernetes.driver.podTemplateFile": "s3://'$S3BUCKET'/app_code/job/sidecar_hms_pod_template.yaml"
      }}
  ],
  "monitoringConfiguration": {
    "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'
```
Note: make sure the database **default** exists in your Glue catalog.
- Same Hudi job - HudiEMRonEKS.py
- Job submission with Glue catalog - submit-hudi-glue.sh
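Before submitting, you can confirm the **default** database exists in your Glue catalog:
```bash
# fails with EntityNotFoundException if the database is missing
aws glue get-database --name default
```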
Run the submission script:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/job/submit-hudi-glue.sh | bash
```
OR
```bash
aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name hudi-test1 \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.3.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://'$S3BUCKET'/app_code/job/HudiEMRonEKS.py",
    "entryPointArguments":["s3://'$S3BUCKET'"],
    "sparkSubmitParameters": "--jars https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3-bundle_2.12/0.9.0/hudi-spark3-bundle_2.12-0.9.0.jar --conf spark.executor.cores=1 --conf spark.executor.instances=2"}}' \
--configuration-overrides '{
  "applicationConfiguration": [
    {
      "classification": "spark-defaults",
      "properties": {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.sql.hive.convertMetastoreParquet": "false",
        "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
      }}
  ],
  "monitoringConfiguration": {
    "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'
```
We can run a Hive SQL script with multiple statements using the Spark execution engine. From EMR 6.7, EMR on EKS supports running Spark SQL with a .sql file as the entrypoint script in the StartJobRun API. Make sure your AWS CLI version is 2.7.31+ or 1.25.70+.
See the full version of the sample Hive SQL script. Code snippet:
```sql
DROP DATABASE IF EXISTS hiveonspark CASCADE;
CREATE DATABASE hiveonspark;
USE hiveonspark;
--create hive managed table
CREATE TABLE IF NOT EXISTS testtable (`key` INT, `value` STRING) USING hive;
LOAD DATA LOCAL INPATH '/usr/lib/spark/examples/src/main/resources/kv1.txt' OVERWRITE INTO TABLE testtable;
SELECT * FROM testtable WHERE key=238;
```
Run the submission script:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/job/submit-sparksql.sh | bash
```
OR run the following:
```bash
aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name sparksql-test \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-6.8.0-latest \
--job-driver '{
  "sparkSqlJobDriver": {
    "entryPoint": "s3://'$S3BUCKET'/app_code/job/set-of-hive-queries.sql",
    "sparkSqlParameters": "-hivevar S3Bucket='$S3BUCKET' -hivevar Key_ID=238"}}' \
--configuration-overrides '{
  "applicationConfiguration": [
    {
      "classification": "spark-defaults",
      "properties": {
        "spark.sql.warehouse.dir": "s3://'$S3BUCKET'/warehouse/",
        "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
      }
    }
  ],
  "monitoringConfiguration": {
    "s3MonitoringConfiguration": {"logUri": "s3://'$S3BUCKET'/elasticmapreduce/emr-containers"}}}'
```
In the spark-defaults config, we use the Glue catalog as the hive metastore for a serverless design, so the table can be queried in Athena.
Alternatively, we can replace that config with the standalone HMS setting `"spark.hive.metastore.uris": "thrift://hive-metastore:9083"`, which runs as a k8s pod in the namespace `emr` and points to the remote RDS hive metastore database created by this project.
NOTE: to directly submit Hive scripts to EMR on EKS, replace the following 2 attributes in the job submission script:
- change `sparkSubmitJobDriver` to `sparkSqlJobDriver`
- change `sparkSubmitParameters` to `sparkSqlParameters`
```bash
kubectl get po -n emr
kubectl logs -n emr -c spark-kubernetes-driver <YOUR-DRIVER-POD-NAME>
```
You will see the count result ("Total records on S3") in the driver log:
```
+--------+
|count(1)|
+--------+
| 4981601|
+--------+
```
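You can also check job states from the CLI, for example:
```bash
# show the name and state of recent job runs
aws emr-containers list-job-runs --virtual-cluster-id $VIRTUAL_CLUSTER_ID \
  --query 'jobRuns[].[name,state]' --output table
```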
- Hive metastore login info:
```bash
echo -e "\n host: $HOST_NAME\n DB: $DB_NAME\n password: $PASSWORD\n username: $USER_NAME\n"
```
- Find the EMR master node EC2 instance:
```bash
aws ec2 describe-instances --filter Name=tag:project,Values=$stack_name Name=tag:aws:elasticmapreduce:instance-group-role,Values=MASTER --query Reservations[].Instances[].InstanceId
```
- Go to the EC2 console and connect to the instance via Session Manager without an SSH key.
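Alternatively, if the Session Manager plugin is installed with your AWS CLI, you can connect from the terminal (the instance ID placeholder comes from the previous command):
```bash
aws ssm start-session --target <INSTANCE_ID>
```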
- Check the remote hive metastore in the MySQL DB:
```bash
mysql -u admin -P 3306 -p -h <YOUR_HOST_NAME>
Enter password: <YOUR_PASSWORD>

# Query in the metastore
MySQL[(none)]> Use HiveEMRonEKS;
MySQL[HiveEMRonEKS]> select * from DBS;
MySQL[HiveEMRonEKS]> select * from TBLS;
```
- Query Hive tables:
```bash
sudo su
hive
hive> use demo;
hive> select count(*) from amazonreview2;
Launching Job 1 out of 1
........
OK
4981601
Time taken: 23.742 seconds, Fetched: 1 row(s)
```
Job logs are also archived in S3 under:
```
s3://$S3BUCKET/elasticmapreduce/emr-containers/$VIRTUAL_CLUSTER_ID/jobs/<YOUR_JOB_ID>/containers/spark-<YOUR-JOB-ID>-driver/
```
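For example, to browse the archived logs of recent jobs:
```bash
aws s3 ls s3://$S3BUCKET/elasticmapreduce/emr-containers/$VIRTUAL_CLUSTER_ID/jobs/ --recursive | head
```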
Useful commands:
- `kubectl get pod -n emr`: list running Spark jobs
- `kubectl delete pod --all -n emr`: delete all Spark jobs
- `kubectl logs -n emr -c spark-kubernetes-driver YOUR-DRIVER-POD-NAME`: view job logs in real time
- `kubectl get node --label-columns=eks.amazonaws.com/capacityType,topology.kubernetes.io/zone`: check EKS compute capacity types and AZ distribution
Run the clean-up script with:
```bash
curl https://raw.githubusercontent.com/aws-samples/hive-emr-on-eks/main/deployment/app_code/delete_all.sh | bash
```
Go to the CloudFormation console and manually delete the remaining resources if needed.







