This project contains an Apache Airflow DAG that extracts data from PostgreSQL, cleans it, stores it in HDFS, and creates an external Hive table for Superset visualization. It is built with:
- Apache Airflow
- PostgreSQL
- HDFS
- Hive
- Python 3.11
To get started:

- Clone the repository:

  ```bash
  git clone https://github.com/GitHub-Nawatech-Lab/smcc-workshop.git
  cd smcc-workshop
  ```

- Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```
The DAG performs the following tasks:
- Extracts data from PostgreSQL: uses a `PythonOperator` to run a Python function that extracts data from PostgreSQL, drops missing values, and saves the cleaned data to a CSV file (see the sketch after this list).
- Uploads cleaned data to HDFS: uses a `BashOperator` to upload the cleaned CSV file to HDFS.
- Creates an external Hive table: uses a `HiveOperator` to create an external Hive table for Superset visualization.
- Loads data from HDFS into the Hive table: uses a `HiveOperator` to load the data from HDFS into the Hive table.
- Cleans up temporary files: uses a `BashOperator` to remove the temporary CSV file.
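As a rough illustration, the extraction callable might look like the sketch below; the source table (`sales`), function name, and output path are assumptions, not the repository's actual code:

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook


def extract_and_clean(output_path="/tmp/cleaned_data.csv"):
    """Extract rows from PostgreSQL, drop missing values, and write a CSV."""
    hook = PostgresHook(postgres_conn_id="smcc_postgres")
    # "sales" is a placeholder table name, not the repository's actual source.
    df = hook.get_pandas_df("SELECT * FROM sales")
    df = df.dropna()  # drop any row with a missing value
    # Write without a header row so Hive can load the file as-is.
    df.to_csv(output_path, index=False, header=False)
```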
The DAG is defined in `dag_airflow.py`. It is scheduled to run daily starting from January 1, 2024.
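A minimal sketch of how such a DAG could be wired together, reusing the callable above (task IDs, HDFS paths, and the Hive column list are illustrative assumptions, not the repository's exact code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.apache.hive.operators.hive import HiveOperator

with DAG(
    dag_id="postgres_to_superset",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
) as dag:
    extract = PythonOperator(
        task_id="extract_and_clean",
        python_callable=extract_and_clean,  # the callable sketched above
    )
    upload = BashOperator(
        task_id="upload_to_hdfs",
        # -f overwrites any file left behind by a previous run
        bash_command="hdfs dfs -put -f /tmp/cleaned_data.csv /data/staging/",
    )
    create_table = HiveOperator(
        task_id="create_hive_table",
        hive_cli_conn_id="smcc_hive",
        # The column list is a placeholder; match it to the real CSV schema.
        hql="""
            CREATE EXTERNAL TABLE IF NOT EXISTS cleaned_data (
                id INT,
                value STRING
            )
            ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
            LOCATION '/data/cleaned';
        """,
    )
    load_data = HiveOperator(
        task_id="load_into_hive",
        hive_cli_conn_id="smcc_hive",
        # LOAD DATA INPATH moves the staged file into the table's location.
        hql="LOAD DATA INPATH '/data/staging/cleaned_data.csv' INTO TABLE cleaned_data;",
    )
    cleanup = BashOperator(
        task_id="cleanup_tmp_file",
        bash_command="rm -f /tmp/cleaned_data.csv",
    )

    extract >> upload >> create_table >> load_data >> cleanup
```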
The DAG expects the following Airflow connections:

- PostgreSQL: Connection ID `smcc_postgres`
- Hive: Connection ID `smcc_hive`
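If these connections do not exist yet, they can be registered from the CLI; the hosts, ports, and credentials below are placeholders to adapt to your environment:

```bash
airflow connections add smcc_postgres \
    --conn-type postgres \
    --conn-host localhost --conn-port 5432 \
    --conn-schema smcc --conn-login airflow --conn-password airflow

airflow connections add smcc_hive \
    --conn-type hive_cli \
    --conn-host localhost --conn-port 10000
```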
To run the pipeline:

- Start the Airflow web server and scheduler:

  ```bash
  airflow webserver
  airflow scheduler
  ```

- Access the Airflow web UI and trigger the `postgres_to_superset` DAG.
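Alternatively, the DAG can be triggered from the command line:

```bash
airflow dags trigger postgres_to_superset
```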
This project is licensed under the MIT License. See the LICENSE file for details.