Creating a Simple ETL Pipeline Using RedditAPI and Apache Airflow

Python scripts can handle the work of downloading, extracting, and uploading data to AWS S3. The challenge is to accomplish this without relying on a local machine or an on-premises server to hold the downloaded files. To address this, we use AWS EC2 to host the Python script responsible for the download, extraction, and upload to S3, selecting an instance with an appropriate configuration from the options AWS offers.

Using Python, we will call the RedditAPI to extract data from Reddit, transform that data, and finally load it into an S3 bucket. We will then set up an Airflow instance and deploy the Python scripts to schedule the task.

The above will be carried out in the following steps:

  1. Spin up an EC2 instance and create an S3 bucket for the project.

  2. Install the required packages for the pipeline project.

  3. Deploy the Python scripts and schedule the workflow on Airflow.

Here is our architecture:

Step 1A: Stand up the EC2 Instance.

To get started with AWS EC2 for your pipeline project, log in to your AWS account and go to Services > EC2. Launch an instance and choose the Ubuntu AMI. On the next screen, select an instance size appropriate to your workload: a t2.medium is enough for small workloads, while a larger instance such as an m5a.xlarge is better suited to production. Make sure to enable the "Auto-assign Public IP" option, and accept the default settings on the remaining screens.

EC2 Instance

When setting up the security group, add a rule opening port 8080 to public access; this is the port used to connect to the Airflow web server.

Security group

Finally, we attach an IAM role to the EC2 instance so that it has permission to write to the S3 bucket.

IAM Role

Step 1B: Create the S3 bucket where the data will be stored.

S3 bucket

Step 2: Install required packages for the ETL pipeline project

We SSH into the EC2 instance, create a virtual environment, and install the required packages.

i. Update the Ubuntu package lists.

sudo apt-get update

Updating dependencies

ii. Then we create a virtual environment for our project in the EC2 home directory and activate it.

# to install the python3.10-venv package
sudo apt install python3.10-venv

# to create a virtual environment
sudo python3 -m venv airflow_reddit/env

# to activate the virtual environment
source ~/airflow_reddit/env/bin/activate

Installing the python3.10-venv package

iii. Next, we install Apache Airflow, an open-source platform for orchestrating complex data workflows and pipelines.

sudo pip install apache-airflow

Airflow installation

# to confirm the Airflow installation
airflow

Confirming Airflow installation

iv. Then, we install pandas, a Python library for data analysis and manipulation that gives us efficient data handling.

sudo pip install pandas

Installing Pandas
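To illustrate the kind of transformation the pipeline performs, here is a minimal sketch that flattens a list of Reddit post dictionaries into a DataFrame. It is illustrative only; the field names are assumptions and may differ from the script in the repository.

import pandas as pd

# Hypothetical sample of extracted Reddit posts; field names are illustrative.
posts = [
    {"title": "First post", "author": "user_a", "score": 42, "num_comments": 7},
    {"title": "Second post", "author": "user_b", "score": 17, "num_comments": 3},
]

# Flatten the list of dictionaries into a tabular structure
df = pd.DataFrame(posts)

# Example transformation: keep only the columns of interest
df = df[["title", "author", "score", "num_comments"]]
print(df.head())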

v. Next, we install s3fs, a Python library that provides a file-system interface to Amazon S3, allowing us to read data from and write data to S3 using familiar file operations.


sudo pip install s3fs

Installing S3fs
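As a quick illustration of what s3fs enables, the sketch below writes a DataFrame straight to S3 with pandas. The bucket name is a placeholder, not the bucket used in this project, and the instance's IAM role supplies the credentials.

import pandas as pd

# Hypothetical DataFrame standing in for the transformed Reddit data
df = pd.DataFrame({"title": ["example post"], "score": [10]})

# With s3fs installed, pandas can write directly to an S3 path;
# "my-reddit-bucket" is a placeholder bucket name.
df.to_csv("s3://my-reddit-bucket/reddit_posts.csv", index=False)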

Step 3: Deploy the code and schedule our workflow on Airflow

This typically involves defining tasks, dependencies, and time-based triggers or intervals within an Airflow DAG (Directed Acyclic Graph), which represents the workflow, and then using Airflow’s scheduling capabilities to automatically trigger the execution of tasks based on the defined schedule. Airflow provides a visual interface for designing, monitoring, and managing workflows, making it a powerful tool for automating and orchestrating data workflows in a scalable and efficient manner.
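As a rough sketch of what such a DAG looks like (this is not the exact code from the project; the task name, schedule, and dates are assumptions):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callable; the real project calls the Reddit ETL function here.
def run_reddit_etl():
    print("extract, transform, and load Reddit data")

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="ETL_DAG",                 # the name that appears in the Airflow UI
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",       # time-based trigger
    catchup=False,
) as dag:
    etl_task = PythonOperator(
        task_id="reddit_etl",
        python_callable=run_reddit_etl,
    )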

a) We start a standalone Airflow web server for the ETL pipeline project. Standalone mode runs the web server, the scheduler, and a local metadata database together in a single process, so nothing needs to be set up separately, which makes it useful for quickly spinning up a local instance of Airflow for experimentation or development.

Note: the standalone mode is not recommended for production use, as it does not provide the full functionality and scalability of a production-ready Airflow installation with a separate database, scheduler, and other necessary components.

Run the following command to start a standalone Airflow web server.

airflow standalone

Creating standalone Airflow web server

Note: a login credential is printed in the terminal for accessing the Airflow console. Copy this and keep it for the next part.

b) Next, we access the Airflow console in a browser via the EC2 instance's public DNS on port 8080 (http://<EC2-public-DNS>:8080).

Public IPV4 DNS

Then enter the login credentials stated above to log into the Airflow console.

Airflow Console Login page

Airflow Console Home page

c) To deploy the Python scripts on the EC2 instance, navigate to the ‘airflow’ folder in the home directory and create a directory called ‘etl_reddit_dag’, where we will place the Python scripts for the project.

Visit my GitHub repository for the full code used in this project.
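For orientation, here is a minimal sketch of the shape such a script can take: extract posts from Reddit, flatten them with pandas, and load the result into S3. This is not the repository code; the subreddit, endpoint, field names, and bucket name are illustrative assumptions.

import pandas as pd
import requests

def run_reddit_etl():
    # Extract: fetch recent posts from a subreddit via Reddit's public JSON
    # endpoint (the project code may use authenticated RedditAPI calls instead)
    url = "https://www.reddit.com/r/dataengineering/new.json?limit=50"
    response = requests.get(url, headers={"User-Agent": "etl-pipeline-demo"})
    response.raise_for_status()
    posts = [child["data"] for child in response.json()["data"]["children"]]

    # Transform: keep a few fields and flatten into a DataFrame
    df = pd.DataFrame(posts)[["title", "author", "score", "num_comments", "created_utc"]]

    # Load: write directly to S3 using s3fs and the EC2 instance's IAM role;
    # "my-reddit-bucket" is a placeholder bucket name.
    df.to_csv("s3://my-reddit-bucket/reddit_posts.csv", index=False)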

Next, we edit the ‘dags_folder’ variable in the airflow.cfg file (located in the ~/airflow directory) and set its value to the absolute path of the ‘etl_reddit_dag’ directory. To do this, we open airflow.cfg with the Nano editor and make the change.

dags_folder = /home/ubuntu/airflow/etl_reddit_dag

d) Click on the play button at the top right corner of the screen and click on “Trigger DAG”.

Triggering ‘ETL_DAG’ DAG

DAG executes successfully

e) Now, we navigate to the S3 bucket to confirm that the data was stored successfully.

Data stored in S3 bucket
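The same check can be run from the EC2 instance itself; here is a minimal sketch using s3fs (the bucket name is again a placeholder):

import s3fs

# List the contents of the project bucket; "my-reddit-bucket" is a placeholder
fs = s3fs.S3FileSystem()
print(fs.ls("my-reddit-bucket"))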

Summary

This article outlines the steps to create a simple ETL (Extract, Transform, Load) pipeline using RedditAPI and Apache Airflow. It involves spinning up an EC2 instance on AWS, installing the required packages, deploying Python scripts, and scheduling the workflow with Airflow. The article also shows how to trigger the DAG (Directed Acyclic Graph) and confirm that the data is stored in an S3 bucket.