What is Apache Airflow and How to Install It on Ubuntu

Ertan Çelik
4 min read · Jul 17, 2023


Introduction

Apache Airflow is an open-source platform for scheduling and monitoring workflows. The project originated at Airbnb and was open-sourced in 2015; it later became a top-level Apache Software Foundation project. Being written in Python and having a simple web interface are among its important features, and it provides great convenience in organizing the workflows of Python projects.

Airflow is an orchestration tool that ensures tasks are run at the right time, in the right order, and in the right way.

As many tasks as desired can be easily run in parallel.

It has a useful UI: errors in data pipelines, and where they occur, can be easily observed, and failed tasks can be restarted.

The DAG (Directed Acyclic Graph) is the core concept of Airflow. It groups tasks together with their dependencies and relationships and describes how they should run.
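
As a minimal sketch (the DAG and task names below are just placeholders, not from the original article), a DAG with three tasks, two of which run in parallel after the first, could look roughly like this:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_pipeline",          # hypothetical DAG name
    start_date=datetime(2023, 7, 1),
    schedule="@daily",                  # run once a day (Airflow 2.4+)
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    # extract runs first; transform and load then run in parallel
    extract >> [transform, load]

Saved as a .py file in the dags folder we will create below, such a DAG appears in the UI automatically.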

If you need to process data every second (streaming), Spark or Flink would be a better solution than Airflow.

If terabytes of data are being processed, it is recommended to trigger the Spark job from Airflow with an operator rather than processing the data inside Airflow itself.
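
A hedged sketch of what that looks like: the SparkSubmitOperator from the Spark provider package (installed separately as apache-airflow-providers-apache-spark) submits the job to a Spark cluster, so Airflow only orchestrates while Spark does the heavy lifting. The application path and connection id below are assumptions:

from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# defined inside a DAG, e.g. within a "with DAG(...)" block like the one above
spark_etl = SparkSubmitOperator(
    task_id="spark_etl",
    application="/opt/jobs/etl_job.py",   # hypothetical path to the Spark job
    conn_id="spark_default",              # Spark connection configured in Airflow
)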

Although Airflow can be installed with Docker, Kubernetes, or other methods, in this article we will install it standalone in the simplest way.

In Airflow, connection details are kept in the metadata database, so in our code we only reference a conn_id and avoid duplicating credentials.
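
For example (a sketch assuming the apache-airflow-providers-postgres package is installed and a connection named my_postgres has been defined under Admin > Connections in the UI), a task only needs the conn_id to reach the database:

from airflow.providers.postgres.hooks.postgres import PostgresHook

# host, user and password live in Airflow's metadata database, not in the code
hook = PostgresHook(postgres_conn_id="my_postgres")
records = hook.get_records("SELECT 1")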

Setting up Apache Airflow on Ubuntu

Apache Airflow requires at least Python 3.6 on the machine. I did the installation on an Ubuntu machine with Python 3.7.

We create our Airflow environment on the machine with the commands below. If the machine's internet access is restricted by a proxy, we set the proxy first, as sketched next, and then continue with the commands.
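
For example (the address below is only a placeholder for your own proxy server):

export http_proxy=http://proxy.example.com:3128
export https_proxy=http://proxy.example.com:3128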

sudo apt install python3-pip

sudo pip3 install virtualenv

virtualenv airflow

source airflow/bin/activate

We install Airflow and the necessary libraries.

pip3 install apache-airflow[gcp,sentry,statsd]

pip3 install pyspark

pip3 install scikit-learn
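
To confirm that the installation worked, we can check the installed version:

airflow version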

We initialize the Airflow database.

cd airflow

airflow db init

Now we create the dags folder. All of our future DAGs will live in this folder.

mkdir dags

The files and directories in the airflow folder will look roughly like the listing below; our airflow.cfg configuration file is also in this directory.
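
A rough sketch of the contents at this point (the exact files can vary slightly between Airflow versions):

airflow.cfg        # main configuration file
airflow.db         # SQLite metadata database created by airflow db init
dags/              # the folder we just created for our DAG files
logs/              # scheduler and task logs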

Now let’s create a user.

I created a user with the admin role.

airflow users create --username admin --password admin --firstname admin --lastname admin --role Admin --email admin@domain.com

Let’s run Airflow Scheduler and Airflow Webserver.

airflow scheduler

We open a new terminal and run the webserver as well.

cd airflow

airflow webserver

We make sure that gunicorn (the worker process behind the webserver) is running.
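
One quick way to check, from another terminal, is to call the webserver's health endpoint; the response should report the metadatabase and scheduler as healthy:

curl http://localhost:8080/health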

Our UI is now up, and we can go to the link below.

· Once the scheduler and webserver are initialized, open any browser and go to http://localhost:8080/.

· Port 8080 is the default port for Airflow, and you should see the login page.

We log in with the user we created above (username ”admin”, password ”admin”).

Now that we are inside the application, I will not go into a detailed explanation of the interface; the documentation link below can be followed instead.

https://airflow.apache.org/docs/apache-airflow/2.5.1/

In short: DAGs are the workflows we will run; under Security we create users and roles; under Browse we can see task instances, resets, triggers, etc.; and under Admin we manage variables, configuration, and connections.

Conclusion

That’s it. I hope you liked this introduction to Apache Airflow and learned something valuable.

What’s Next?

In upcoming articles, I will walk through a few Airflow examples.
