{"id":8144,"date":"2023-11-09T08:29:05","date_gmt":"2023-11-09T16:29:05","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=8144"},"modified":"2025-04-24T17:04:32","modified_gmt":"2025-04-24T17:04:32","slug":"supercharging-your-data-pipeline-with-apache-airflow-part-2","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2\/","title":{"rendered":"Supercharging Your Data Pipeline with Apache Airflow (Part 2)"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2\">\n\n\n\n<figure class=\"wp-block-image lv lw lx ly lz ma mb mc paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/1*xc3ZcBjXdjJdg4hkSL6-tQ.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Image Source \u2014 <a class=\"af ml\" href=\"https:\/\/www.pixelproductionsinc.com\/wp-content\/uploads\/2021\/12\/Supercharge-Your-Content-Pipeline.jpg\" target=\"_blank\" rel=\"noopener ugc nofollow\">Pixel Production Inc<\/a><\/figcaption><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"b203\">In the previous article, you were introduced to the intricacies of data pipelines, including the two major types of existing data pipelines. You also learned how to build an Extract Transform Load (ETL) pipeline and discovered the automation capabilities of Apache Airflow for ETL pipelines. 
In this second part of the series, you will delve into the core components of Apache Airflow and learn how to build your very first pipeline with Airflow.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"fe53\">A quick recap from the previous article: Apache Airflow is an open-source workflow management platform for managing data pipelines. Its strengths include the ease of building data pipelines, setting up and orchestrating complex data workflows at zero cost, and integrating data pipelines with modern cloud providers. You might be curious how a simple tool like Apache Airflow can be powerful enough to manage complex data pipelines.<\/p>\n\n\n\n<h2 class=\"wp-block-heading nj nk fr be nl nm nn no np nq nr ns nt mw nu nv nw na nx ny nz ne oa ob oc od bj\" id=\"bf7b\"><strong class=\"al\">Diving Deep into the Inner Workings of Apache Airflow<\/strong><\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo oe mq mr ms of mu mv mw og my mz na oh nc nd ne oi ng nh ni fk bj\" id=\"cfab\">The primary concept behind Airflow is the Directed Acyclic Graph (DAG). A Directed Acyclic Graph is a graph structure whose connections flow in one direction and never form a loop, i.e., the last node in the graph is not connected back to the first node. 
The image below shows an example of a DAG; the graph is directed, information flows from A throughout the graph, and it is acyclic since the information from A doesn&#8217;t get back to A.<\/p>\n\n\n\n<figure class=\"wp-block-image ok ol om on oo ma mb mc paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:659\/1*jVg7qUfI-HPqOeIAMiQ50Q.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">A Directed Acyclic Graph (Image by Author)<\/figcaption><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"4d81\">The opposite of a DAG is a Directed Cyclic Graph (DCG), in which the movement (connection) between nodes can be bidirectional. This type of graph creates a loop between one or more nodes. The image below shows an example of a directed cyclic graph: node A is connected to B, and node B is connected back to A. In terms of data movement, B depends on the data from A, yet A in turn depends on the data from B.<\/p>\n\n\n\n<figure class=\"wp-block-image ok ol om on oo ma mb mc paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*MHneiIf9LotOcmbTRYiFkA.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">A Directed Cyclic Graph (Image by Author)<\/figcaption><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"b3f3\">Now that you have learned about DAGs and DCGs, you might wonder why the DAG is important to Airflow. To understand this, imagine you have a pipeline that extracts weather information from an API, cleans the weather information, and loads it into a database. Now imagine this pipeline as a DCG, as shown in the image below: the clean data task depends on the extract weather data task, while the extract weather data task in turn depends on the clean data task. 
This creates an endless loop: the extract weather data task can&#8217;t start until it receives input from the clean data task, but the clean data task also needs the extract weather data task to finish running before it can begin.<\/p>\n\n\n\n<figure class=\"wp-block-image ok ol om on oo ma mb mc paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*ZOTGm1Xy_C9XrnIPZPSyOw.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Weather Pipeline as a Directed Cyclic Graph (DCG)<\/figcaption><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"f20a\">So, how does a DAG solve this problem? It does so by ensuring that the clean data node doesn&#8217;t communicate back to the extract weather data node. In the image below, you see that the clean data task will only run once the extract weather data task is done running, and the process continues till the end of the pipeline. Using a DAG eliminates the loop that the DCG created.<\/p>\n\n\n\n<figure class=\"wp-block-image ok ol om on oo ma mb mc paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*IyGBnoQ_iUjCo2v-NGtLPA.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Weather Pipeline as a Directed Acyclic Graph (DAG)<\/figcaption><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"4486\">Now that you understand the core concept that makes Apache Airflow powerful, the next step is to learn how Airflow manages these processes. How does Airflow know that the extract weather data task has finished executing, so that it can trigger the next step, cleaning the data? 
Well, that will be discussed below.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"74a6\">Airflow has four major components, which are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The Scheduler<\/li>\n\n\n\n<li>The Worker<\/li>\n\n\n\n<li>A Database<\/li>\n\n\n\n<li>A web server<\/li>\n<\/ul>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"2106\">The four major components work in sync to manage data pipelines in Apache Airflow.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"62c2\"><strong class=\"be ph\">The Scheduler and Worker<\/strong><\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"1f0f\">To understand the scheduler, you first need to grasp how Airflow views DAGs. A DAG in Airflow comprises different tasks chained in an acyclic manner. The weather pipeline DAG includes the extract weather task, the clean data task, and the load data to Postgres task.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"1882\">DAGs in Airflow are defined with two major parameters: the scheduled date and the schedule interval. The date on which the DAG is expected to be executed is the scheduled date, and the interval at which the DAG will be executed is the schedule interval, which can be hourly, daily, monthly, etc. Once the DAG has been created, Airflow sends it to the scheduling queue. 
The scheduler keeps track of the scheduled date and interval and triggers the execution of the DAG once the scheduled date has passed.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"72e9\">Technically, the scheduler triggers a DAG run at the end of its schedule interval, i.e., if the DAG&#8217;s execution date is 01\u201301\u20132023 00:00:00 with a daily interval, the scheduler will trigger the run on 02\u201301\u20132023 00:00:00. Once the scheduler triggers the DAG execution, the DAG is sent to the worker for execution. The worker will complete the first task in the DAG and communicate the result to the scheduler. If the result of the execution is a success, the scheduler will trigger the next task in the DAG, since the second task depends on the execution of the first. If the result of the first task execution is a failure, Airflow won&#8217;t run the following task, since the task it depends on has failed.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"0eb8\"><strong class=\"be ph\">Database<\/strong><\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"2dee\">How does the scheduler keep track of the DAG and task execution? Well, that is where the database comes in. It acts as a storage system for information such as the scheduled date, the schedule interval, the result from the worker, the status of the DAG, etc. 
The scheduler gets this information from the database and acts on it.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"5c64\"><strong class=\"be ph\">Web Server<\/strong><\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"12de\">The web server acts as a graphical user interface for viewing information about the DAG, such as the status of the DAG and the result from each task of a DAG.<\/p>\n\n\n\n<figure class=\"wp-block-image ok ol om on oo ma mb mc paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*bkRmWf6bIkOgIID89FqGCQ.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Overview of Airflow Architecture<a class=\"af ml\" href=\"https:\/\/biconsult.ru\/files\/Data_warehouse\/Bas_P_Harenslak%2C_Julian_Rutger_de_Ruiter_Data_Pipelines_with_Apache.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\"> (Image from the Data Pipelines with Apache Airflow Book)<\/a><\/figcaption><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"00fe\">Given that you now understand the core concept behind Airflow and the components that make it up, the next step is a practical hands-on session.<\/p>\n\n\n\n<h2 class=\"wp-block-heading nj nk fr be nl nm nn no np nq nr ns nt mw nu nv nw na nx ny nz ne oa ob oc od bj\" id=\"d911\"><strong class=\"al\">Getting Started with Apache Airflow (Practical Session)<\/strong><\/h2>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo oe mq mr ms of mu mv mw og my mz na oh nc nd ne oi ng nh ni fk bj\" id=\"7741\">To get started with Apache Airflow, you first need to install it. 
There are two main methods of installing Airflow:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Installing on your local laptop via PyPi<\/li>\n\n\n\n<li>Installing with Docker and Docker Compose<\/li>\n<\/ul>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"00b9\"><strong class=\"be ph\">Installing Airflow with Docker and Docker Compose<\/strong><\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"1efb\">This tutorial installs Airflow with docker, because the installation process is more straightforward with docker and you can easily roll back to the default state without issues.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"40c5\">You can learn more with this link if you need to familiarize yourself with docker and docker-compose. If you are familiar with both but don&#8217;t have docker or docker-compose installed on your system, you can check out this <a class=\"af ml\" href=\"https:\/\/www.digitalocean.com\/community\/tutorials\/how-to-install-and-use-docker-compose-on-ubuntu-20-04\" target=\"_blank\" rel=\"noopener ugc nofollow\">link<\/a> for installing both on Ubuntu. Windows and Mac have docker and docker-compose packaged into one application, so if you download docker on Windows or Mac, you have both docker and docker-compose. 
To install docker on Windows, check out this <a class=\"af ml\" href=\"https:\/\/www.youtube.com\/watch?v=cMyoSkQZ41E\" target=\"_blank\" rel=\"noopener ugc nofollow\">link<\/a>, and use this <a class=\"af ml\" href=\"https:\/\/youtu.be\/SGmFGYCuJK4\" target=\"_blank\" rel=\"noopener ugc nofollow\">link<\/a> if you have a MacBook.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"aaf3\">Once you have docker and docker-compose installed on your system, the next step is to create a directory (folder) on your system. You can name it <code class=\"cw pj pk pl pm b\">airflow_tutorial<\/code> for keeping the files for this tutorial. Change your directory into the airflow_tutorial folder and open the terminal on your system. The <code class=\"cw pj pk pl pm b\">docker-compose.yaml<\/code> file that will be used is the official file from Apache Airflow. To download it, type this in your terminal <code class=\"cw pj pk pl pm b\">curl -LFO 'https:\/\/airflow.apache.org\/docs\/apache-airflow\/2.6.1\/docker-compose.yaml'<\/code> and press enter. If you are on Windows\/Mac, you might need to execute this in a Gitbash shell to avoid issues with <code class=\"cw pj pk pl pm b\">curl<\/code>.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"d992\"><strong class=\"be ph\">Modifying the Content of the docker-compose file<\/strong><\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"eb1a\">The docker-compose file from Apache Airflow is designed for production. For this tutorial, that is not needed, so some configurations will be deleted or modified. The first thing to be changed is the type of Executor that Airflow will use. You might be wondering what an executor is. 
The discussion about executors was skipped above to avoid information overload while discussing the core components of Airflow.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"c8f2\">Executors in Airflow determine how tasks are run once the scheduler schedules them. Depending on your local system and your pipeline&#8217;s configuration, tasks can run in sequence or, if you have a high-end computer, in parallel. There are different types of Executors in Apache Airflow:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SequentialExecutor \u2014 The sequential executor runs a single task at a time. It runs inside the Scheduler and is the default for Apache Airflow. It executes tasks on the same machine the scheduler runs on, and if a task fails, execution stops. Although it is the easiest to run, it is not ideal for production and is better suited for routine testing or learning.<\/li>\n\n\n\n<li>LocalExecutor \u2014 The local executor is similar to the SequentialExecutor in that the scheduler, executor, and worker all run on the same machine. However, the significant difference is that the local executor allows multiple tasks to run simultaneously.<\/li>\n\n\n\n<li>There are other types of executors, such as the Celery Executor, Kubernetes Executor, Dask Executor, etc. 
These executors decouple the worker machines from the executor machine, so the workers can still process the DAG if the executor machine fails.<\/li>\n<\/ul>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"91a2\">Another question on your mind \u2014 when should you use the SequentialExecutor or other executors?<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"cae3\">Use the sequential executor when tasks must run one at a time, in sequence, as shown in the image below. There, you can see that the <code class=\"cw pj pk pl pm b\">clean the data<\/code> task needs to run after the <code class=\"cw pj pk pl pm b\">Extract the Weather Data<\/code> task. For this type of execution, you should use a SequentialExecutor.<\/p>\n\n\n\n<figure class=\"wp-block-image ok ol om on oo ma mb mc paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*D-RU2kZKXv-IsBf1reHZmQ.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Sequential Execution (Image by Author)<\/figcaption><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"d2b5\">Use the other types of executors if you need to run tasks in parallel. For example, you might need to extract the weather data and metadata information about the location, after which you will combine both for transformation. This type of execution is shown below. 
In the image, you can see that the <code class=\"cw pj pk pl pm b\">extract the weather data<\/code> and <code class=\"cw pj pk pl pm b\">extract metadata information about the location<\/code> tasks need to run in parallel.<\/p>\n\n\n\n<figure class=\"wp-block-image ok ol om on oo ma mb mc paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*WHVf4CdPkFzVtBqJ_qEN_g.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Non-Sequential Execution Mode (Image by Author)<\/figcaption><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"4892\">The next thing that will be modified is the <code class=\"cw pj pk pl pm b\">Apache Airflow<\/code> image. This is necessary because additional Python modules need to be installed. The Airflow image doesn&#8217;t have the Open Weather SDK, pandas, psycopg2, and sqlalchemy required in the pipeline. So, the image has to be extended by including the necessary libraries and building a new image. 
The Dockerfile below and the requirements.txt file will be used to build the image.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># using the official docker image\nFROM apache\/airflow\n# setting the airflow home directory\nENV AIRFLOW_HOME=\/opt\/airflow\n\n# changing user to root for installation of linux packages on the container\nUSER root\n\n# installing git (for pulling the weather API python SDK)\n# build-essential and libpq-dev are for the psycopg2 binary\nRUN apt-get update &amp;&amp; apt-get install -y \\\n    git \\\n    build-essential \\\n    libpq-dev\n\n# create a working directory\nWORKDIR \/app\n# copy the requirements.txt file that contains the python packages into the working directory\nCOPY requirements.txt \/app\/requirements.txt\n\n# change the user back to airflow, before installation with pip\nUSER airflow\n\nRUN pip install --no-cache-dir --user -r \/app\/requirements.txt\n\nEXPOSE 8080<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">pandas\ngit+https:\/\/github.com\/weatherapicom\/python\npsycopg2-binary\nsqlalchemy<\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\"><span style=\"font-family: var(--wpex-body-font-family, var(--wpex-font-sans)); font-size: var(--wpex-body-font-size, 13px);\">The Dockerfile and the requirements.txt can be accessed with this <\/span><a class=\"af ml\" style=\"font-family: var(--wpex-body-font-family, var(--wpex-font-sans)); font-size: var(--wpex-body-font-size, 13px);\" href=\"https:\/\/github.com\/Idowuilekura\/apache_airflow_medium_tut\/tree\/master\/docker_related_folder\" target=\"_blank\" rel=\"noopener ugc nofollow\">link<\/a><span style=\"font-family: var(--wpex-body-font-family, var(--wpex-font-sans)); font-size: var(--wpex-body-font-size, 13px);\">, or you can copy and paste the above text into a <\/span><code class=\"cw pj pk pl pm b\" style=\"font-size: var(--wpex-body-font-size, 13px);\">Dockerfile<\/code><span 
style=\"font-family: var(--wpex-body-font-family, var(--wpex-font-sans)); font-size: var(--wpex-body-font-size, 13px);\"> and <\/span><code class=\"cw pj pk pl pm b\" style=\"font-size: var(--wpex-body-font-size, 13px);\">requirements.txt<\/code><span style=\"font-family: var(--wpex-body-font-family, var(--wpex-font-sans)); font-size: var(--wpex-body-font-size, 13px);\"> file in your directory.<\/span><\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"22c0\"><strong class=\"be ph\">Building the Extended Apache Airflow Image<\/strong><\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"9c16\">Once you have the dockerfile and the requirements.txt file, change your directory into the folder and type this in your terminal <code class=\"cw pj pk pl pm b\">docker build -t extending_airflow_with_pip:latest .<\/code> to build the image. 
You can choose to tag the image with any name of your choice.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"9bda\">Once you finish the image build, you are ready to modify the docker-compose file.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"4ec7\"><code class=\"cw pj pk pl pm b\">image: ${AIRFLOW_IMAGE_NAME:-apache\/airflow:2.6.1}<\/code> should be changed to <code class=\"cw pj pk pl pm b\">image: ${AIRFLOW_IMAGE_NAME:-extending_airflow_with_pip:latest}<\/code>, assuming you used <code class=\"cw pj pk pl pm b\">extending_airflow_with_pip:latest<\/code> as the tag for the docker build.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"2dd8\"><code class=\"cw pj pk pl pm b\">AIRFLOW__CORE__EXECUTOR: CeleryExecutor<\/code> should be changed to <code class=\"cw pj pk pl pm b\">AIRFLOW__CORE__EXECUTOR: LocalExecutor<\/code>. You can also change it to <code class=\"cw pj pk pl pm b\">SequentialExecutor<\/code> if you wish to use it.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"73f7\">Since you aren&#8217;t using a CeleryExecutor, you must delete the Celery worker and the Celery Flower lines. Celery Flower is used for managing the Celery cluster, which is not needed for a local executor. Go to the docker-compose file, delete the configurations below from the file, and save it. 
If you are unsure which lines to delete, you can access the modified docker-compose file at this <a class=\"af ml\" href=\"https:\/\/github.com\/Idowuilekura\/apache_airflow_medium_tut\/blob\/master\/docker_related_folder\/docker-compose.yaml\" target=\"_blank\" rel=\"noopener ugc nofollow\">link<\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> # line 104 to 114\n redis:\n    image: redis:latest\n    expose:\n      - 6379\n    healthcheck:\n      test: [\"CMD\", \"redis-cli\", \"ping\"]\n      interval: 10s\n      timeout: 30s\n      retries: 50\n      start_period: 30s\n    restart: always<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\"># delete the airflow worker\nairflow-worker:\n    &lt;&lt;: *airflow-common\n    command: celery worker\n    healthcheck:\n      test:\n        - \"CMD-SHELL\"\n        - 'celery --app airflow.executors.celery_executor.app inspect ping -d \"celery@${HOSTNAME}\"'\n      interval: 30s\n      timeout: 10s\n      retries: 5\n      start_period: 30s\n    environment:\n      &lt;&lt;: *airflow-common-env\n      # Required to handle warm shutdown of the celery workers properly\n      # See https:\/\/airflow.apache.org\/docs\/docker-stack\/entrypoint.html#signal-propagation\n      DUMB_INIT_SETSID: \"0\"\n    restart: always\n    depends_on:\n      &lt;&lt;: *airflow-common-depends-on\n      airflow-init:\n        condition: service_completed_successfully<\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"cc44\"><strong class=\"be ph\">Starting Apache Airflow<\/strong><\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"4087\">Inside your working directory, create four sub-folders: dags for storing your DAGs, logs for storing the logs from task and scheduler execution, config for storing configurations, and plugins for storing 
custom plugins.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"7bd6\">If you are on Linux, you need to ensure that the dags, logs, and plugins folders are not owned by <code class=\"cw pj pk pl pm b\">root<\/code>, but by Airflow. To ensure Airflow owns the folders, type this in the terminal: <code class=\"cw pj pk pl pm b\">echo -e \"AIRFLOW_UID=$(id -u)\" &gt; .env<\/code>. This is optional on Windows and Mac, but you can choose to do it to suppress the warning from Airflow.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"2cea\">If you are on Windows\/Mac, you must allocate memory to docker-desktop to prevent docker from taking all the memory on your system. A rule of thumb is to allocate about 75% of your RAM; if you have 8GB of RAM, you can allocate 6GB. You can check out these links to learn how to allocate memory to docker on <a class=\"af ml\" href=\"https:\/\/stackoverflow.com\/a\/62773629\" target=\"_blank\" rel=\"noopener ugc nofollow\">Windows<\/a> or <a class=\"af ml\" href=\"https:\/\/stackoverflow.com\/a\/39720010\" target=\"_blank\" rel=\"noopener ugc nofollow\">Mac<\/a>.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"401d\">Type <code class=\"cw pj pk pl pm b\">docker compose up airflow-init<\/code> in your terminal to initialize the database and create the Airflow account. Once you are done, type the <code class=\"cw pj pk pl pm b\">docker compose up<\/code> command to start the Airflow services. To view the web server, type <code class=\"cw pj pk pl pm b\">localhost:8080<\/code> in your browser and press enter. Once you do, you will see an interface similar to the one below. 
Type in your username, which is <code class=\"cw pj pk pl pm b\">airflow<\/code>, your password, which is <code class=\"cw pj pk pl pm b\">airflow<\/code>, and click on the Sign In button.<\/p>\n\n\n\n<figure class=\"wp-block-image ok ol om on oo ma mb mc paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*5WkzQvcyCK-JSvsZURdR2g.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Apache Airflow Webserver Login Interface (Image by Author)<\/figcaption><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"ab9c\">Once you have successfully logged in, you will be presented with an interface that showcases a collection of preloaded example DAGs. As depicted in the image, these DAGs serve as practical illustrations and can be a starting point for your own workflows.<\/p>\n\n\n\n<figure class=\"wp-block-image ok ol om on oo ma mb mc paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*asvyE-UGc8fZEUkl4b5Zdw.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Apache Airflow Webserver Interface (Image by Author)<\/figcaption><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"9e7e\"><strong class=\"be ph\">Working with Example DAGS<\/strong><\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"1b52\">If you click the example_bash_operator DAG, you will see a page similar to the image below, which shows information about the DAG. 
You can check out this link to learn more about what is shown in the image below.<\/p>\n\n\n\n<figure class=\"wp-block-image ok ol om on oo ma mb mc paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*IlrUJBQmsHX2ONStxV3c3A.png\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"3164\">To view the structure of the DAG, you can click on the <code class=\"cw pj pk pl pm b\">Graph<\/code> button, which will show an image similar to what is shown below. The image below shows the design of the DAG, the logic, and the dependencies between its tasks.<\/p>\n\n\n\n<figure class=\"wp-block-image ok ol om on oo ma mb mc paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*oWTEHryf3sOc21LKmnTKnw.png\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"5482\">To trigger the DAG, click on the <code class=\"cw pj pk pl pm b\">DAG Trigger Button<\/code>, and click the <code class=\"cw pj pk pl pm b\">Trigger DAG<\/code> option. Once you click the <code class=\"cw pj pk pl pm b\">Trigger DAG<\/code> option, you will see an image similar to what is shown below, which displays information about the DAG run. 
The dark green under the Task Run Information shows success, while the red is for failed tasks.<\/p>\n\n\n\n<figure class=\"wp-block-image ok ol om on oo ma mb mc paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*iv_7e0HE2K71gPGZ1QG79A.png\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"cc90\"><strong class=\"be ph\">Writing your First Apache Airflow DAG<\/strong><\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"143e\">Before you write your first DAG, you need to understand what is required for writing a DAG in Airflow:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Operators \u2014 Operators define the work a task performs, just as you need a pot to cook a meal. If one of your tasks needs to perform a bash operation, use the BashOperator in Apache Airflow. Similarly, if you need to run a Python function, you will need the PythonOperator.<\/li>\n\n\n\n<li>DAG object \u2014 This is needed to instantiate the DAG class.<\/li>\n<\/ul>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"43b8\">The logic for your first DAG is this: you will write a DAG that ingests a CSV file using pandas, saves the file to your local Airflow directory, and cleans up the directory afterward. 
If you think you will need the Python and Bash operators to write this DAG and define the tasks, then you are correct.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"86d1\">The first step in writing the DAG is to import the operators needed and the libraries used.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># importing the DAG class\nfrom airflow import DAG\nfrom airflow.operators.python import PythonOperator\nfrom airflow.operators.bash import BashOperator\nimport os\nimport pandas as pd\nfrom datetime import datetime, timedelta<\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"7704\">The second step is to customize the DAG with information you know, such as the start date, the schedule interval, and so on. To define a DAG, you need the default_args dictionary. It contains information such as the owner of the DAG, the number of retries in case any of the tasks fail, and the time to wait before retrying a failed task, which is given by the retry_delay argument.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"56d9\">You might be wondering why you need a separate default_args dictionary: these arguments can be reused across different DAGs, which helps save time. Other, DAG-specific arguments are defined inside the DAG object, as shown below. The DAG object needs an ID to identify the DAG, a description, the start_date to schedule the DAG, the schedule interval, the default args, and the end_date. 
The end_date argument is optional, but if you don&#8217;t specify it, Airflow will keep scheduling your DAG indefinitely.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"625d\">By default, Airflow will backfill a DAG, running it once for every schedule interval between the start_date and the current date. The parameter that controls this behavior is catchup. If your start_date is in 2021, Airflow will schedule runs for every interval from that date onward. To turn this off, you must set the catchup argument to False.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">default_args = {\n    'owner': 'idowu',\n    'retries': 1,\n    'retry_delay': timedelta(minutes=2)\n}\n\nfirst_dag = DAG(dag_id='first_medium_dag',\n                description = 'A simple DAG to ingest data with Pandas, save it locally and clean up the directory',\n                start_date = datetime(2023, 6, 19),\n                schedule_interval = '@once',\n                default_args = default_args,\n                end_date = datetime(2023, 6, 20),\n                catchup = False)<\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"7e6b\">The next step is to define the variables used and write a Python function for downloading the CSV file, reading it with pandas, and saving it to the Airflow home directory. 
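Before wiring the function into Airflow, it can help to sanity-check the pandas logic on its own. The sketch below is a minimal stand-alone version, using an in-memory CSV and a temp-directory path as stand-ins for the real download URL and the Airflow home directory:

```python
import io
import os
import tempfile

import pandas as pd

# In-memory CSV standing in for the real download URL;
# pd.read_csv accepts URLs and file-like objects alike.
sample_csv = io.StringIO("name,score\nada,90\ngrace,95\n")

def download_file_save_local(dataset_link, output_location_file_name):
    # Read the source and persist it locally, mirroring the task body.
    data_df = pd.read_csv(dataset_link)
    data_df.to_csv(output_location_file_name, index=False)
    return data_df

output_path = os.path.join(tempfile.gettempdir(), 'sample_file.csv')
df = download_file_save_local(sample_csv, output_path)
print(df.shape)  # (2, 2)
```

Once this behaves as expected, swapping the in-memory CSV for the dataset URL gives the function used in the DAG.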
The dataset that will be used is from <a class=\"af ml\" href=\"https:\/\/sample-videos.com\/download-sample-csv.php#google_vignette\" target=\"_blank\" rel=\"noopener ugc nofollow\">Sample Videos<\/a>, a website that provides free CSV files for testing.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">airflow_home = os.getenv('AIRFLOW_HOME')\ndataset_link = 'https:\/\/sample-videos.com\/csv\/Sample-Spreadsheet-100-rows.csv'\noutput_file_name = 'sample_file.csv'\noutput_location_file_name = airflow_home + '\/' + output_file_name\n\ndef download_file_save_local(dataset_link: str, output_location_file_name: str):\n    data_df = pd.read_csv(dataset_link, encoding='latin-1')\n\n    data_df.to_csv(output_location_file_name, index=False)<\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"658e\">The next step is to define the task objects for downloading the file, saving it locally, and cleaning up. The two operators that you need are the Python and Bash operators. Both operators require a task ID and the DAG that each task is tied to.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"433f\">The Python operator requires specific parameters, such as the Python function to be called (the python_callable) and the arguments to pass to that function, defined by the op_kwargs argument. 
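To make op_kwargs concrete: when the task executes, the PythonOperator effectively unpacks the op_kwargs dictionary as keyword arguments into the python_callable. A rough pure-Python sketch of that mechanism (not Airflow's actual code; the URL and path below are hypothetical placeholders):

```python
# Stand-in for the real task body: it only reports what it received.
def download_file_save_local(dataset_link: str, output_location_file_name: str) -> str:
    return f"would read {dataset_link} and write {output_location_file_name}"

# The dictionary you would hand to PythonOperator via op_kwargs.
op_kwargs = {
    'dataset_link': 'https://example.com/data.csv',       # hypothetical URL
    'output_location_file_name': '/tmp/sample_file.csv',  # hypothetical path
}

# Conceptually, this is what the operator does when the task runs:
result = download_file_save_local(**op_kwargs)
print(result)
```

The keys of op_kwargs must match the parameter names of the callable, which is why the dictionary keys mirror the function signature.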
The Bash operator requires the bash_command argument, which tells it what bash command to run.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">download_file_save_local_task = PythonOperator(task_id = 'download_file_save_local_task',\n                                               python_callable=download_file_save_local,\n                                               op_kwargs = {'dataset_link':dataset_link,'output_location_file_name':output_location_file_name},\n                                               dag=first_dag)\n\nclean_directory_task = BashOperator(task_id='clean_directory_task',\n                                    bash_command =f'rm {output_location_file_name}', dag=first_dag)<\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"3a90\">The final step is to define the dependencies between the tasks. The logic is to call the task that downloads the file first, after which the clean_directory_task is called. There are a few ways to set dependencies in Airflow:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Using <code class=\"cw pj pk pl pm b\">set_upstream<\/code> to define an upstream dependency. In this scenario, the download_file_save_local_task is upstream of the clean_directory_task. An example of defining this is <code class=\"cw pj pk pl pm b\">clean_directory_task.set_upstream(download_file_save_local_task)<\/code>.<\/li>\n\n\n\n<li>Another option is to use <code class=\"cw pj pk pl pm b\">set_downstream<\/code> to define a downstream dependency. The clean_directory_task is a downstream task of the download_file_save_local_task since it will be called after the download_file_save_local_task.<\/li>\n\n\n\n<li>For setting dependency upstream, you can use the <code class=\"cw pj pk pl pm b\">&lt;&lt;<\/code> bitwise operator. 
For setting downstream, you need to use the <code class=\"cw pj pk pl pm b\">>><\/code> bitwise operator.<\/li>\n<\/ul>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"6cdd\">To set the dependency between the two tasks, you can use the code below. This tells Airflow that the clean_directory_task should run only after the download_file_save_local_task runs.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">download_file_save_local_task &gt;&gt; clean_directory_task<\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"5808\">The complete code is shown below. You can copy and paste it into a file of your choice. The file should be located inside the DAG directory that you created earlier.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># importing the DAG class\nfrom airflow import DAG\nfrom airflow.operators.python import PythonOperator\nfrom airflow.operators.bash import BashOperator\nimport os\nimport pandas as pd\nfrom datetime import datetime, timedelta\n\n\n\ndefault_args = {\n    'owner': 'idowu',\n    'retries': 1,\n    'retry_delay': timedelta(minutes=2)\n}\n\nfirst_dag = DAG(dag_id='first_medium_dag',\n                description = 'A simple DAG to ingest data with Pandas, save it locally and clean up the directory',\n                # change the start_date to your preferred date\n                start_date = datetime(2023, 6, 19),\n                schedule_interval = '@once',\n                default_args = default_args,\n                # change the end_date to your preferred date\n                end_date = datetime(2023, 6, 20),\n                catchup = False\n\n)\n\n\nairflow_home = os.getenv('AIRFLOW_HOME')\ndataset_link = 'https:\/\/sample-videos.com\/csv\/Sample-Spreadsheet-100-rows.csv'\noutput_file_name = 'sample_file.csv'\noutput_location_file_name = airflow_home + '\/' + output_file_name\n\ndef download_file_save_local(dataset_link: str, output_location_file_name: str):\n    data_df = pd.read_csv(dataset_link, encoding='latin-1')\n\n    data_df.to_csv(output_location_file_name, index=False)\n\n    print(data_df.head())\n\n\ndownload_file_save_local_task = PythonOperator(task_id = 'download_file_save_local_task',\n                                               python_callable=download_file_save_local,\n                                               op_kwargs = {'dataset_link':dataset_link,'output_location_file_name':output_location_file_name},\n                                               dag=first_dag)\n\nclean_directory_task = BashOperator(task_id='clean_directory_task',\n                                    bash_command =f'rm {output_location_file_name}', dag=first_dag)\n\n\ndownload_file_save_local_task &gt;&gt; clean_directory_task<\/pre>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"a641\"><strong class=\"be ph\">Viewing your first DAG<\/strong><\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"b822\">To view your DAG, go to the web server and search for the DAG named <code class=\"cw pj pk pl pm b\">first_medium_dag<\/code>, the value that was passed as the dag_id. 
Once you find the DAG, click on it to see something similar to what is shown below.<\/p>\n\n\n\n<figure class=\"wp-block-image ok ol om on oo ma mb mc paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*-meKkmi4exW9coQB_wIvYQ.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">First Medium Dag Viewer (Image by Author)<\/figcaption><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"d392\">Click on the <code class=\"cw pj pk pl pm b\">Trigger DAG<\/code> button and you will see an image similar to the one below; the colors show that the two tasks ran successfully.<\/p>\n\n\n\n<figure class=\"wp-block-image ok ol om on oo ma mb mc paragraph-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*ZP1Vef1YMMHUJxUuwluqyg.png\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"63d7\">You can check a task&#8217;s logs by clicking on the task and opening its log. You can also explore the interface to see the results of your first DAG run.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"529f\">You now understand the nitty-gritty of Apache Airflow: its internals, how to trigger a DAG, and how to write a DAG from scratch and run it. 
You have come to the end of the second article in this series.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"8eb3\"><strong class=\"be ph\">Conclusion<\/strong><\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"dc9c\">The second article in this series has provided you with a comprehensive understanding of the inner workings of Airflow and the key components that drive its functionality. You have learned how to trigger a DAG in Airflow, create a DAG from scratch, and initiate its execution. In the upcoming part of this series, we will delve into advanced concepts of Airflow, including backfilling techniques and building an ETL pipeline in Airflow for data ingestion into Postgres and Google Cloud BigQuery.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"0bc5\">You can connect with me on <a class=\"af ml\" href=\"https:\/\/www.linkedin.com\/in\/ilekuraidowu\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">LinkedIn<\/a> or <a class=\"af ml\" href=\"https:\/\/twitter.com\/idowuilekura\" target=\"_blank\" rel=\"noopener ugc nofollow\">Twitter<\/a> to continue the conversation or drop any query in the comment box.<\/p>\n\n\n\n<p class=\"pw-post-body-paragraph mm mn fr be b mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh ni fk bj\" id=\"4aa8\"><strong class=\"be ph\">References \/ Further Resources<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Astronomer Documentation on Apache Airflow \u2014 <a class=\"af ml\" href=\"https:\/\/docs.astronomer.io\/learn\/category\/airflow-concepts\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/docs.astronomer.io\/learn\/category\/airflow-concepts<\/a><\/li>\n\n\n\n<li>A comparison between Apache Airflow Executors \u2014 <a class=\"af ml\" 
href=\"https:\/\/maxcotec.com\/learning\/apache-airflow-architecture\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/maxcotec.com\/learning\/apache-airflow-architecture\/<\/a><\/li>\n\n\n\n<li>Types of Executors in Apache Airflow \u2014 <a class=\"af ml\" href=\"https:\/\/medium.com\/international-school-of-ai-data-science\/executors-in-apache-airflow-148fadee4992#:~:text=Airflow%20executors%20are%20the%20mechanism,execute%20tasks%20serially%20or%20parallelly\" rel=\"noopener\">https:\/\/medium.com\/international-school-of-ai-data-science\/executors-in-apache-airflow-148fadee4992#:~:text=Airflow%20executors%20are%20the%20mechanism,execute%20tasks%20serially%20or%20parallelly<\/a>.<\/li>\n\n\n\n<li>A deep dive into creating a DAG in Airflow \u2014 <a class=\"af ml\" href=\"https:\/\/marclamberti.com\/blog\/airflow-dag-creating-your-first-dag-in-5-minutes\/?utm_content=cmp-true\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/marclamberti.com\/blog\/airflow-dag-creating-your-first-dag-in-5-minutes\/?utm_content=cmp-true<\/a><\/li>\n\n\n\n<li>Apache Airflow Official Documentation on Airflow \u2014 <a class=\"af ml\" href=\"https:\/\/airflow.apache.org\/docs\/apache-airflow\/1.10.6\/tutorial.html?highlight=email\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/airflow.apache.org\/docs\/apache-airflow\/1.10.6\/tutorial.html?highlight=email<\/a><\/li>\n\n\n\n<li>Data Pipelines with Apache Airflow E-book (Highly Recommended ) \u2014 <a class=\"af ml\" href=\"https:\/\/www.manning.com\/books\/data-pipelines-with-apache-airflow\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/www.manning.com\/books\/data-pipelines-with-apache-airflow<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>In the previous article, you were introduced to the intricacies of data pipelines, including the two major types of existing data pipelines. 
You also learned how to build an Extract Transform Load (ETL) pipeline and discovered the automation capabilities of Apache Airflow for ETL pipelines. In the second part of this series, you will delve [&hellip;]<\/p>\n","protected":false},"author":109,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[7],"tags":[],"coauthors":[207],"class_list":["post-8144","post","type-post","status-publish","format-standard","hentry","category-tutorials"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Supercharging Your Data Pipeline with Apache Airflow (Part 2)<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Supercharging Your Data Pipeline with Apache Airflow (Part 2)\" \/>\n<meta property=\"og:description\" content=\"In the previous article, you were introduced to the intricacies of data pipelines, including the two major types of existing data pipelines. You also learned how to build an Extract Transform Load (ETL) pipeline and discovered the automation capabilities of Apache Airflow for ETL pipelines. 
In the second part of this series, you will delve [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-11-09T16:29:05+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:04:32+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/1*xc3ZcBjXdjJdg4hkSL6-tQ.png\" \/>\n<meta name=\"author\" content=\"Ilekura Idowu\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Ilekura Idowu\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"20 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Supercharging Your Data Pipeline with Apache Airflow (Part 2)","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2","og_locale":"en_US","og_type":"article","og_title":"Supercharging Your Data Pipeline with Apache Airflow (Part 2)","og_description":"In the previous article, you were introduced to the intricacies of data pipelines, including the two major types of existing data pipelines. You also learned how to build an Extract Transform Load (ETL) pipeline and discovered the automation capabilities of Apache Airflow for ETL pipelines. 
In the second part of this series, you will delve [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-11-09T16:29:05+00:00","article_modified_time":"2025-04-24T17:04:32+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/1*xc3ZcBjXdjJdg4hkSL6-tQ.png","type":"","width":"","height":""}],"author":"Ilekura Idowu","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Ilekura Idowu","Est. reading time":"20 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2\/"},"author":{"name":"Ilekura Idowu","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/3b235c2e92480cdaeb3300be2a77f89d"},"headline":"Supercharging Your Data Pipeline with Apache Airflow (Part 2)","datePublished":"2023-11-09T16:29:05+00:00","dateModified":"2025-04-24T17:04:32+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2\/"},"wordCount":3433,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/1*xc3ZcBjXdjJdg4hkSL6-tQ.png","articleSection":["Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2\/","url":"https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2","name":"Supercharging Your Data 
Pipeline with Apache Airflow (Part 2)","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/1*xc3ZcBjXdjJdg4hkSL6-tQ.png","datePublished":"2023-11-09T16:29:05+00:00","dateModified":"2025-04-24T17:04:32+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2#primaryimage","url":"https:\/\/miro.medium.com\/v2\/1*xc3ZcBjXdjJdg4hkSL6-tQ.png","contentUrl":"https:\/\/miro.medium.com\/v2\/1*xc3ZcBjXdjJdg4hkSL6-tQ.png"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/supercharging-your-data-pipeline-with-apache-airflow-part-2#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Supercharging Your Data Pipeline with Apache Airflow (Part 2)"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/3b235c2e92480cdaeb3300be2a77f89d","name":"Ilekura Idowu","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/5d438e5e9b180fd4da90db89e3fe9fc6","url":"https:\/\/secure.gravatar.com\/avatar\/761e6e727594cef4b3a1492abc4aaf6ca722954d596962da5e3d2b924a4a046b?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/761e6e727594cef4b3a1492abc4aaf6ca722954d596962da5e3d2b924a4a046b?s=96&d=mm&r=g","caption":"Ilekura 
Idowu"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/ilekuraidowugmail-com\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8144","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/109"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=8144"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8144\/revisions"}],"predecessor-version":[{"id":15454,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8144\/revisions\/15454"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=8144"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=8144"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=8144"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=8144"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}