

Airflow - Writing your own Operators and Hooks

At B6 we use Apache Airflow to manage the scheduled jobs that move data into and out of our various systems. We have built Airflow pipelines for jobs such as moving data out of our CRM (Salesforce) into our data warehouse and ingesting real estate sale and mortgage data from external sources.

One of the reasons we were attracted to Airflow was its concept of Operators and Hooks. An Operator "describes a single task in a workflow. Operators are usually (but not always) atomic, meaning they can stand on their own and don't need to share resources with any other operators." Hooks are "interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, HDFS, and Pig. Hooks implement a common interface when possible, and act as a building block for operators." In other words, Operators are the workhorses that execute the tasks that define a DAG, and Operators make use of Hooks to communicate with external databases or systems.

When first working with Airflow, you might be impressed by the number of built-in Operators and Hooks available to you, things such as RedshiftToS3Transfer, PostgresOperator, MySqlHook, and SlackHook. These are the basic tools in the Airflow toolkit. But like any good framework, the real power comes from the ability to customize and extend. Airflow allows this by giving developers the ability to create their own Operators and Hooks, which they can build according to their specific needs and use cases.
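To make the Operator/Hook relationship concrete, here is a minimal sketch (not taken from our codebase) of a DAG that pairs a built-in Operator with a Hook. The connection id 'warehouse', the 'listings' table, and the task names are placeholders, and the import paths follow the Airflow 1.x module layout; in Airflow 2 the same classes live in provider packages.

```python
from datetime import datetime

from airflow import DAG
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.python_operator import PythonOperator


def row_count(**context):
    # A Hook hides the connection details; the task only asks for data.
    hook = PostgresHook(postgres_conn_id='warehouse')
    return hook.get_first('SELECT COUNT(*) FROM listings')[0]


with DAG(dag_id='operators_and_hooks_demo',
         start_date=datetime(2019, 1, 1),
         schedule_interval='@daily') as dag:

    # A built-in Operator: a complete task that runs a SQL statement,
    # using a Postgres Hook under the hood.
    refresh = PostgresOperator(
        task_id='refresh_listings',
        postgres_conn_id='warehouse',
        sql='REFRESH MATERIALIZED VIEW listings;',
    )

    # A generic PythonOperator whose callable uses a Hook directly.
    count = PythonOperator(
        task_id='count_listings',
        python_callable=row_count,
        provide_context=True,
    )

    refresh >> count
```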


After writing a few DAGs we noticed we had a pattern of downloading a file from our data lake (S3), processing it, and then uploading it to Salesforce or back to S3. While it is certainly possible to use a simple PythonOperator and boto3 to interact with S3 and download a file, why not encapsulate this behavior in a new Operator that we could then use in other DAGs? We made use of the built-in S3 Hook to make things simpler.
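The sketch below shows what such an Operator could look like; it is illustrative rather than our production code. It assumes the Airflow 1.x layout (airflow.hooks.S3_hook) and an 'aws_default' AWS connection, and the class name S3DownloadOperator and its parameters are made up for this example.

```python
import os

from airflow.hooks.S3_hook import S3Hook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class S3DownloadOperator(BaseOperator):
    """Download a single key from S3 to a local directory using the built-in S3 Hook."""

    # Allow the key and destination to be templated with Jinja (e.g. {{ ds }}).
    template_fields = ('s3_key', 'local_path')

    @apply_defaults
    def __init__(self,
                 s3_bucket,
                 s3_key,
                 local_path,
                 aws_conn_id='aws_default',
                 *args, **kwargs):
        super(S3DownloadOperator, self).__init__(*args, **kwargs)
        self.s3_bucket = s3_bucket
        self.s3_key = s3_key
        self.local_path = local_path
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        # The S3 Hook wraps boto3 and reads credentials from an Airflow
        # connection, so the operator never handles secrets directly.
        hook = S3Hook(aws_conn_id=self.aws_conn_id)
        obj = hook.get_key(self.s3_key, bucket_name=self.s3_bucket)

        destination = os.path.join(self.local_path, os.path.basename(self.s3_key))
        obj.download_file(destination)

        self.log.info('Downloaded s3://%s/%s to %s',
                      self.s3_bucket, self.s3_key, destination)
        return destination
```

Because the bucket, key, and credentials are all passed in as arguments, the same operator can be dropped into any DAG that needs to pull a file out of the data lake, instead of repeating boto3 boilerplate in each PythonOperator callable.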
