Created:        2020-05-25 Mon
Last modified:  2022-05-14 Sat

Big Data Workflows on AWS

This post is an overview of a few articles from the AWS blog about big data workflows.

Orchestrate big data workflows with Apache Airflow, Genie, and Amazon EMR (Oct 2019)

CloudFormation Stack files:

Option 1

Apache Airflow + Genie + Amazon EMR and Amazon S3

../../../_images/AirflowGenieEMRPart2_1.png

Cluster

  • always-on clusters

  • transient clusters

Custom Airflow Operators:

  • GenieOperator

  • EMR Airflow Operator can be used to spin up Amazon EMR clusters that register with Genie, run a job, and tear them down
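The transient-cluster pattern above (spin up, run a job, tear down) can be sketched as a request payload for boto3's EMR `run_job_flow` call. This is only a minimal sketch, not the operator code from the article: the function name, bucket paths, instance types, and release label are hypothetical placeholders, and Genie registration is not covered. The block only builds and prints the payload; it does not call AWS.

```python
import json


def transient_cluster_request(name, script_uri, log_uri):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All concrete values (release label, instance types, counts) are
    illustrative assumptions, not settings taken from the article.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.36.0",  # assumption: any release that ships Spark
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # False makes the cluster transient: it shuts down once
            # all submitted steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "spark-job",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default EMR role names
        "ServiceRole": "EMR_DefaultRole",
    }


request = transient_cluster_request(
    "transient-demo",
    "s3://example-bucket/jobs/job.py",  # hypothetical script location
    "s3://example-bucket/logs/",        # hypothetical log location
)
print(json.dumps(request, indent=2))
# To launch for real (needs boto3 and AWS credentials):
#   boto3.client("emr").run_job_flow(**request)
```

An always-on cluster would set `KeepJobFlowAliveWhenNoSteps` to `True` and receive its jobs later, for example through Genie.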

Option 2

Apache Airflow + Apache Livy + Amazon EMR

Build a Concurrent Data Orchestration Pipeline Using Amazon EMR and Apache Livy (Jul 2018)

Apache Airflow + Apache Livy + Amazon EMR

../../../_images/Livy1.png

  • Apache Livy allows sending code (Scala or Python) to the cluster over REST API calls

  • Spark jobs sent through Livy can run in parallel, whereas the EMR Step API runs jobs serially

  • The sample code does not terminate the cluster reliably; EMR auto-termination is probably a better choice
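The bullets above can be illustrated with a small sketch that builds Livy REST requests: one to open an interactive session and one to send a code statement to it. The host name is a hypothetical placeholder (8998 is Livy's default port), and the requests are built but not sent. The last helper shows one reading of "EMR auto-termination", namely an idle-timeout policy in the shape boto3 accepts as `AutoTerminationPolicy`; whether that is what the note had in mind is an assumption.

```python
import json
import urllib.request

# Hypothetical EMR master address; 8998 is Livy's default port.
LIVY_URL = "http://emr-master.example.internal:8998"


def livy_post(path, payload, base_url=LIVY_URL):
    """Build (but do not send) a JSON POST request for the Livy REST API."""
    return urllib.request.Request(
        url=base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session(kind="pyspark"):
    """POST /sessions opens an interactive session ('spark' for Scala)."""
    return livy_post("/sessions", {"kind": kind})


def run_statement(session_id, code):
    """POST /sessions/{id}/statements sends a code snippet to a session."""
    return livy_post(f"/sessions/{session_id}/statements", {"code": code})


def auto_termination_policy(idle_hours):
    """Idle-timeout policy for EMR; IdleTimeout is expressed in seconds."""
    return {"IdleTimeout": idle_hours * 3600}


req = run_statement(0, "print(spark.range(10).count())")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To send it for real: urllib.request.urlopen(req)
```

Because every statement is an independent HTTP call, an orchestrator such as Airflow can keep several sessions open and post to them concurrently. The policy dictionary could be passed as `AutoTerminationPolicy=` to `run_job_flow`, or applied to a running cluster with `put_auto_termination_policy`.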

Orchestrate an ETL process using AWS Step Functions for Amazon Redshift (Jul 2019)

AWS Step Functions + AWS Lambda + AWS Batch

../../../_images/D.png
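To give a rough idea of what a Step Functions + Batch + Lambda workflow looks like, here is a minimal sketch of a state machine definition in Amazon States Language, built as a Python dictionary. It is not taken from the article's CloudFormation template: the state names, job name, and ARNs are hypothetical, and it uses the `batch:submitJob.sync` service integration, which may differ from how the article wires Lambda and Batch together.

```python
import json


def etl_state_machine(job_queue_arn, job_definition_arn, notify_lambda_arn):
    """Return an Amazon States Language definition for a one-step ETL run.

    State names and the job name are illustrative assumptions.
    """
    return {
        "Comment": "Run one ETL step as an AWS Batch job, notify on failure",
        "StartAt": "RunEtlJob",
        "States": {
            "RunEtlJob": {
                "Type": "Task",
                # .sync makes Step Functions wait until the Batch job ends.
                "Resource": "arn:aws:states:::batch:submitJob.sync",
                "Parameters": {
                    "JobName": "redshift-etl",
                    "JobQueue": job_queue_arn,
                    "JobDefinition": job_definition_arn,
                },
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 60,
                        "MaxAttempts": 2,
                        "BackoffRate": 2.0,
                    }
                ],
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
                "End": True,
            },
            # A Lambda function ARN can be used directly as a Task resource.
            "NotifyFailure": {
                "Type": "Task",
                "Resource": notify_lambda_arn,
                "Next": "Failed",
            },
            "Failed": {"Type": "Fail"},
        },
    }


definition = etl_state_machine(
    "arn:aws:batch:us-east-1:123456789012:job-queue/etl-queue",       # hypothetical
    "arn:aws:batch:us-east-1:123456789012:job-definition/etl-job:1",  # hypothetical
    "arn:aws:lambda:us-east-1:123456789012:function:notify-failure",  # hypothetical
)
print(json.dumps(definition, indent=2))
```

Even this single-step version needs retry, catch, and failure states, which gives a sense of how quickly the definition grows for a multi-step ETL process.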

My notes

  • Complicated workflow management

Other Useful/Useless Stuff

AWS Data Pipeline

AWS Glue

Well-known workflow schedule tools

  • Apache Oozie

  • Apache Airflow

  • Azkaban

  • Cron

Other