Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.
It's a very large relational database traditionally used in big data applications.
ETL stands for extract, transform, load. There was an accompanying gold-mining example of how ETL works.
EMR is a managed big data platform that allows you to process vast amounts of data using open-source tools such as Spark, Hive, HBase, Flink, Hudi and Presto.
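To make the three ETL stages concrete, here is a minimal sketch in plain Python. The CSV data, the field names and the three function names are all made up for illustration, and the "load" step only builds a JSON Lines string rather than writing to a real warehouse.

```python
import csv
import io
import json

# Hypothetical raw input; in practice this would come from a source system.
RAW_CSV = """order_id,amount,currency
1,10.50,usd
2,,usd
3,7.25,eur
"""

def extract(text):
    """Extract: read raw rows from the CSV text."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with no amount, then fix types and casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append({
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
            "currency": row["currency"].upper(),
        })
    return cleaned

def load(rows):
    """Load: serialise as JSON Lines, standing in for a warehouse write."""
    return "\n".join(json.dumps(r) for r in rows)

output = load(transform(extract(RAW_CSV)))
print(output)
```

In a real pipeline each stage would talk to an external system (for example S3 on the way in and Redshift on the way out), but the shape stays the same.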
It is AWS's ETL tool.
When you spin up an EMR cluster, it will live inside of a VPC. The focus here is on running it on EC2 instances, although it can also run on EKS or Outposts.
EMR will spin up the instances and manage them for us, and store the processed values in S3.
In the dashboard after spinning up the cluster, you can access the application user interface to find the relevant UIs corresponding to the cluster.
From there, you could start organising your ETL workloads.
If you look at the EC2 setup, you'll see that there are 4 EC2 instances (one is a bastion).
Exam tip: it is a collection of open-source services. It is a managed architecture that AWS helps you get up and running.
Other tips:
Kinesis allows you to ingest, process and analyze real-time streaming data.
The analogy given was a highway that gets data from point A to point B.
There are two major forms: Data Streams and Data Firehose.
When provisioning data streams, you need to provision and scale how many shards are required to handle messages from the producers.
You also need to manage all the consumers. Consumers take that data in, process it and put it wherever you want it.
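As a rough illustration of what "provisioning shards" involves, the sketch below estimates a shard count from expected traffic. The per-shard limits (1 MB/s or 1,000 records/s in, 2 MB/s out, shared between consumers) are the commonly documented figures for provisioned streams and are treated here as assumptions, so check the current quotas. The function name `shards_needed` is hypothetical.

```python
import math

# Assumed per-shard limits for provisioned Kinesis Data Streams.
WRITE_BYTES_PER_SEC = 1_000_000    # 1 MB/s ingest per shard
WRITE_RECORDS_PER_SEC = 1_000      # 1,000 records/s ingest per shard
READ_BYTES_PER_SEC = 2_000_000     # 2 MB/s read per shard, shared by consumers

def shards_needed(records_per_sec, avg_record_bytes, consumers=1):
    """Return the smallest shard count that satisfies all three limits."""
    ingest_bytes = records_per_sec * avg_record_bytes
    by_bytes = math.ceil(ingest_bytes / WRITE_BYTES_PER_SEC)
    by_count = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC)
    # Every consumer reads the full stream, so read load scales with consumers.
    by_read = math.ceil(ingest_bytes * consumers / READ_BYTES_PER_SEC)
    return max(1, by_bytes, by_count, by_read)

print(shards_needed(5000, 500))
```

With 5,000 small records per second the record-count limit dominates; with fewer, larger records or several consumers, the byte limits take over.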
Data Firehose is simpler.
With Firehose, you send data to the service and it will send it to the services that it supports.
This can help with analysing the data as it goes through.
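For comparison with Data Streams, sending to Firehose needs no shard or consumer management on the producer side. A minimal sketch using boto3's `put_record` follows; the stream name `example-delivery-stream` and both function names are hypothetical, and `send_event` needs boto3 plus AWS credentials, so it is defined but not called here.

```python
import json

def build_firehose_record(event):
    """Encode one event as newline-delimited JSON bytes."""
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}

def send_event(event, stream_name="example-delivery-stream"):
    """Send one event to Firehose (stream name is a placeholder)."""
    import boto3  # imported lazily so the builder works without boto3
    client = boto3.client("firehose")
    return client.put_record(
        DeliveryStreamName=stream_name,
        Record=build_firehose_record(event),
    )

print(build_firehose_record({"user": "alice", "action": "login"}))
```

The trailing newline is a common convention so that records delivered to S3 end up one per line.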
If we're looking for a message broker, which do we pick?
SQS is a message broker that is simple to use and doesn't require much configuration. It doesn't offer real-time message delivery.
Kinesis is more complicated to configure and is mostly used in big data apps. It does provide real-time communication.
Exam tips:
Athena is an interactive query service that makes it easy to analyze data in S3 using SQL.
This allows you to directly query data in your S3 bucket without loading it into a database.
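A minimal sketch of what querying S3 data with SQL looks like through boto3's `start_query_execution`. The database, table and bucket names are placeholders, and the helper names are hypothetical. `run_query` needs boto3 and AWS credentials, so it is defined but not called here.

```python
def build_athena_request(sql, database, output_location):
    """Build the keyword arguments for Athena's start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }

def run_query(request):
    """Submit the query and return its execution ID."""
    import boto3  # imported lazily so the builder works without boto3
    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]

request = build_athena_request(
    sql="SELECT status, COUNT(*) FROM access_logs GROUP BY status",
    database="example_db",                          # placeholder
    output_location="s3://example-bucket/athena/",  # placeholder
)
print(request)
```

The query runs asynchronously: the returned ID is what you would poll to find out when the results are ready.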
Glue is a serverless data integration service that makes it easy to discover, prepare and combine data.
It allows you to run ETL workloads without managing the underlying servers. For those workloads, it can effectively replace EMR.
The architecture diagram for this is simple.
Basically it is an S3 data lake -> AWS Glue Crawlers -> AWS Glue Data Catalog.
From here, we have a few options. One is Amazon Redshift Spectrum, the other is Amazon Athena -> Amazon QuickSight.
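The first hop of that pipeline (S3 data lake -> crawler -> Data Catalog) can be sketched with boto3's Glue client. The crawler name, role ARN, database and S3 path below are placeholders, and the helper names are hypothetical. `create_and_start` needs boto3 and AWS credentials, so it is defined but not called here.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Build the keyword arguments for Glue's create_crawler."""
    return {
        "Name": name,
        "Role": role_arn,               # IAM role the crawler assumes
        "DatabaseName": database,       # Data Catalog database to populate
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

def create_and_start(request):
    """Create the crawler, then kick off its first run."""
    import boto3  # imported lazily so the builder works without boto3
    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])

request = build_crawler_request(
    name="example-crawler",                                   # placeholder
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",  # placeholder
    database="example_db",                                    # placeholder
    s3_path="s3://example-bucket/raw/",                       # placeholder
)
print(request)
```

Once the crawler has populated the catalog, the tables it discovers are what Athena or Redshift Spectrum would query.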
Exam tips:
QuickSight is a fully managed BI data visualization service. Easily make dashboards and share them with your company.
Where does it fit in? It sits in front of Athena.
The exam tips:
Fully managed version of the open-source application Elasticsearch.
It allows you to quickly search over your stored data and analyze the data you get back. Commonly used as part of the ELK stack.
The exam tips: