Athena Workshop

Know How Guide and Hands on Guide for AWS

Athena Workshop

Preparation

Using AWS CloudFormation to launch the envrionment of Labs

  1. Labs - Athena Basics
  2. Labs - Federated Queries
  3. Labs - User Defined Functions
  4. Labs - Custom Connector
  5. Labs - Machine Learning

Labs - Athena Basics

  1. Prepare Datasets
  2. Create Athena Table
  3. Create Tables with Glue
  4. Athena Create View
  5. ETL using Athena CTAS
  6. Use Athena workgroups to separate users, teams, applications, or workloads, to set limits on amount of data each query or the entire workgroup can process, and to track costs
  7. Visualize with QuickSight using Athena

Labs - Federated Queries

  1. Athena Federation - Query Data Anywhere

Athena uses data source connectors that run on AWS Lambda to execute federated queries. A data source connector is a piece of code that can translate between your target data source and Athena. athena_federation

  1. Prepare TPCH Database & Tables and Data Sources data_sources

  2. Aurora Connector setup by using Serverless Application Repository AthenaJdbcConnector

This will deploy Athena JDBC connector for Aurora Mysql and you can refer to this lambda function in your queries as “lambda:mysql”

  1. DynamoDB Connector setup by using Serverless Application Repository AthenaDynamoDBConnector

This will deploy Athena DynamoDB connector and you can refer to this lambda function in your queries as “lambda:dynamo”

  1. HBase Connector setup by using Serverless Application Repository AthenaHBaseConnector

This will deploy Athena HBase connector and you can refer to this lambda function in your queries as “lambda:hbase”

  1. Redis Connector setup by using Serverless Application Repository

This will deploy Athena Redis connector and you can refer to this lambda function in your queries as “lambda:redis”

Since Redis doesn’t have a schema of it’s own the Redis Connector can’t infer the columns or data type from Redis. Redis connector needs a Glue database and tables to be setup to be able to associate the data to the schema.

  1. Run Federated Queries

No needSwitch workgroup to AmazonAthenaPreviewFunctionality, you can keep in primary workgroup

Labs - User Defined Functions

Athena User Defined Function enables customers to write custom scalar functions and invoke them in SQL queries.

When a UDF is used in a SQL query submitted to Athena, it is invoked and executed on AWS Lambda. UDFs can be used in both SELECT and FILTER clauses of a SQL query. Users can invoke multiple UDFs in the same query.

UDFs enable customers to perform custom processing such as compressing and decompressing data, redacting sensitive data, or applying customized decryption.

  1. Setup the Cloud9
    • Resize the disk to 20GB.
    • Setting Development Environment
       git clone https://github.com/awslabs/aws-athena-query-federation.git 
       cd aws-athena-query-federation
       sudo chown ec2-user:ec2-user ~/.profile
       ./tools/prepare_dev_env.sh
       source ~/.profile
       mvn clean install -DskipTests=true > /tmp/log
      
  2. Create UDF Code and Publish to Serverless Application Repository

  3. UDF Connector Setup

  4. Querying with UDF to mask the sensitive data To learn more about UDF syntax click here : UDF Query Syntax

udf-result

Labs - Custom Connector

Deploy a custom Athena connector and then configure it and run Athena Federated query against this connector.

A custom connector can be useful when you have a legacy datasource or you want to have a connector to a datastore which doesn’t currently exist. In these cases you can write and deploy your own custom connector to Private Serverless Application Repository and use it.

Here is an example custom connector

  1. Develop Custom Connector Code
    ../tools/publish.sh S3_BUCKET_NAME athena-example
    
  2. Custom Connector Setup

  3. Querying with Connector
    • Upload Sample Data to the S3 Bucket for the Example Connector
       aws s3 cp ~/environment/aws-athena-query-federation/athena-example/sample_data.csv s3://athena-federation-workshop-744670048729/2017/11/1/sample_data.csv
      

Labs - Machine Learning

You can use Athena to write SQL statements that run Machine Learning (ML) inference using Amazon SageMaker. This feature simplifies access to ML models for data analysis, eliminating the need to use complex programming methods to run inference.

To use ML with Athena, you define an ML with Athena function with the USING FUNCTION clause.

  1. Setup SageMaker Notebook

  2. Download the Jupyter Notebook

  3. Run Jupyter Notebook

Train a RandomCutForest Model to detect anomalies and deploy the model to a SageMaker endpoint that the application or Athena can call

  1. Run Athena ML Query

sagemaker-result

As you can see from the results, With each data point, Random Cut Forest algorithm associates an anomaly score. Low score values indicate that the data point is considered “normal.” High values indicate the presence of an anomaly in the data. The definitions of “low” and “high” depend on the application but common practice suggests that scores beyond three standard deviations from the mean score are considered anomalous.

Reference

Workshop URL