Amazon Athena is a serverless, interactive query service that lets you analyze data directly in Amazon S3 using standard SQL. With Athena there is no infrastructure to set up or manage: you point it at your S3 data and query it in place, making it a convenient tool for ad hoc data analysis.

### Presto/Trino Engine

Athena is built on top of the Presto distributed SQL query engine, which originated as a project within Facebook. Presto was designed for fast, interactive queries and operates on a wide variety of data formats. In Athena, this engine enables the service to perform queries efficiently. Newer Athena engine versions (engine version 3 onward) are based on Trino, the community fork of Presto, which continues that lineage with improved features and bug fixes, while earlier engine versions remain based on Presto.

### Schema-on-Read

One of the unique aspects of Athena is its “schema-on-read” capability. Unlike traditional databases where you define schemas at the time of data writing (“schema-on-write”), Athena allows you to define schemas at the time of query execution. This means you can apply different schemas to the same data stored in S3 when querying it, which provides flexibility and ease of use, especially for semi-structured or unstructured data.
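As a concrete sketch of schema-on-read (the bucket path, database, and column names here are hypothetical), two different external tables can point at the same S3 prefix; each schema is applied only when its table is queried, and the underlying files are never rewritten:

```sql
-- Full schema over a (hypothetical) prefix of comma-delimited files.
CREATE EXTERNAL TABLE IF NOT EXISTS my_database.events_full (
  id INT,
  name STRING,
  age INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/path/to/events/';

-- A narrower schema over the same files: only the leading columns are parsed.
CREATE EXTERNAL TABLE IF NOT EXISTS my_database.events_slim (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/path/to/events/';
```

Queries against `events_full` and `events_slim` read the same S3 objects but interpret them through different schemas at query time.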

### SQL Syntax

Athena supports ANSI-standard SQL, so users can write queries in the SQL dialect most analysts are already familiar with. This includes operations such as `SELECT`, `JOIN`, `GROUP BY`, `ORDER BY`, and more.
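For instance, a query combining several of these operations might look like the following (a sketch; the `orders` and `customers` tables and their columns are hypothetical):

```sql
-- Top customers by order count, joining two hypothetical tables.
SELECT c.name,
       COUNT(*) AS order_count
FROM my_database.orders o
JOIN my_database.customers c
  ON o.customer_id = c.id
GROUP BY c.name
ORDER BY order_count DESC
LIMIT 10;
```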

### Querying Parquet and CSV Data Examples

#### Example 1: Querying Parquet Data

Parquet is an optimized columnar storage format from the Hadoop ecosystem. Because a columnar layout lets the engine read only the columns a query references, Athena can query Parquet files efficiently and scan less data than it would for row-oriented formats.

1. **Create a Table for Parquet Data**

Assume you have Parquet files in an S3 bucket structured for Athena. First, you need to create a table that points to your S3 Parquet data:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS my_database.my_parquet_table (
  id INT,
  name STRING,
  age INT
)
STORED AS PARQUET
LOCATION 's3://my-bucket/path/to/parquet/';
```

2. **Query the Parquet Data**

Once the table is created, you can perform standard SQL queries on it:

```sql
SELECT id, name
FROM my_database.my_parquet_table
WHERE age > 25;
```

#### Example 2: Querying CSV Data

CSV is a ubiquitous data storage format, and Athena handles it effectively.

1. **Create a Table for CSV Data**

Suppose your CSV data is also stored in S3:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS my_database.my_csv_table (
  id INT,
  name STRING,
  age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/path/to/csv/';
```
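Note that the simple delimited layout above assumes no quoted fields and no header row. If your files have either, one common alternative is the OpenCSVSerDe (a sketch with the same hypothetical columns; this SerDe reads every column as `STRING`, so numeric columns must be cast in queries):

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS my_database.my_quoted_csv_table (
  id STRING,
  name STRING,
  age STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"'
)
LOCATION 's3://my-bucket/path/to/csv/'
TBLPROPERTIES ('skip.header.line.count' = '1');
```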

2. **Query the CSV Data**

With the table defined, you can write queries like:

```sql
SELECT name, COUNT(*) AS num_entries
FROM my_database.my_csv_table
GROUP BY name
ORDER BY num_entries DESC;
```
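Athena can also write query results back to S3. As a sketch tying the two examples together (the output path and table name are hypothetical), a `CREATE TABLE AS SELECT` (CTAS) statement can convert the CSV table into Parquet for cheaper, faster subsequent queries:

```sql
-- Materialize the CSV table's rows as Parquet files in a new S3 location.
CREATE TABLE my_database.my_parquet_from_csv
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/path/to/converted-parquet/'
) AS
SELECT id, name, age
FROM my_database.my_csv_table;
```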

### Conclusion

Amazon Athena makes querying and analyzing large datasets in S3, whether stored as Parquet or CSV, straightforward and cost-effective. By combining the Presto/Trino engine with schema-on-read, Athena provides flexible data lake analytics while letting users work in plain SQL.
