# Athena

## Apache Hive

* data warehouse software that facilitates **reading, writing**, and managing **large datasets** residing **in distributed storage using SQL**
* built on top of Hadoop
* initially developed by Facebook
* uses HiveQL (SQL-like queries) to query
* converts HiveQL to MapReduce jobs
* used by Athena as a metastore or catalog
* Athena provides an abstraction over Hive

<figure><img src="https://415484505-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LxtoAXZwwOc4XGto8vb%2Fuploads%2Fzvp6i7n948FiD22BcR6h%2FScreenshot%202022-12-16%20at%2015.03.28.png?alt=media&#x26;token=536bb7ef-7d11-4a0e-9c2c-28ffb1d1e223" alt=""><figcaption></figcaption></figure>

## Athena databases and tables

* managed with Hive DDL statements

<figure><img src="https://415484505-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LxtoAXZwwOc4XGto8vb%2Fuploads%2FYmPB8dELdEz6AGzq6Gpp%2FScreenshot%202022-12-16%20at%2015.15.53.png?alt=media&#x26;token=a5250bdb-7b35-448d-8e01-d508ff044ea1" alt=""><figcaption></figcaption></figure>

* stored in a catalog or metastore
* SerDe (Serializer / Deserializer) libraries provide data format support
  * e.g. a JSON SerDe for JSON data, a CSV SerDe for CSV data
* Schema is projected onto your data files
* Schema changes are ACID-compliant
* Tables are always external
* Table partitioning is supported

## [Data formats](https://docs.aws.amazon.com/athena/latest/ug/supported-serdes.html)

* JSON
* CSV
* TSV
* CloudTrail logs
* Apache Web logs
* Parquet
* ORC
* Avro

### [Compression](https://docs.aws.amazon.com/athena/latest/ug/compression-formats.html)

* Snappy
* Zlib
* GZIP
* LZO
* BZip2

#### [Compression considerations](https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/)

* Splittable formats (e.g. BZip2)
  * Files can be scanned by multiple readers allowing you to take advantage of parallel processing
* Optimal file size (AWS advises at least ~128 MB per file)
  * Strike a balance between having too many small data files and having too few large data files
* Compression ratio
  * algorithms with high compression ratios produce smaller files but increase the CPU cost of decompression

Compression also reduces your Athena bill: you pay for the amount of data scanned per query, and that amount is measured in its compressed state, so smaller files mean cheaper queries.
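The cost effect can be sketched numerically. This assumes Athena's published price of $5 per TB scanned and a hypothetical 4:1 compression ratio (e.g. GZIP on text-heavy logs):

```python
def athena_scan_cost(data_tb: float, price_per_tb: float = 5.0) -> float:
    """Cost of a query that scans `data_tb` terabytes of data."""
    return data_tb * price_per_tb

raw_tb = 1.0     # uncompressed dataset size (assumption)
ratio = 4.0      # assumed compression ratio
compressed_tb = raw_tb / ratio

uncompressed_cost = athena_scan_cost(raw_tb)       # 5.0
compressed_cost = athena_scan_cost(compressed_tb)  # 1.25
```

A full scan of the compressed data costs a quarter of the uncompressed scan, before any columnar-format or partition pruning savings.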

## Presto SQL Engine

* Open-source distributed SQL engine
* ANSI-compliant (standard) SQL
* Reads schema information from a remote Hive metastore (the Athena catalog)
  * Presto needs to know where the data is and what kind of data it is
* :warning: Does not use MapReduce
* Works on datasets of any size

## Columnar Format considerations

* Analytic query workloads
  * Queries that touch a large number of rows but only a few columns benefit from this kind of format
* Compression algorithm
  * Parquet and ORC compress by default, but the compression algorithm they use can be tuned
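Why analytic queries benefit can be illustrated with a toy byte count, not real Parquet internals: when data is laid out column-by-column, a query that needs only one column scans far fewer bytes than it would scanning every full row. The table below is synthetic:

```python
import json

# Synthetic table of 1000 rows with three columns
rows = [{"user_id": i, "country": "DE", "amount": i * 1.5} for i in range(1000)]

# Row-oriented layout: to read `amount` you must scan every full row.
row_bytes_scanned = sum(len(json.dumps(r).encode()) for r in rows)

# Column-oriented layout: each column is stored contiguously,
# so a query scans only the `amount` column it needs.
amount_column = [r["amount"] for r in rows]
col_bytes_scanned = len(json.dumps(amount_column).encode())
```

In this sketch the columnar scan touches a small fraction of the bytes; real columnar formats add per-column compression and encoding on top, widening the gap further.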

## Other solutions

* Redshift

<figure><img src="https://415484505-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LxtoAXZwwOc4XGto8vb%2Fuploads%2FZ7y42KfQYdfF3yc50PRt%2FScreenshot%202022-12-23%20at%2015.48.16.png?alt=media&#x26;token=e62aa97c-c30c-48ad-9053-4c8d277fc072" alt=""><figcaption></figcaption></figure>

* Elastic MapReduce

<figure><img src="https://415484505-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-LxtoAXZwwOc4XGto8vb%2Fuploads%2FfcZkoTAJaEtIzQrm5Dmb%2FScreenshot%202022-12-23%20at%2016.00.46.png?alt=media&#x26;token=fc4b4999-f5e2-42fa-b77f-93047e2b7953" alt=""><figcaption></figcaption></figure>

