Athena

Apache Hive

  • data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL

  • built on top of Hadoop

  • initially developed by Facebook

  • uses HiveQL (a SQL-like language) for queries

  • converts HiveQL to MapReduce jobs

  • used by Athena as a metastore or catalog

  • Athena provides an abstraction over Hive

Athena databases and tables

  • managed with Hive DDL statements

  • stored in a catalog or metastore

  • SerDe libraries provide data format support

    • Serializer / Deserializer libraries, e.g. for JSON data => JSON SerDe, for CSV => CSV SerDe

  • Schema is projected onto your data files

  • Schema changes are ACID-compliant

  • Tables are always external

  • Table partitioning is supported
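The bullets above come together in a single DDL statement. A minimal sketch, built as a Python string a client could submit to Athena; the database, table, column, and bucket names are hypothetical, and the OpenX JSON SerDe is just one of the SerDe libraries Athena supports:

```python
# Hypothetical Athena/Hive DDL: the table is EXTERNAL (Athena tables always
# are), uses a JSON SerDe for format support, and is partitioned by date.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS logs.web_requests (
  request_id string,
  status int,
  url string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-bucket/web-requests/'
"""
```

Note that the statement only projects a schema onto files already sitting in S3; dropping the table removes the catalog entry, not the data.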

Supported data formats

  • json

  • csv

  • tsv

  • CloudTrail logs

  • Apache Web logs

  • Parquet

  • ORC

  • Avro

Supported compression formats

  • Snappy

  • Zlib

  • GZIP

  • LZO

  • BZip2

Performance considerations

  • Splittable formats (e.g. BZip2)

    • Files can be scanned by multiple readers allowing you to take advantage of parallel processing

  • Optimum file size (advice: > 120 MB)

    • Strike a balance between having too many small data files and having too few large data files

  • Compression ratio

    • algorithms with high compression ratios make files smaller but also increase the CPU cost of decompression
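The parallelism benefit of splittable formats can be sketched with simple arithmetic. The 128 MB split size below is an assumption for illustration, not a documented Athena constant:

```python
import math

def read_splits(file_size_mb: float, splittable: bool,
                split_size_mb: float = 128) -> int:
    """Rough count of readers that can scan one file in parallel.

    A splittable file (e.g. BZip2) can be divided into independent splits;
    a non-splittable file (e.g. GZIP) must be read by a single reader.
    """
    if not splittable:
        return 1
    return max(1, math.ceil(file_size_mb / split_size_mb))

# A 1 GB splittable file can feed 8 parallel readers;
# the same data as one GZIP file is scanned by a single reader.
```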

Compressing your data reduces what you pay Athena: you are billed per amount of data scanned, and it is the compressed size that counts, so less data scanned means a cheaper query.
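A back-of-the-envelope cost sketch. The $5 per TB scanned used here is Athena's common on-demand rate but varies by region, and the 4:1 compression ratio is an assumed figure:

```python
def athena_scan_cost(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    """Cost of a query that scans `bytes_scanned` bytes.

    Athena bills on the (compressed) data it actually scans.
    """
    return bytes_scanned / 1024**4 * price_per_tb

raw = 1 * 1024**4        # 1 TB of uncompressed data
compressed = raw // 4    # assumed ~4:1 compression ratio
# Scanning the raw data costs $5.00; the compressed data, $1.25.
```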

Presto SQL Engine

  • Open-source distributed SQL engine

  • ANSI-compliant (standard) SQL

  • Reads info about schema from remote Hive metastore (Athena catalog)

    • Presto needs to know where the data is and what format it is in

  • ⚠️Does not use MapReduce

  • Size of dataset does not matter

Columnar Format considerations

  • Analytic query workloads

    • Queries that touch a large number of rows but only a few columns will benefit from this kind of format

  • Compression algorithm

    • Parquet and ORC compress by default. However, it is worth experimenting with the compression algorithms they use.
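The column-pruning advantage can be estimated with a toy model that assumes all columns are equal width, purely for illustration:

```python
def bytes_scanned(total_bytes: int, n_columns: int, columns_read: int,
                  columnar: bool) -> int:
    """Rough scan estimate for a query touching `columns_read` columns.

    A columnar format (Parquet/ORC) reads only the requested columns;
    a row-oriented format must read every row in full.
    Assumes equal-width columns -- a simplification for illustration.
    """
    if columnar:
        return total_bytes * columns_read // n_columns
    return total_bytes

# 100 GB table with 50 columns, query selecting 2 of them:
# a row format scans all 100 GB, a columnar format roughly 4 GB.
```

Since Athena bills per byte scanned, this pruning translates directly into lower query cost as well as faster queries.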

Other solutions

  • Redshift

  • Elastic MapReduce
