Athena
Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
built on top of Hadoop
initially developed by Facebook
queries are written in HiveQL (a SQL-like query language)
converts HiveQL to MapReduce jobs
used by Athena as a metastore or catalog
Athena provides an abstraction over Hive
Tables are managed with Hive DDL statements (see the DDL sketch after this list)
Table definitions are stored in a catalog or metastore
SerDe (Serializer/Deserializer) libraries provide data format support
E.g. for JSON data => a JSON SerDe, for CSV data => a CSV SerDe
Schema is projected onto your data files
Schema changes are ACID-compliant
Tables are always external
Table partitioning is supported
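The table-management points above can be made concrete with a minimal sketch, assuming the boto3 Athena client: it issues a Hive-style CREATE EXTERNAL TABLE statement with a JSON SerDe and a partition column. The database, bucket, table, and column names are hypothetical.

```python
# Sketch: create an external, partitioned table in Athena with a JSON SerDe.
# Bucket, database, table, and column names below are hypothetical examples.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS app_logs (
    request_id  string,
    user_id     string,
    status_code int
)
PARTITIONED BY (dt string)                             -- table partitioning
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'  -- JSON SerDe for JSON files
LOCATION 's3://my-data-bucket/app-logs/'               -- tables are always external, data stays in S3
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```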
json
csv
tsv
CloudTrail logs
Apache Web logs
Parquet
ORC
Avro
Snappy
Zlib
GZIP
LZO
BZip2
Splittable formats (e.g. BZip2)
Files can be scanned by multiple readers, allowing you to take advantage of parallel processing
Optimum file size (advice: > 120 MB)
Strike a balance between having too many small data files and having too few large data files
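A quick sketch of checking that balance, assuming boto3; the bucket and prefix are hypothetical. It lists the objects under a prefix and compares the average size against the guidance above.

```python
# Sketch: check average object size under an S3 prefix against the ~120 MB guidance.
# Bucket and prefix are hypothetical.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

sizes = [
    obj["Size"]
    for page in paginator.paginate(Bucket="my-data-bucket", Prefix="app-logs/")
    for obj in page.get("Contents", [])
]

if sizes:
    avg_mb = sum(sizes) / len(sizes) / (1024 ** 2)
    print(f"{len(sizes)} files, average size {avg_mb:.1f} MB")
    if avg_mb < 120:
        print("Files may be too small -- consider compacting them into larger ones.")
```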
Compression ratio
algorithms with high compression ratios produce smaller files but increase CPU consumption during decompression
Compressing your data reduces Athena cost: you pay per amount of data scanned, and the scanned size is measured on the data in its compressed state
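A back-of-the-envelope sketch of that cost effect; the per-TB price and the compression ratio below are assumptions used for illustration only.

```python
# Sketch: Athena billing is per data scanned, measured on the compressed size.
# The price per TB and the compression ratio are assumed for illustration.
PRICE_PER_TB_SCANNED = 5.00      # assumed price in USD per TB scanned
raw_size_tb = 10.0               # uncompressed dataset size
compression_ratio = 4.0          # assumed: compressed data is 4x smaller

cost_uncompressed = raw_size_tb * PRICE_PER_TB_SCANNED
cost_compressed = (raw_size_tb / compression_ratio) * PRICE_PER_TB_SCANNED

print(f"Full scan, uncompressed: ${cost_uncompressed:.2f}")
print(f"Full scan, compressed:   ${cost_compressed:.2f}")
```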
Open-source distributed SQL engine
ANSI-compliant (standard) SQL
Reads schema information from a remote Hive metastore (the Athena catalog)
Presto needs to know where the data is and what format it is in
⚠️Does not use MapReduce
Handles datasets of any size
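A minimal sketch of submitting a standard-SQL query to Athena (which executes it on Presto) and polling for completion, assuming boto3; the database, table, and result bucket are hypothetical.

```python
# Sketch: run an ANSI-SQL query through Athena and wait for it to finish.
# Database, table, and result bucket are hypothetical.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

resp = athena.start_query_execution(
    QueryString="SELECT status_code, count(*) AS hits FROM app_logs GROUP BY status_code",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query leaves the QUEUED/RUNNING states.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```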
Analytic query workloads
Queries that touch large numbers of rows but read only a few columns benefit from columnar formats such as Parquet and ORC
Compression algorithm
Parquet and ORC compress by default; however, the compression algorithm they use can be configured and is worth experimenting with
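A sketch of choosing the codec explicitly when writing Parquet, assuming pandas with the pyarrow engine installed; the file paths are hypothetical.

```python
# Sketch: convert a CSV file to Parquet and pick the compression codec explicitly.
# File paths are hypothetical; requires pandas with the pyarrow engine installed.
import pandas as pd

df = pd.read_csv("app_logs.csv")

# Parquet output is compressed by default (snappy); the codec can be changed per write.
df.to_parquet("app_logs_snappy.parquet", compression="snappy")
df.to_parquet("app_logs_gzip.parquet", compression="gzip")
```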
Redshift
Elastic MapReduce