Athena

Apache Hive

  • data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL

  • built on top of Hadoop

  • initially developed by Facebook

  • uses HiveQL (a SQL-like language) for queries

  • converts HiveQL to MapReduce jobs

  • used by Athena as a metastore or catalog

  • Athena provides an abstraction over Hive

Athena databases and tables

  • managed with Hive DDL statements

  • stored in a catalog or metastore

  • SerDe libraries provide data format support

    • Serializer / Deserializer libraries, e.g. for JSON data => JSON SerDe, for CSV => CSV SerDe

  • Schema is projected onto your data files

  • Schema changes are ACID-compliant

  • Tables are always external

  • Table partitioning is supported
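The bullets above come together in a single DDL statement. A minimal sketch, built as a Python string a client could submit to Athena; the database, table, column, and bucket names are hypothetical, and the OpenX JSON SerDe is just one of the SerDe libraries Athena supports:

```python
# Hypothetical Athena/Hive DDL: the table is EXTERNAL (Athena tables always
# are), uses a JSON SerDe for format support, and is partitioned by date.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS logs.web_requests (
  request_id string,
  status int,
  url string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-bucket/web-requests/'
"""
```

Note that the statement only projects a schema onto files already sitting in S3; dropping the table removes the catalog entry, not the data.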

Supported data formats

  • json

  • csv

  • tsv

  • CloudTrail logs

  • Apache Web logs

  • Parquet

  • ORC

  • Avro

Supported compression formats

  • Snappy

  • Zlib

  • GZIP

  • LZO

  • BZip2

Performance considerations

  • Splittable formats (e.g. BZip2)

    • Files can be scanned by multiple readers allowing you to take advantage of parallel processing

  • Optimum file size (advice: > 120 MB)

    • Strike a balance between having too many small data files and having too few large data files

  • Compression ratio

    • algorithms with high compression ratios make files smaller but also increase the CPU cost of decompression
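The parallelism benefit of splittable formats can be sketched with simple arithmetic. The 128 MB split size below is an assumption for illustration, not a documented Athena constant:

```python
import math

def read_splits(file_size_mb: float, splittable: bool,
                split_size_mb: float = 128) -> int:
    """Rough count of readers that can scan one file in parallel.

    A splittable file (e.g. BZip2) can be divided into independent splits;
    a non-splittable file (e.g. GZIP) must be read by a single reader.
    """
    if not splittable:
        return 1
    return max(1, math.ceil(file_size_mb / split_size_mb))

# A 1 GB splittable file can feed 8 parallel readers;
# the same data as one GZIP file is scanned by a single reader.
```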

Compressing your data reduces what you pay Athena: you are billed per amount of data scanned, and it is the compressed size that counts, so less data scanned means a cheaper query.
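A back-of-the-envelope cost sketch. The $5 per TB scanned used here is Athena's common on-demand rate but varies by region, and the 4:1 compression ratio is an assumed figure:

```python
def athena_scan_cost(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    """Cost of a query that scans `bytes_scanned` bytes.

    Athena bills on the (compressed) data it actually scans.
    """
    return bytes_scanned / 1024**4 * price_per_tb

raw = 1 * 1024**4        # 1 TB of uncompressed data
compressed = raw // 4    # assumed ~4:1 compression ratio
# Scanning the raw data costs $5.00; the compressed data, $1.25.
```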

Presto SQL Engine

  • Open-source distributed SQL engine

  • ANSI-compliant (standard) SQL

  • Reads info about schema from remote Hive metastore (Athena catalog)

    • Presto needs to know where the data is and what format it is in

  • ⚠️Does not use MapReduce

  • Size of dataset does not matter

Columnar Format considerations

  • Analytic query workloads

    • Queries that touch a large number of rows but only a few columns will benefit from this kind of format

  • Compression algorithm

    • Parquet and ORC compress by default. However, it is worth experimenting with the compression algorithms they use.
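The column-pruning advantage can be estimated with a toy model that assumes all columns are equal width, purely for illustration:

```python
def bytes_scanned(total_bytes: int, n_columns: int, columns_read: int,
                  columnar: bool) -> int:
    """Rough scan estimate for a query touching `columns_read` columns.

    A columnar format (Parquet/ORC) reads only the requested columns;
    a row-oriented format must read every row in full.
    Assumes equal-width columns -- a simplification for illustration.
    """
    if columnar:
        return total_bytes * columns_read // n_columns
    return total_bytes

# 100 GB table with 50 columns, query selecting 2 of them:
# a row format scans all 100 GB, a columnar format roughly 4 GB.
```

Since Athena bills per byte scanned, this pruning translates directly into lower query cost as well as faster queries.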

Other solutions

  • Redshift

  • Elastic MapReduce
