Athena


Last updated 2 years ago


Apache Hive

  • data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL

  • built on top of Hadoop

  • initially developed by Facebook

  • uses HiveQL (SQL-like queries) to query

  • converts HiveQL to MapReduce jobs

  • used by Athena as a metastore or catalog

  • Athena provides an abstraction over Hive

Athena databases and tables

  • managed with Hive DDL statements

  • stored in a catalog or metastore

  • SerDe libraries provide data format support

    • SerDe = Serializer / Deserializer. E.g. JSON data is read with a JSON SerDe, CSV data with a CSV SerDe

  • Schema is projected onto your data files (schema-on-read)

  • Schema changes are ACID-compliant

  • Tables are always external

  • Table partitioning is supported
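
The points above (Hive DDL, SerDe, external tables, partitioning) can be sketched as a small helper that assembles a `CREATE EXTERNAL TABLE` statement. This is a minimal sketch, not an official template: the table name, columns, partition key, and bucket path are hypothetical, and `org.openx.data.jsonserde.JsonSerDe` is just one of the JSON SerDe classes commonly used with Athena.

```python
def build_create_table_ddl(table, columns, partitions, serde, location):
    """Return a Hive DDL statement for an external, partitioned table."""
    cols = ",\n  ".join(f"`{name}` {type_}" for name, type_ in columns)
    parts = ", ".join(f"`{name}` {type_}" for name, type_ in partitions)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"ROW FORMAT SERDE '{serde}'\n"
        f"LOCATION '{location}'"
    )


# All names below are hypothetical and only illustrate the shape of the DDL.
ddl = build_create_table_ddl(
    table="logs_db.app_events",
    columns=[("user_id", "string"), ("event", "string"), ("ts", "timestamp")],
    partitions=[("dt", "string")],               # partition column
    serde="org.openx.data.jsonserde.JsonSerDe",  # JSON SerDe class
    location="s3://example-bucket/app-events/",  # data stays in S3 (external table)
)
print(ddl)
```

The statement only registers a schema in the catalog; the files under `LOCATION` are not moved or rewritten, which is what "schema is projected onto your data files" means in practice.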

Data formats

  • json

  • csv

  • tsv

  • CloudTrail logs

  • Apache Web logs

  • Parquet

  • ORC

  • Avro

Compression

  • Snappy

  • Zlib

  • GZIP

  • LZO

  • BZip2

Compression considerations

  • Splittable formats (e.g. BZip2)

    • Files can be scanned by multiple readers allowing you to take advantage of parallel processing

  • Optimal file size (advice: > 120 MB)

    • Strike a balance between having too many small data files and having too few large data files

  • Compression ratio

    • Algorithms with higher compression ratios make files smaller but cost more CPU to decompress
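
To get a feel for the compression-ratio trade-off, the sketch below compresses the same sample with three codecs from the Python standard library. It is only an illustration: the sample data is made up, Snappy and LZO are not in the standard library and are left out, and real ratios depend heavily on the data.

```python
import bz2
import gzip
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of `data` raw and after each codec."""
    return {
        "raw": len(data),
        "zlib": len(zlib.compress(data)),
        "gzip": len(gzip.compress(data)),
        "bz2": len(bz2.compress(data)),
    }


# Hypothetical, highly repetitive CSV-like sample.
sample = b"2024-01-01,user-42,page_view,/home\n" * 10_000
sizes = compressed_sizes(sample)
for name, size in sizes.items():
    print(f"{name:>4}: {size:>8} bytes  ratio {sizes['raw'] / size:.1f}x")
```

Wrapping each call with `time.perf_counter()` would show the CPU side of the trade-off as well.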

Compressing data reduces the Athena bill: you pay for the amount of data scanned, and that amount is measured in its compressed state.
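
A back-of-the-envelope version of that reasoning, as a sketch. The price of 5 USD per TB scanned and the 4x compression ratio are assumptions for illustration only; check the current pricing for your region.

```python
TB = 1024 ** 4  # bytes in one terabyte (binary)


def scan_cost_usd(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Estimate query cost from bytes scanned (price is an assumed default)."""
    return bytes_scanned / TB * usd_per_tb


raw_bytes = 2 * TB                    # hypothetical uncompressed dataset
compression_ratio = 4.0               # hypothetical ratio
compressed_bytes = int(raw_bytes / compression_ratio)

raw_cost = scan_cost_usd(raw_bytes)
compressed_cost = scan_cost_usd(compressed_bytes)
print(f"uncompressed scan: ${raw_cost:.2f}")
print(f"compressed scan:   ${compressed_cost:.2f}")
```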

Presto SQL Engine

  • Open-source distributed SQL engine

  • ANSI-compliant (standard) SQL

  • Reads info about schema from remote Hive metastore (Athena catalog)

    • Presto needs to know where the data is and what kind of data it is

  • Scales to datasets of any size
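
As a rough sketch of how a standard SQL query reaches the engine, the snippet below builds the request for the `start_query_execution` call of the boto3 Athena client. The database, table, and output bucket are hypothetical; boto3 is a third-party package and is imported lazily so the request builder can be run without AWS credentials.

```python
def build_query_request(sql: str, database: str, output_location: str) -> dict:
    """Build keyword arguments for the Athena StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_query(request: dict) -> str:
    """Submit the query and return its execution id (needs boto3 + credentials)."""
    import boto3  # third-party; imported here so the builder works without it

    client = boto3.client("athena")
    return client.start_query_execution(**request)["QueryExecutionId"]


# Hypothetical names, for illustration only.
request = build_query_request(
    sql="SELECT event, count(*) AS n FROM app_events "
        "WHERE dt = '2024-01-01' GROUP BY event",
    database="logs_db",
    output_location="s3://example-bucket/athena-results/",
)
print(request)
```

Filtering on the partition column (`dt` here) limits the scan to the matching partition, which reduces both runtime and cost.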

Columnar Format considerations

  • Analytic query workloads

    • Queries that concern large numbers of rows, but only a few columns will benefit from this kind of format

  • Compression algorithm

    • Parquet and ORC compress by default. The default algorithm can be changed, so it is worth experimenting with alternatives.
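
Why queries over many rows but few columns benefit from a columnar layout can be shown with a toy model. This is a simplified sketch that counts characters instead of real on-disk bytes, and it ignores encoding, compression, and file metadata; the records are made up.

```python
def bytes_read_row_layout(rows):
    """A row-oriented reader has to read every field of every row."""
    return sum(len(str(value)) for row in rows for value in row.values())


def bytes_read_columnar_layout(columns, wanted):
    """A columnar reader only touches the columns the query asks for."""
    return sum(len(str(value)) for name in wanted for value in columns[name])


# Hypothetical records: two small fields and one wide payload field.
rows = [
    {"user_id": f"user-{i:04d}", "event": "page_view", "payload": "x" * 200}
    for i in range(1000)
]
# The same data pivoted into one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

row_bytes = bytes_read_row_layout(rows)
col_bytes = bytes_read_columnar_layout(columns, wanted=["event"])
print(f"row layout:      {row_bytes} bytes read")
print(f"columnar layout: {col_bytes} bytes read")
```

Since Athena charges by data scanned, reading only the needed columns lowers cost as well as latency.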

Other solutions

  • Redshift

  • Elastic MapReduce

Note: Athena's query engine (Presto) does not use MapReduce, unlike classic Hive.
