This blog introduces Wormhole, an open-source Dockerized solution for deploying Presto and Alluxio clusters for blazing-fast analytics on a file system (we use S3, GCS, and OSS). When it comes to analytics, people are generally hands-on with SQL queries and like to analyze data that resides in a warehouse (e.g. a MySQL database). But as data grows, these stores start struggling, and there arises a need for getting results in the same or a shorter time frame. This can be solved by distributed computing, and Presto is designed for exactly that. When attached to Alluxio, it works even faster. That’s what Wormhole is all about.
You may also enjoy: Alluxio Cluster Setup Using Docker
Here is the high-level architecture diagram of the solution:
Let us explain each component in the order in which they should be set up in Wormhole:
- Consul — Consul is a service networking solution to connect and secure services across any runtime platform and public or private cloud. In our setup, it lets containers discover one another by container ID. Setup instructions here.
- Docker — Docker is a set of platform-as-a-service products that use OS-level virtualization to deliver software in packages called containers. Containers are isolated from one another and bundle their own software, libraries and configuration files; they can communicate with each other through well-defined channels. In our setup, we have put every service as a Docker container. Setup instructions here.
- Alluxio Master — Alluxio masters should be deployed in High Availability (HA). They manage the filesystem metadata and coordinate the Alluxio workers; clients consult a master to locate the data blocks they need. Setup instructions here.
- Alluxio Worker — Alluxio workers are the actual cache storage of Alluxio. When data is queried from the Alluxio filesystem (FS), a worker fetches it from the underlying FS (such as S3, GCS, or any configured FS) and stores it in RAM, evicting in LRU fashion. On subsequent reads, it serves the data blocks from its own cache without going back to the underlying FS. Setup instructions are here.
- Hive metastore — The Hive metastore is the collection of metadata for all Hive tables. In our setup, the metastore is backed by MySQL and holds the metadata for tables whose data lives in Alluxio. Setup instructions are here.
- Presto Coordinator — Presto coordinators should be deployed in High Availability (HA). They are responsible for making the query execution plan, distributing the query to the workers, then joining the individual results and sending them back to the requester. Setup instructions are here.
- Presto Worker — Presto workers are responsible for crunching the data from the underlying Alluxio layer. They pull the data, perform operations on it, and send the individual results to the coordinator. Setup instructions are here.
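To tie the Presto and Alluxio pieces together, each Presto node needs a Hive catalog that points at the metastore; table locations stored in that metastore then resolve to alluxio:// paths. A minimal sketch of the catalog file (the `hive-metastore` hostname and file paths are illustrative assumptions, not values from our repo):

```properties
# etc/catalog/hive.properties on every Presto node (coordinator and workers)
connector.name=hive-hadoop2
# Thrift endpoint of the Hive metastore container (hostname is illustrative)
hive.metastore.uri=thrift://hive-metastore:9083
# Hadoop config that registers the alluxio:// filesystem implementation
hive.config.resources=/etc/presto/core-site.xml
```

For alluxio:// paths to resolve, the referenced core-site.xml must map `fs.alluxio.impl` to `alluxio.hadoop.FileSystem`, and the Alluxio client jar must be on Presto's classpath.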
Apart from the above components, we require a ZooKeeper quorum, which is needed to make the Alluxio masters and Presto coordinators highly available (HA). For complete documentation on the setup, please refer here.
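On the Alluxio side, HA boils down to pointing every master and worker at the ZooKeeper quorum and giving the masters a shared journal so a standby can take over. A minimal sketch, assuming a three-node quorum reachable as `zk1`–`zk3` and an S3 journal location (both are illustrative, not values from our repo):

```properties
# conf/alluxio-site.properties on every Alluxio master and worker
alluxio.zookeeper.enabled=true
alluxio.zookeeper.address=zk1:2181,zk2:2181,zk3:2181
# Masters must share a journal location so a standby can take over on failover
alluxio.master.journal.folder=s3://my-bucket/alluxio-journal/
```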
Time for Some Action
Now that we have set up Presto on top of Alluxio, how do we make it available for everyone to use? One answer is a tool like Metabase, which provides connectivity to Presto. We just need to add the appropriate connection settings, and it works like a charm for all sorts of analysis.
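Once connected, Metabase (or the Presto CLI, or any JDBC client) issues plain SQL against the Hive catalog, and Presto reads the backing data through Alluxio's cache. A hypothetical query for illustration (the table and column names are made up):

```sql
-- Runs against the "hive" catalog; table and columns are illustrative
SELECT event_date, count(*) AS events
FROM hive.default.page_views
GROUP BY event_date
ORDER BY event_date;
```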
Presto and Alluxio also provide web UIs to track the current state of the cluster, which helps a lot.
The next focus will be mostly on making the solution self-serve through a user interface and making it auto-scalable (possibly via deployment on Kubernetes).
from DZone Cloud Zone