Behind the Screens: Big Data Technologies at Google, Facebook, Microsoft, and Zoho

Arun Kumar Dave
Feb 5, 2024

For industry leaders such as Google, Facebook, Microsoft, Zoho, Amazon, and others, harnessing the power of big data is not just a strategic advantage but a necessity for driving innovation and maintaining a competitive edge in a data-driven world.

In this article, we delve into the intricate workings of big data technologies deployed by these tech giants, exploring their data infrastructure, processing frameworks, and analytical tools that enable them to extract actionable insights, enhance user experiences, and deliver cutting-edge products and services to millions of users worldwide.

1. Presto by Facebook

In the contemporary landscape of big data, the volume, velocity, and variety of data continue to grow exponentially, and the demand for efficient and flexible querying tools has surged. In response to this challenge, Presto emerges as a powerful solution — an open-source Massively Parallel Processing (MPP) SQL query engine developed at Facebook.

Presto represents a paradigm shift in the realm of data analytics, offering organizations the ability to process large datasets quickly and efficiently. Developed internally at Facebook in 2013, Presto has since gained widespread adoption among leading companies such as Uber, Netflix, Airbnb, Bloomberg, and LinkedIn. Its versatility and scalability have led to the emergence of commercial offerings from organizations like Qubole, Treasure Data, and Starburst Data. Furthermore, the adoption of Presto by Amazon Athena, a prominent interactive querying service, underscores its significance in the industry.

Presto Architecture

Within Facebook’s operational framework, Presto runs across numerous clusters, supporting a wide range of use cases, each with distinct requirements and challenges (a minimal querying sketch follows this list):

  • Interactive Analytics
  • Batch ETL
  • A/B Testing
  • Developer/Advertiser Analytics
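
To make the interactive-analytics case concrete, here is a minimal sketch of issuing a query through the open-source presto-python-client. The coordinator host, catalog, schema, and table names are hypothetical placeholders, not Facebook’s actual deployment.

```python
# Querying a Presto cluster with the open-source presto-python-client
# (pip install presto-python-client). All names below are placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # hypothetical coordinator host
    port=8080,
    user="analyst",
    catalog="hive",    # the connector/catalog to query
    schema="web",      # hypothetical schema
)
cur = conn.cursor()

# A typical interactive-analytics query: top countries by daily active users.
cur.execute("""
    SELECT country, COUNT(DISTINCT user_id) AS dau
    FROM page_views
    WHERE ds = DATE '2024-02-01'
    GROUP BY country
    ORDER BY dau DESC
    LIMIT 10
""")
for country, dau in cur.fetchall():
    print(country, dau)
```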

With its adaptability, flexibility, and extensibility, Presto has found a niche within the crowded SQL-on-Big-Data space. Its architecture and design, optimized for performance and scalability, enable organizations to tackle complex analytical challenges with confidence. As Presto continues to evolve, driven by the contributions of its vibrant open-source community, its relevance and impact in the world of data analytics are set to keep growing.

In conclusion, Presto has emerged as a transformative force in the realm of big data analytics. Developed at Facebook to address the evolving needs of data-driven organizations, Presto offers a versatile, high-performance SQL query engine capable of handling diverse use cases with ease. From interactive analytics and batch ETL to A/B testing and developer/advertiser analytics, Presto’s impact spans the entire spectrum of data analytics workflows.

Original White paper — https://scontent.fmaa1-4.fna.fbcdn.net/v/t39.8562-6/240861303_1946229012222045_8738935750973889667_n.pdf?_nc_cat=102&ccb=1-7&_nc_sid=e280be&_nc_ohc=xJn9vpQ3yUUAX9FDtDw&_nc_ht=scontent.fmaa1-4.fna&oh=00_AfDbH5dGfv9dE_-lWTZXxZBShgf2cPRUqm4h69gpFDa_5g&oe=65C2E0F7

2. Timestream DB by Amazon

Amazon Timestream represents a paradigm shift in time series data management, leveraging cutting-edge technology to provide a scalable and efficient solution for ingesting, storing, and querying time series data. Unlike traditional databases that struggle to handle the scale and velocity of time series data, Timestream is purpose-built to address these challenges, offering a serverless architecture that eliminates the need for upfront resource provisioning. This scalability and flexibility make Timestream an ideal choice for a wide range of use cases across industries such as application monitoring, DevOps, IoT, and more.
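
To see the serverless ingestion path in practice, the sketch below writes a single metric record through boto3’s Timestream Write API; no capacity is provisioned beforehand. The database and table names are assumed for illustration.

```python
# Writing one time series record to Amazon Timestream with boto3.
# Database and table names are placeholders; Timestream provisions
# and scales the underlying resources automatically.
import time
import boto3

write = boto3.client("timestream-write", region_name="us-east-1")

write.write_records(
    DatabaseName="monitoring",   # hypothetical database
    TableName="cpu_metrics",     # hypothetical table
    Records=[{
        "Dimensions": [
            {"Name": "host", "Value": "web-01"},
            {"Name": "region", "Value": "us-east-1"},
        ],
        "MeasureName": "cpu_utilization",
        "MeasureValue": "72.5",
        "MeasureValueType": "DOUBLE",
        "Time": str(int(time.time() * 1000)),  # epoch milliseconds
    }],
)
```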

Performance and Scalability

A hallmark of Amazon Timestream is its exceptional performance and scalability, capable of handling massive volumes of time series data with ease. Timestream seamlessly scales to ingestion volumes exceeding 250 MB/s and accommodates tables holding petabytes of data. Its decoupled scaling of ingestion, storage, and query resources allows each to grow independently. Moreover, Timestream dynamically allocates resources based on query complexity and data volume, so query latency grows far more slowly than the underlying data volume.
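
On the query side, here is a minimal boto3 sketch using Timestream’s SQL dialect; the table and measure names continue the illustrative example above.

```python
# Querying Timestream with its SQL dialect via boto3. Timestream
# sizes query resources automatically based on complexity and volume.
import boto3

query = boto3.client("timestream-query", region_name="us-east-1")

resp = query.query(QueryString="""
    SELECT host, AVG(measure_value::double) AS avg_cpu
    FROM "monitoring"."cpu_metrics"
    WHERE measure_name = 'cpu_utilization'
      AND time > ago(1h)
    GROUP BY host
""")
for row in resp["Rows"]:
    print([col.get("ScalarValue") for col in row["Data"]])
```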

Performance measurements across various scale points have demonstrated Timestream’s ability to execute hundreds of concurrent queries, analyzing data across thousands of devices, within milliseconds. Even when managing petabytes of time series data, Timestream delivers consistent performance, ensuring that users can derive insights from their data in real-time.

Original White paper — https://aws.amazon.com/blogs/database/deriving-real-time-insights-over-petabytes-of-time-series-data-with-amazon-timestream/

3. Google Fusion Tables

The Big Data landscape often overlooks the needs of non-technical data experts who lack resources and expertise for large-scale analytics but aim to tell compelling stories through data visualization. Google Fusion Tables (GFT) fills this gap by offering collaborative data management in the cloud, emphasizing ease of use and support for interactive visualizations like maps. GFT’s maps have gained traction among journalists, being featured in high-profile articles published by reputable news sources.

Architecture Overview

GFT enables users to store tables of up to 100MB and supports SQL queries with low latency, emphasizing map visualizations over large and complex geospatial datasets. When a user opens a map, the browser initiates tile requests to Google backend servers, and GFT’s backend processes these requests by identifying visible items, evaluating filters, and customizing presentations. Spatial indexing and in-memory column stores ensure efficient query processing and fast response times.
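
GFT’s backend is not public, but the tile requests described above follow the standard Web Mercator addressing scheme used by slippy maps. Below is a sketch of the conventional math that maps a latitude/longitude at a given zoom level to the (x, y) tile a browser would request.

```python
# Standard Web Mercator tile addressing (not GFT-specific code):
# which (x, y) tile contains a lat/lng point at a given zoom level?
import math

def latlng_to_tile(lat: float, lng: float, zoom: int) -> tuple[int, int]:
    n = 2 ** zoom                       # tiles per axis at this zoom
    x = int((lng + 180.0) / 360.0 * n)  # longitude -> tile column
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y

# Example: the tile covering Mountain View, CA at zoom level 12.
print(latlng_to_tile(37.39, -122.08, 12))
```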

[Figure: sample dataset]

Scaling to Large Datasets: GFT focuses on delivering interactive maps with fast response times, even for large datasets. To achieve this, it utilizes an in-memory column store for efficient query processing and spatial sampling to limit tile response sizes. Effort-based response caching reduces CPU load and ensures low latencies, particularly under heavy request rates. These optimizations enable GFT to support maps with millions of features while maintaining interactivity.
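
As a rough illustration of spatial sampling (not GFT’s actual algorithm), the sketch below caps the number of features returned per tile; the hash-based ranking is an assumed detail that keeps the chosen sample stable across repeated requests.

```python
# Cap per-tile response size by returning at most max_per_tile features.
# Hash-based ranking makes the sample deterministic across requests.
import hashlib

def sample_tile_features(features, max_per_tile=500):
    def rank(f):
        return int(hashlib.md5(f["id"].encode()).hexdigest(), 16)
    return sorted(features, key=rank)[:max_per_tile]
```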

Scaling to Massive and Complex Polygon Datasets: Supporting maps with complex line and polygon geometries presents additional challenges. GFT implements line and polygon simplification algorithms to reduce rendering time and optimize response sizes. Effort-based response caching further enhances performance, particularly for maps with intricate geometries. Pre-computing covers for large polygon datasets minimizes delays and ensures fast response times, even for highly complex maps.
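
The paper’s exact simplification method is not reproduced here, but the classic Douglas-Peucker algorithm captures the idea: recursively drop vertices that deviate from the simplified line by less than a tolerance.

```python
# Douglas-Peucker line simplification (illustrative; GFT's actual
# algorithm may differ). points is a list of (x, y) tuples.
import math

def simplify(points, tolerance):
    if len(points) < 3:
        return points

    def perp_dist(p, a, b):
        # Perpendicular distance from p to the line through a and b.
        (ax, ay), (bx, by), (px, py) = a, b, p
        num = abs((bx - ax) * (ay - py) - (ax - px) * (by - ay))
        den = math.hypot(bx - ax, by - ay)
        return num / den if den else math.hypot(px - ax, py - ay)

    # Find the vertex farthest from the chord between the endpoints.
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perp_dist(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i

    if dmax <= tolerance:
        return [points[0], points[-1]]        # chord is close enough
    left = simplify(points[:index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right                  # avoid duplicating the pivot
```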

Scaling by Merging: Merged tables in GFT allow users to combine disparate datasets, leading to significant resource savings and improved performance. Queries over merged tables are rewritten to leverage the spatial index of base tables, reducing memory footprint and cache miss rates. Merged tables facilitate collaboration and data integration, enabling organizations to create comprehensive datasets from multiple sources. They also simplify the management of dynamic updates, making them ideal for scenarios involving frequently updated datasets.

Original White paper — https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/39959.pdf

4. Hadoop on Microsoft Cloud

Hadoop and Microsoft Cloud are reshaping the landscape of data management and analysis. In today’s digital era, organizations across industries grapple with vast volumes of data, much of it semi-structured or unstructured. Traditional databases and data processing methods are often inadequate for this influx, which has driven the emergence of big data analytics (BDA) as a way to extract meaningful insights from diverse datasets.

Hadoop, an open-source framework, has emerged as a fundamental tool in the realm of BDA. Its distributed architecture offers a scalable solution for storing and processing large datasets across clusters of commodity hardware. At the core of Hadoop is the Hadoop Distributed File System (HDFS), which efficiently stores data by dividing it into blocks and distributing it across multiple servers. This approach not only addresses the capacity needs of modern data storage but also enables faster data retrieval through concurrent computation using the MapReduce functional programming framework.
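
The MapReduce model is easiest to see in the canonical word-count example, sketched below as a pair of Hadoop Streaming scripts (file and path names are illustrative).

```python
# mapper.py - the "map" half of word count under Hadoop Streaming:
# read raw lines from stdin, emit tab-separated (word, 1) pairs.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py - the "reduce" half: Hadoop Streaming delivers mapper
# output sorted by key, so equal words arrive contiguously.
import sys

current, count = None, 0
for line in sys.stdin:
    word, _, value = line.rstrip("\n").partition("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")
```

A job like this is typically launched with the hadoop-streaming JAR, e.g. hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out.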

Furthermore, Azure provides Databricks, an Apache Spark-based analytics service, which supports various programming languages and machine learning libraries. Databricks enables organizations to build and deploy advanced analytics solutions using Spark’s distributed processing capabilities. With support for languages like Scala, Python, Java, SQL, and R, as well as machine learning frameworks like TensorFlow and PyTorch, Databricks empowers data scientists and analysts to leverage their preferred tools and libraries for building predictive models and performing advanced analytics tasks.
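
Below is a minimal PySpark sketch of the kind of distributed aggregation Databricks runs; the file path and column names are placeholders, and on Databricks itself a ready-made spark session is provided in every notebook.

```python
# Distributed aggregation with PySpark, the engine behind Databricks.
# Path and column names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sales-summary").getOrCreate()

df = spark.read.csv("/mnt/data/sales.csv", header=True)
summary = (
    df.groupBy("region")
      .agg(F.sum(F.col("amount").cast("double")).alias("total_sales"))
      .orderBy(F.col("total_sales").desc())
)
summary.show()
```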

The future of big data analytics lies in the continued evolution and innovation of technologies like Hadoop and Microsoft Cloud. As organizations generate and collect increasingly large volumes of data, the need for scalable, flexible, and cost-effective analytics solutions will only continue to grow. By harnessing the power of Hadoop and Microsoft Cloud, organizations can unlock the full potential of their data and drive innovation across industries.

In conclusion, the utilization of Hadoop and Microsoft Cloud in big data analytics represents a transformative approach to data management and analysis. These technologies empower organizations to store, analyze, and derive insights from large datasets, ultimately enhancing decision-making, driving innovation, and fostering economic growth. As technology continues to evolve, the role of Hadoop and Microsoft Cloud in BDA is expected to expand, further revolutionizing the way organizations harness data for competitive advantage.

Original White paper — https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3855257

5. Zoho’s Hadoop-Powered File Systems

Zoho Corporation, a renowned provider of cloud-based software solutions, leverages cutting-edge technologies to power its internal software infrastructure. Among these technologies, the utilization of the Hadoop Distributed File System (HDFS) stands out as a cornerstone of Zoho’s data management strategy. Referred to internally as “Zoho File Systems,” this implementation of HDFS plays a pivotal role in facilitating the storage, processing, and analysis of vast amounts of data across Zoho’s diverse range of software offerings.

Understanding ZFS (Zoho File Systems):

Zoho File Systems, built on top of Hadoop Distributed File System (HDFS), serves as the backbone of Zoho’s internal software architecture. At its core, HDFS is a distributed file system designed to store and manage large datasets across clusters of commodity hardware. Similarly, Zoho File Systems harnesses the scalability, fault tolerance, and performance capabilities of HDFS to cater to the extensive data requirements of Zoho’s software ecosystem.

Architecture and Components:

The architecture of Zoho File Systems closely mirrors that of HDFS, consisting of master and slave nodes that collaborate to store and process data. A central “Namenode” component serves as the master server, maintaining metadata and coordinating data access, while multiple “Datanodes” function as slave nodes, storing data blocks and executing data processing tasks. This architecture ensures high availability, fault tolerance, and scalability, enabling Zoho to handle massive volumes of data across its software portfolio.

Data Management and Processing:

The primary function of Zoho File Systems is to facilitate efficient data management and processing within Zoho’s software ecosystem. Data stored in Zoho File Systems is partitioned into blocks and replicated across multiple nodes to ensure fault tolerance and data redundancy. This distributed storage approach enables Zoho to store massive datasets while mitigating the risk of data loss due to hardware failures.
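
Zoho’s internal tooling is not public, but the block-and-replica behavior described above is what the standard HDFS shell exposes. Here is a sketch with placeholder paths.

```python
# Standard HDFS shell commands (wrapped in Python) illustrating the
# block replication described above. Paths are placeholders.
import subprocess

def run(cmd):
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Upload a file; HDFS splits it into blocks (128 MB by default) and
# replicates each block (3 copies by default) across Datanodes.
run(["hdfs", "dfs", "-put", "events.log", "/data/events.log"])

# Raise the replication factor for a hot dataset and wait for it to apply.
run(["hdfs", "dfs", "-setrep", "-w", "4", "/data/events.log"])

# Inspect how blocks and replicas are laid out across the cluster.
run(["hdfs", "fsck", "/data/events.log", "-files", "-blocks", "-locations"])
```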

Scalability and Performance:

Scalability is a key requirement for Zoho’s software infrastructure, given the exponential growth of data generated by its user base. Zoho File Systems, built on HDFS, is inherently scalable, allowing Zoho to expand its storage capacity and processing capabilities seamlessly as data volumes increase. Whether it’s accommodating new customers, scaling up existing applications, or introducing new features, Zoho File Systems provide the flexibility and scalability needed to support Zoho’s evolving business needs.

Security and Reliability:

Data security and reliability are paramount concerns for Zoho Corporation, given the sensitive nature of the information handled by its software applications. Zoho File Systems incorporate robust security measures to safeguard data integrity, confidentiality, and availability. Access controls, encryption, and auditing mechanisms are implemented to protect data assets from unauthorized access and malicious threats.

Integration with Zoho Software:

The data storage, processing, and querying layer integrates with various internal software applications developed by Zoho. One key consumer is “Apptics,” Zoho’s mobile application analytics platform, which relies on it to manage data efficiently. Data is stored on HDFS in columnar tables split across multiple file chunks, which are processed to deliver the desired output with robust performance and reliability.
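
As a generic illustration (Zoho’s actual on-disk format is not disclosed), reading such a chunked columnar layout could look like the pyarrow sketch below, which assumes Parquet files under a hypothetical path.

```python
# Reading a columnar table stored as multiple chunk files. Parquet
# and the path are assumptions for illustration only.
import pyarrow.parquet as pq

# A "table" materialized as a directory of chunk files, e.g.
#   /zfs/apptics/events/chunk-0000.parquet, chunk-0001.parquet, ...
# Columnar layout means only the requested columns are read from disk.
table = pq.read_table("/zfs/apptics/events",
                      columns=["app_id", "event_name", "ts"])
print(table.num_rows, "rows")
```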

In conclusion, Zoho File Systems, powered by Hadoop Distributed File System (HDFS), play a vital role in enabling the storage, processing, and analysis of data within Zoho’s internal software ecosystem.

Curated by

Arun Kumar Dave — Big Data Engineer, Zoho Apptics

With the help of,

Samuthira Pandian Muniyandi — Manager, Zoho Apptics (Server)

Shanmuga Sundaram — Manager, Zoho DFS (Server)
