Impulse Data Warehousing and OLAP Solution Outperforms Google BigQuery by 3x
We benchmarked Accure’s Impulse data warehousing and OLAP solution and compared its price-performance with that of Google BigQuery (GBQ). On average, Impulse outperformed GBQ by 3x across all queries executed on both platforms against the same dataset (described below). For the same query performance, Impulse runs on a cluster that costs about 10% of GBQ.
Data size: 100GB
Method: Star Schema Benchmark (SSB), a dataset and query set used since 2007 to evaluate the performance of data warehouses
Average query performance: 6.5 seconds on Impulse and 20 seconds on BigQuery
Price performance: Impulse infrastructure costs 10% of BigQuery
We followed these steps to compare the performance of the Impulse data warehouse against BigQuery:
- Prepared the SSB data using a publicly available data gen tool, https://github.com/lemire/StarSchemaBenchmark.
- The data was loaded using the Impulse scalable ETL platform. We configured Impulse to store data in HDFS in Parquet format.
- Impulse was configured to load the data into the data warehouse at the end of the ETL process.
- We uploaded the Parquet dataset to a Google Cloud Storage (GCS) bucket and created tables in BigQuery. The data was then loaded into BigQuery from the GCS bucket.
- We executed the 13 queries that were provided in the original SSB paper, https://www.cs.umb.edu/~poneil/StarSchemaB.pdf.
- All 13 queries were executed on both Impulse and BigQuery, and the query response times were recorded. Each query was run 5 times and the results were aggregated.
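The repeat-and-aggregate step above can be sketched as follows. This is a minimal illustration, not the harness used for this benchmark: `run_query` is a hypothetical callable standing in for whatever client submits SQL to Impulse or BigQuery, and using the arithmetic mean is an assumption, since the text says only that the results were aggregated.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(queries: Dict[str, str],
              run_query: Callable[[str], object],
              repeats: int = 5) -> Dict[str, float]:
    """Return the mean wall-clock time in seconds for each query ID."""
    results = {}
    for query_id, sql in queries.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(sql)  # hypothetical: submit the SQL and wait for the result
            timings.append(time.perf_counter() - start)
        # Assumption: aggregate the repeated runs with the arithmetic mean.
        results[query_id] = statistics.mean(timings)
    return results
```

Passing the same `queries` mapping with a different `run_query` for each platform keeps the measurement logic identical on both sides.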
Google BigQuery offers two pricing options: on-demand and flat-rate. For this benchmark, we took the most cost-effective pricing that BigQuery offers: the flat rate of $1,700 per month for a one-year commitment of 100 reserved slots.
Impulse was installed on Amazon AWS EC2 instances running Ubuntu 20.04. The cluster configuration, machine types and monthly costs are described below.
We used a publicly available free tool, https://github.com/lemire/StarSchemaBenchmark, to generate an SSB-compliant dataset.
The datagen tool created the following tables:
The data files are pipe (“|”) delimited and follow the schema shown in Figure 1 below.
Figure 1: SSB Schema. Source: https://www.cs.umb.edu/~poneil/StarSchemaB.pdf
The pipe-delimited data was ingested using Impulse’s “Delimited File” ingester with Parquet selected as the output format. Impulse stores all data on an HDFS cluster. Impulse was configured to load the data into the data warehouse, which in turn was configured with HDFS as its deep storage system. The data was partitioned by month.
The Parquet data was manually copied to the Google Cloud Storage bucket, from where it was read and loaded into BigQuery tables.
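To make the delimited format and the monthly partitioning concrete, here is a minimal sketch in plain Python. It is not the Impulse ingester and it does not write Parquet. It assumes that the partition key comes from an order date stored as YYYYMMDD in the sixth field, as in the SSB LINEORDER table; the text does not say which column was used. The function name and the `date_index` parameter are hypothetical.

```python
import csv
import io
from collections import defaultdict


def partition_by_month(text, date_index=5):
    """Group pipe-delimited rows by a YYYYMM key taken from a YYYYMMDD field."""
    partitions = defaultdict(list)
    reader = csv.reader(io.StringIO(text), delimiter="|")
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Tolerate a trailing delimiter, which yields one empty final field.
        if row[-1] == "":
            row = row[:-1]
        # Assumption: the date field looks like "19960102"; keep "199601".
        month_key = row[date_index][:6]
        partitions[month_key].append(row)
    return dict(partitions)
```

Each key in the returned dictionary corresponds to one monthly partition, so a query filtered on a date range only needs to touch the matching months.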
Impulse Data Warehousing Cluster Details
Impulse consists of several components for end-to-end machine learning, enterprise automation, big data analytics and visualization. For the purpose of this benchmarking we installed only those components that were needed to conduct this exercise and also to keep the AWS server cost down. We utilized the following AWS machine types to create the cluster:
Query Execution Performance Result
Table 1 shows the average query response times when the Star Schema queries were executed on Impulse and BigQuery. The Query ID is the same as the query ID used in the original paper, https://www.cs.umb.edu/~poneil/StarSchemaB.pdf. Averaged across all 13 queries, the response time was 6.5 seconds on Impulse compared to 20 seconds on BigQuery. In other words, Impulse was on average about 3 times faster than BigQuery.
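The 3x figure follows directly from the two averages quoted above. A short check, using only those two numbers:

```python
impulse_seconds = 6.5    # average response time on Impulse
bigquery_seconds = 20.0  # average response time on BigQuery

speedup = bigquery_seconds / impulse_seconds
print(f"Impulse speedup over BigQuery: {speedup:.2f}x")  # 3.08x
```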
Table 1: Star Schema Query response times on Impulse and BigQuery
We also performed concurrency tests to compare the Impulse cluster cost with the BigQuery cost. A concurrency of 2 means that two users access the same dataset at the same time. In our experiment, we ran the concurrency test for a single user and extrapolated the results to up to 64 concurrent query executions running 24x7.
Table 2 below shows the price comparison of the two platforms. We used the Google Cloud Console to record the slot utilization time and the number of slots. For pricing, we took the most cost-effective offer that BigQuery has: $1,700 per month per batch of 100 slots with a one-year commitment.
On average, each data node of the Impulse server costs $449.28 per month for on-demand AWS instances. For long-term committed usage, however, we used a rounded price of $200 per month.
Table 2 and Figure 3 indicate that the Impulse server price is about 10% of the BigQuery monthly cost. For example, for 64 concurrent users, the monthly price of BigQuery is $85,000 compared to $9,400 on Impulse.
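The flat-rate pricing rule can be written as a small function. This is a sketch under one assumption not stated in the text: that slots can only be purchased in whole batches of 100, so the required slot count is rounded up. The function and parameter names are hypothetical.

```python
import math


def bigquery_flat_rate_monthly_cost(slots_needed,
                                    batch_size=100,
                                    batch_price_usd=1700):
    """Monthly flat-rate cost in USD for the given number of slots."""
    # Assumption: slots are bought in whole batches, so round up.
    batches = math.ceil(slots_needed / batch_size)
    return batches * batch_price_usd


print(bigquery_flat_rate_monthly_cost(100))  # 1700
print(bigquery_flat_rate_monthly_cost(250))  # 5100
```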
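As a quick check of the "about 10%" claim, the ratio for the 64-user example can be computed from the two monthly prices quoted above:

```python
bigquery_monthly_usd = 85_000  # BigQuery, 64 concurrent users
impulse_monthly_usd = 9_400    # Impulse, 64 concurrent users

ratio = impulse_monthly_usd / bigquery_monthly_usd
difference = bigquery_monthly_usd - impulse_monthly_usd

print(f"Impulse cost as a share of BigQuery: {ratio:.1%}")  # 11.1%
print(f"Monthly difference: ${difference:,}")               # $75,600
```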
Table 2: BigQuery flat-rate monthly cost (for 1-year committed usage of 100-slot batches) and Impulse server monthly cost (with 3-year committed usage on AWS)
Figure 3: Price-performance comparison of executing Star Schema queries at varying concurrency on BigQuery and Impulse.
Impulse Data Warehouse Features
Impulse Data Warehouse is built on technology that offers scale, speed, and consistent query response under concurrency. Like BigQuery, it uses a column-based storage system.
While BigQuery is a 100% cloud-based platform with shared resources, Impulse can be deployed on any Linux-based hardware: on-premises, in a private data center, or on cloud virtual machines.
Key features of Impulse data warehousing technology are:
- Blazing fast and scalable database for ad hoc analytics and online analytical processing (OLAP)
- Fully integrated ETL that ingests data from a wide variety of sources and formats, transforms it, and builds an automated data pipeline to load and index data for interactive and ad hoc queries.
- Fully integrated web-based visualization and BI engine to create interactive dashboards.
- JDBC and RESTful APIs are available to connect with third-party BI tools.
Accure provides software platforms for data engineers, scientists, analysts, and automation engineers to efficiently solve machine learning problems and automate business processes.
Accure provides products and professional services to prototype, build, deploy and scale enterprise AI.
Accure engineered Impulse to accelerate all phases of AI development. Accure’s professional services help connect all the pieces to build sustainable solutions, so that customers can focus on deriving value from their AI implementation.