Impulse Data Warehousing and OLAP Solution Outperforms Google BigQuery by 3x

Benchmark Summary

Data size: 100GB

Benchmarking Method

We followed these steps to compare the performance of the Impulse data warehouse against BigQuery:

  1. We prepared the Star Schema Benchmark (SSB) data using a publicly available data generation tool, https://github.com/lemire/StarSchemaBenchmark.
  2. The data was loaded using the Impulse scalable ETL platform. We configured Impulse to store the data in HDFS in Parquet format.
  3. Impulse was configured to load the data into the data warehouse at the end of the ETL process.
  4. We uploaded the Parquet dataset to a Google Cloud Storage (GCS) bucket, created tables in BigQuery, and loaded the data into BigQuery from the GCS bucket.
  5. We used the 13 queries defined in the original SSB paper, https://www.cs.umb.edu/~poneil/StarSchemaB.pdf.
  6. All 13 queries were executed on both Impulse and BigQuery, and the query response times were recorded. Each query was run 5 times and the results were aggregated.
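
The repeat-and-aggregate step can be sketched as a small timing harness. This is a minimal illustration, not the harness we actually used: the names `time_query` and `summarize` are hypothetical, the queries are stand-in functions rather than real Impulse or BigQuery calls, and it assumes the aggregation is a plain mean of the 5 runs.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[], object], repeats: int = 5) -> List[float]:
    """Run one query `repeats` times and return each wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # hypothetical callable: submits the query and waits for the result
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings_by_query: Dict[str, List[float]]) -> Dict[str, float]:
    """Aggregate repeated runs into one mean response time per query ID (assumed aggregation)."""
    return {qid: statistics.mean(times) for qid, times in timings_by_query.items()}


if __name__ == "__main__":
    # Stand-in workloads; a real run would call the warehouse client here instead.
    fake_queries = {
        "Q1.1": lambda: sum(range(10_000)),
        "Q1.2": lambda: sorted(range(10_000)),
    }
    results = {qid: time_query(fn) for qid, fn in fake_queries.items()}
    for qid, mean_s in summarize(results).items():
        print(f"{qid}: {mean_s:.6f} s (mean of {len(results[qid])} runs)")
```

Using the same harness for both systems keeps the measurement method identical, so only the engine behind `run_query` differs between the two sets of timings.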

Data Preparation

We used a free, publicly available tool, https://github.com/lemire/StarSchemaBenchmark, to generate an SSB-compliant dataset.

Impulse Data Warehousing Cluster Details

Impulse consists of several components for end-to-end machine learning, enterprise automation, big data analytics, and visualization. For this benchmark we installed only the components needed for the exercise, which also kept the AWS server cost down. We used the following AWS machine types to create the cluster:

Query Execution Performance Result

Table 1 shows the average query response times when the Star Schema queries were executed on Impulse and BigQuery. The Query ID matches the query ID used in the original paper, https://www.cs.umb.edu/~poneil/StarSchemaB.pdf. On average, all 13 queries together took 6.5 seconds on Impulse compared to 20 seconds on BigQuery. In other words, Impulse was on average about 3 times faster than BigQuery.
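
The 3x figure follows directly from the two totals quoted above. The short calculation below only restates that arithmetic; the variable names are ours, chosen for illustration.

```python
# Totals for all 13 queries, taken from the paragraph above.
impulse_total_s = 6.5
bigquery_total_s = 20.0

# Speedup: how many times longer BigQuery took than Impulse.
speedup = bigquery_total_s / impulse_total_s

# Equivalent view: the fraction of BigQuery's time that Impulse saves.
time_saved_fraction = 1 - impulse_total_s / bigquery_total_s

print(f"speedup: {speedup:.2f}x")                 # speedup: 3.08x
print(f"time saved: {time_saved_fraction:.1%}")   # time saved: 67.5%
```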

Price-Performance Comparison

We also performed concurrency tests to compare the Impulse cluster cost with the BigQuery cost. A concurrency of 2 means that two users are accessing the same dataset at the same time. For our experiment we ran the concurrency test for a single user and extrapolated the results up to 64 concurrent query executions running 24x7.
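
One way such an extrapolation could be set up is sketched below. It is an illustration under stated assumptions, not the exact model behind our numbers: it assumes a fixed-size cluster whose cost depends only on uptime, a pay-per-query service whose cost scales linearly with the number of queries run, and a 30-day month. The function names are hypothetical, and every numeric input in the example is a placeholder, not a measured value or a published price.

```python
HOURS_PER_MONTH = 24 * 30  # 24x7 operation, assuming a 30-day month


def cluster_monthly_cost(hourly_rate_usd: float) -> float:
    """Fixed-size cluster: cost depends on uptime, not on how many queries run."""
    return hourly_rate_usd * HOURS_PER_MONTH


def per_query_monthly_cost(
    concurrency: int, batch_seconds: float, cost_per_batch_usd: float
) -> float:
    """Pay-per-query service: cost grows linearly with the number of batches executed.

    Each concurrent user is assumed to run one batch of queries back to back,
    taking `batch_seconds` per batch, around the clock.
    """
    batches_per_user = HOURS_PER_MONTH * 3600 / batch_seconds
    return concurrency * batches_per_user * cost_per_batch_usd


if __name__ == "__main__":
    # Placeholder inputs for illustration only; substitute real measurements and prices.
    hourly_rate = 10.0      # hypothetical cluster cost per hour (USD)
    batch_seconds = 60.0    # hypothetical time between batch starts per user
    cost_per_batch = 0.01   # hypothetical pay-per-query cost of one batch (USD)

    fixed = cluster_monthly_cost(hourly_rate)
    for users in (1, 2, 4, 8, 16, 32, 64):
        variable = per_query_monthly_cost(users, batch_seconds, cost_per_batch)
        print(f"concurrency {users:>2}: cluster ${fixed:,.0f}/mo, per-query ${variable:,.0f}/mo")
```

The linear model ignores any slowdown under load, so it holds for the cluster only up to the concurrency the cluster can actually sustain.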

Impulse Data Warehouse Features

Impulse Data Warehouse is built on technology that offers scale, speed, and consistent query response times under concurrency. Like BigQuery, it uses a column-oriented storage system.

  1. Blazing fast and scalable database for ad hoc analytics and online analytical processing (OLAP)
  2. A fully integrated ETL platform that ingests data from a wide variety of sources and formats, transforms it, and builds an automated data pipeline to load and index the data for interactive and ad hoc queries.
  3. Fully integrated web-based visualization and BI engine to create interactive dashboards.
  4. JDBC and RESTful APIs for connecting to third-party BI tools.

Sam Ansari

CEO, author, inventor and thought leader in computer vision, machine learning, and AI. 4 US Patents.