Thursday, September 1, 2016

Digging into BigData with Google's BigQuery

Well, i was one of the speaker at Colombo Big Data Meetup which was held yesterday and i spoke about Google's bigquery. Hence i have decided to write a blog on that so that you could get benefited if you are a BigData Fan.

What is Big Data?

There are so many definitions for Big Data , let me explain what does it really mean? In the near feature, every object on this earth will be generating data including our body.We have been exposed to so much information everyday.In vast ocean of data, complete picture of where  we live where we go and what we say, its all been recorded and stored forever.More data allows us to see new , better different things.Data in the recent times have changed from stationary and static to fluid and dynamic.we rely a lot on data and thatch is  major part of any business.we live in a very exciting world  today, a world where technology is advancing at a staggering pace, a world data is exploding, tons of data being generated. 10 years before we were measuring data in mega bytes, today we are talking about data which is in petabyte size, may be in few years we are going to reach zetabyte era, that means the end of English alphabets.Does it means the end of Big Data? .No . If you have shared a photo or post or a tweet on any social media,You are one of them who is generating data, and you are doing it very rapidly.

More than 100 thousand tweets in 60 seconds are generated , more than 7 million posts have been posted on Facebook, before you read this sentence.So the data is generated faster  than you could ever think before.Big data and analysis has exploded recently but there is a barrier. That barrier is indeed it needs lot of money resources and time to setup the infrastructure.Also it needs skillful people to make it all happen. so google solves all these big query.Big query,One of the products of google cloud platform that allows us to easily work with big data.It is google's fully managed data analysis service offering in the cloud. It enables super fast analysis.Easily store and analyse big data in google infrastructure.

Lets get familiar with the components.
 Projects are going to be the top level item inside the google cloud platform. Project contains, users authentication billing information and that is where data sets are going to live.
 Data set is really a container for tables.Access controllers cannot be done on tables so that they are don  through projects and data sets.Project contains data sets, data sets contains tables.
 Tables where the data lives
 Jobs are going to be asynchronous process that run on the background to load, export and to execute large queries.

Lets see How to use google cloud platform for big data solution.Architecture is divided in to two workflows named data workflow and visualization workflow.We need to get our source data into big query using any ETL tool and pipe into google cloud storage. Extract it from source and de normalize it,Biquery likes less joins the better .we can use hadoop clusters running on computers to do many pre processing and transforming data.once its in bigquery its all about visualization. Most of the use cases are log analysis which is used to analyse application behavior and user behavior in order to improve the system.Retail forecast - the more data, business has the more accurately they can predict product sales for the next month,that allows them to plan better. Lets see how we can use big query to analyse lots of data in very short time they handle the infrastructure and we can just simply focus on getting our data and analyse it.

Google handles Big Data every second of every day to provide services like Search, YouTube, Gmail and Google Docs.Can you imagine how Google handles this kind of Big Data during daily operations? How they are doing it?

As an example, let’s consider the following SQL query, which requests the Wikipedia® content titles that includes numeric characters in it:

select count(*) from publicdata:samples.wikipedia where REGEXP_MATCH
(title, ‘[0-9]*’) AND wp_namespace = 0;

Notice the following:
• This “wikipedia” table holds all the change history records on Wikipedia’s article content and consists of 314 millions of rows – that’s 35.7GB.

• The expression REGEXP_MATCH(title, ‘[0-9]+’) means it executes a regular expression matching on title of each change history record to extract rows that includes numeric characters in its title (e.g. “United States presidential election, 2015”).
• Most importantly, note that there was no index or any pre-aggregated values
for this table prepared in advance.

Dremel can even execute a complex regular expression text matching on ahuge logging table that consists of about 35 billion rows and 20 TB, in merely tens of seconds. This is the power of Dremel; it has super high scalability and most of the time it returns results within seconds or tens of seconds no matter how big the queried data set is.

Two core technologies which gives Dremel this performance:

1. Columnar Storage. Data is stored in a columnar storage fashion which
makes possible to achieve very high compression ratio and scan throughput.
2. Tree Architecture is used for dispatching queries and aggregating results
across thousands of machines in a few seconds.

Columnar Storage
Dremel stores data in its columnar storage, which means it separates a record into column values and stores each value on different storage volume, whereas
traditional databases normally store the whole record on one volume.

• Traffic minimization. Only required column values on each query are scanned and transferred on query execution. For example, a query “SELECT top(title) FROM foo” would access the title column values only. In case of the Wikipedia table example, the query would scan only 9.13GB out of 35.7GB.
• Higher compression ratio. One study  reports that columnar storage can  achieve a compression ratio of 1:10, whereas ordinary row-based storage can compress at roughly 1:3. Because each column would have similar values, especially if the cardinality of the column (variation of possible column values) is low, it’s easier to gain higher compression ratios than row-based storage. Columnar storage has the disadvantage of not working efficiently when updating existing records. In the case of Dremel, it simply doesn’t support any update operations.

Tree Architecture
One of the challenges Google had in designing Dremel was how to dispatch queries and collect results across tens of thousands of machines in a matter of seconds. The challenge was resolved by using the Tree architecture. The architecture forms a massively parallel distributed tree for pushing down a query to the tree and then aggregating the results from the leaves at a blazingly fast speed.

The tree architecture also enables multiple queries to run at once within the tree, which lets
different users share the same hardware.You might have heard of hadoop map reduce mechanism,
so what is the difference between mad reduce and bigquery

Bigquery can be integrated in to applications in so many ways, Following are the integrations supported by bigquery,

Rest API (SDK) 
  1. Google Spreadsheet
  2. Web application

Interfaces for query:
  1. Command Line Tool
  2. Bigquery UI

Connectors for excel

Tools for Big Data Solution:
As mentioned in the above architecture following tools can be used for managing the data ingest and visualization.

Tableau,BIME and DigIn for analysing and creating visualizations for various insights. Talend and SQLStream for the ingestion of data into bigquery from various data sources.

Nothing comes free, since google handles the infrastructure there is bit of a cost involved and the pricing goes as below.

Once you have decided to use bigquery there are certain things you need to know before using for optimizations and less cost.

Do not use queries that contains Select * , which is going to execute entire dataset and hence it will result in a high cost.
Since bigquery stores values in nested fields it is always better to use repeated fields.
Store in multiple tables as possible since it is recommended not to have JOINS
Bigquery also supports extensions such as ebq and dry run to encrypt the data and for executing the query to actually check how much resources that actual query is going to consume, which makes lot of developers and data analysts job easy.

I will be writing two separate blogs in the coming days on how to integrate with Bigquery and How to ingest the data into bigquery.

You can find the slides of the presentation from here