Saturday, December 31, 2016

Big Data Architecture - a walkthrough

Let me start this blog with a little example. Assume Sachin has a leak in a water pipe in his garden. He takes a bucket and some sealing material to fix the problem. After a while, he sees that the leak is much bigger than he thought and that he needs a specialist with bigger tools. Meanwhile, he still uses the bucket to drain the water. A little later, he notices that a massive underground stream has opened up and he now needs to handle millions of litres of water every second.

He doesn't just need new buckets, but a completely new approach to the problem, simply because the volume and velocity of the water have grown. To prevent the town from flooding, perhaps he needs his government to build a massive dam, which requires enormous civil engineering expertise and an elaborate control system. To make things worse, water is now gushing out from everywhere, and everyone is scared by the sheer variety of sources.

Welcome to Big Data.


Key elements of Big Data:

  1. There are over 600 million tweets flowing in every day, which illustrates the high Volume and Velocity.
  2. Next, we need to understand what each tweet means - where it is from, what kind of person is tweeting, whether it is trustworthy or not - which illustrates the high Variety.
  3. We then need to identify the sentiment - is this person talking negatively or positively about the iPhone? - which illustrates the high Complexity.
  4. Finally, we need a way to quantify the sentiment and track it in real time, which illustrates the high Variability (a toy sketch of such scoring follows this list).
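To make point 4 concrete, here is a deliberately naive, keyword-based sentiment scorer in Python. It is only a toy illustration of "quantifying" sentiment; the keyword lists and the sample tweets are my own assumptions, not part of any real pipeline.

# Toy sentiment scorer: counts positive vs. negative keywords in a tweet.
# The keyword lists below are illustrative assumptions, not a real lexicon.
POSITIVE = {"love", "great", "awesome", "amazing", "good"}
NEGATIVE = {"hate", "terrible", "awful", "bad", "broken"}

def sentiment_score(tweet):
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweets = [
    "I love the new iPhone camera, it is awesome",
    "My iPhone screen is broken again, terrible build quality",
]
for t in tweets:
    print(t, "->", sentiment_score(t))  # positive score > 0, negative score < 0

A real system would replace the keyword lists with a trained model and feed it from a streaming source, but the shape of the problem - score each incoming tweet and track the aggregate over time - is the same.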
Traditional architecture of any Big Data solution would look something like below,




Data is collected from various sources such as content management systems and software applications, and is then transferred into relational database management systems such as MSSQL, PostgreSQL and so on.
To analyse the collected data, the ETL step is usually performed on a single machine, and the necessary data is transferred to an OLAP data warehouse for analysis. The data that finally gets analysed is archived data, not live/real-time data - sometimes referred to as the "death of data". In the end, only about 10% of the data is ever analysed.
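For illustration, the single-machine ETL step described above boils down to something like the following Python sketch. The table and column names are made up, and SQLite stands in for the operational RDBMS and the warehouse; a real setup would read from MSSQL/PostgreSQL and load into an OLAP warehouse.

import sqlite3

# Extract: read raw orders from the operational store (SQLite stands in for MSSQL/PostgreSQL here).
src = sqlite3.connect("operational.db")
rows = src.execute("SELECT customer_id, amount, created_at FROM orders").fetchall()

# Transform: aggregate per customer on this one machine - the bottleneck the post talks about.
totals = {}
for customer_id, amount, created_at in rows:
    totals[customer_id] = totals.get(customer_id, 0) + amount

# Load: write the aggregate into the "warehouse" used for analysis.
dw = sqlite3.connect("warehouse.db")
dw.execute("CREATE TABLE IF NOT EXISTS customer_totals (customer_id INTEGER, total REAL)")
dw.executemany("INSERT INTO customer_totals VALUES (?, ?)", totals.items())
dw.commit()

Everything here happens on one machine, which is exactly what stops scaling once the volume and velocity grow.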


Big Data is data so massive that it cannot be stored and processed by a single machine. Having understood the basics of Big Data, it is important to learn how it can help enhance business opportunities. Given the massive volume of Big Data, it is logical that only some of it is of real importance. It is this small percentage of data which, when analysed and used in the right manner, can prove quite advantageous for promoting a business online.
A modern architecture:
I have transformed the above architecture to support large volumes of data with modern tools and technologies. A modern BI architecture can be cooked up using the recipe below:


Sunday, December 18, 2016

Am I really a developer or just a nethead?

It has been 6 years since I entered the programming field, and 18 years since I started using a computer. Everyone thinks I am a computer geek. Sometimes a voice in my head asks: am I really a developer, or just a good nethead?

It's because there has not been a single day I coded without using Google search and Stack Overflow.


My Experience

When I was 15, I wrote my first program. That was a long time ago now, and it was in Pascal. During my university days I was more interested in gaming and animation than in programming. When I got my first job, I struggled in the initial days to code in C# on the .NET platform. With the help of Google and Stack Overflow, I would now rate myself 8 out of 10 on C#, and I have experience with various open source technologies. But I still felt I was a better googler than a good programmer.

What made me think I am a really bad programmer?
(i) Choosing workarounds over doing the right thing.
(ii) Using Ctrl+C and Ctrl+V more than the normal keys.
(iii) When things went wrong, asking who was at fault rather than what the problem was.


Mistakes to avoid to become a better programmer
In 2016, I started to avoid the practices above, and I would say programming is the first step to solving problems using technology. The tips I followed during the year to become a better programmer are as follows:



  1. Every day, find a small challenge that can be done in an hour.
  2. Read code. There is a plethora of freely available code for applications, including tons of open source projects on GitHub.
  3. Build small projects to gain experience. Make them open source, and encourage collaboration if a project is compelling enough.
  4. Try programming for a day without googling. Continue it for two days, maybe a week. See how it feels.
  5. Go to meetups and workshops, and meet others who feel the same way you do about technology.


Anyone can become a good developer if they are passionate about it and practice a lot, preferably daily. "In order to remain at the same level you have to spend at least two hours daily programming." There will be many programmers out there who think the same. What do you think?



Monday, November 28, 2016

Complete Tutorial: Webpy + Apache + mod_wsgi on Ubuntu

There are plenty of tutorials and blog posts on how to configure a web.py application with Apache and mod_wsgi, but none of them worked for me. After two days of research I found a working setup and decided to write a blog post about it. I hope it will be useful for others.

In the future, I hope to update this post to also include a complete list of steps for getting set up with python's web.py over lighttpd.


1. Install web.py

1.1. Install webpy with apt-get

sudo apt-get install python-webpy

1.2. Install webpy using easy_install (python setuptools)

1.2.1. Install python setuptools (easy_install)

# 1.2.1.1. Using apt-get:

sudo apt-get install python-setuptools
# 1.2.1.2. Manually retrieving easy_install from the web using wget

wget http://peak.telecommunity.com/dist/ez_setup.py
sudo python ez_setup.py

# 1.2.2. Now get the web.py egg using python’s easy_install
# This will put the python ‘web’ module in your python environment path

sudo easy_install web.py

1.3. Install webpy straight from git

# Or, get webpy straight from git

git clone git://github.com/webpy/webpy.git
ln -s `pwd`/webpy/web .

2. Write Your Web.py App

Choose a directory where you would like your web.py python application to live. If my username is 'sajee' and I want to name my project 'project', I might make the directory /home/sajee/project.

2.1. Make a directory for your web.py app to live
# Replace the word project in the path below with your desired project name

mkdir ~/project
cd ~/project # move into the project directory you have created

2.2. Create your application file using web.py
# this will create our application file ~/project/main.py

touch main.py
2.3. Open your application with your favourite editor

# Substitute “emacs -nw” with an editor of your choice: vim, nano, etc

emacs -nw main.py

2.4. Paste the following in your app file and save

import os
import sys

import web

app_path = os.path.dirname(__file__)
sys.path.append(app_path)

if app_path: # Apache
    os.chdir(app_path)
else: # CherryPy
    app_path = os.getcwd()

urls = (
    '/(.*)', 'hello'
)

# WARNING
# web.debug = True and autoreload = True
# can mess up your session: I've personally experienced it
web.debug = False # You may wish to place this in a config file
app = web.application(urls, globals(), autoreload=False)
application = app.wsgifunc() # needed for running with apache as wsgi

class hello:
    def GET(self, name):
        if not name:
            name = 'World'
        return 'Hello, ' + name + '!'

if __name__ == "__main__":
    app.run()
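Before wiring this up to Apache, it is worth a quick local check with web.py's built-in CherryPy dev server (start it with "python main.py 8080" from the project directory). The snippet below is just a convenience check of mine; the port and the /Reader path are arbitrary choices, not part of the original setup.

# Quick smoke test against the dev server started with: python main.py 8080
# urllib2 is used because this tutorial targets Python 2 (python-webpy, easy_install).
import urllib2
print urllib2.urlopen("http://127.0.0.1:8080/Reader").read()   # expected output: Hello, Reader!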

2.5. (Optional) Set up a static directory for images, css, js, etc
# Having a static directory allows you to serve static content without
# your webpy application trying to steal focus and parse the request
# This is especially important using the default CherryPy server.
# We’ll also handle this case in our apache config within:
# /etc/apache2/sites-available

mkdir ~/project/static

3. Install Apache2

3.1. Install apache and wsgi dependencies

sudo aptitude install apache2 apache2.2-common apache2-mpm-prefork apache2-utils libexpat1 ssl-cert
# I like to also install python-dev (optional) to make sure I have
# python’s latest support files

sudo apt-get install python-dev
3.2. Install apache mod_wsgi and enable mod_wsgi + mod_rewrite

sudo aptitude install libapache2-mod-wsgi
sudo a2enmod wsgi; sudo a2enmod rewrite

4. Configure Apache2 With Your App

In the following steps, replace ‘project’ with the name of your project

4.1. Make Apache Directories for your project

sudo mkdir /var/www/project
sudo mkdir /var/www/project/logs
# production and public_html will be created as symlinks in the next step
4.2. Create Symlinks
Creating symlinks to your project files is an important convention: if there is a problem with one of your codebases, you can simply point the symlink at a stable codebase without having to modify your Apache configuration.

cd /var/www/project
sudo ln -s ~/project production
sudo ln -s ~/project/static public_html # If you created the static directory in step 2.5.
4.3. Replace your /etc/apache2/sites-available/default with:



<VirtualHost *:80>
ServerAdmin admin@project.com
DocumentRoot /var/www/project/public_html/
ErrorLog /var/www/project/logs/error.log
CustomLog /var/www/project/logs/access.log combined

WSGIScriptAlias / /var/www/project/production/main.py
Alias /static /var/www/project/public_html
AddType text/html .py
WSGIDaemonProcess www-data threads=15
WSGIProcessGroup www-data

<Directory /var/www/project/public_html>
Order deny,allow
Allow from all
Options +FollowSymLinks
Options -Indexes
</Directory>
</VirtualHost>




4.4. Change the group and owner of files requiring write access to Apache's www-data
Be careful in this step to change the group and owner only of directories or files that actually require write access.

sudo chgrp -R www-data /var/www/project/logs # example: logs need write access; adjust the path as required
sudo chown -R www-data /var/www/project/logs

5. Try to run!

sudo /etc/init.d/apache2 restart # Open your browser and visit http://localhost or http://127.0.0.1

You should see "Hello, World!" in the browser.

Saturday, November 26, 2016

I love Visual Studio Code



I've been a .NET developer since the beta days of .NET 3.0, but lately I find myself doing less and less .NET-related coding. However, Microsoft's new strategy has encouraged many developers, including me, to once again do some .NET work from time to time.
One of the highlights among the new tools is Visual Studio Code.


Sublime Text has been my favourite text editor all this time. When I downloaded VS Code and looked at it for the first time, my impression was of nothing more than a plain editor with very little added value. Before VS Code I had tried all the popular editors - Sublime, Atom, Brackets and so on. After using VS Code for a few weeks, I now feel developers have everything they need.

Some of the highlights of VS Code are as follows,


  • The default integrated Git support is really awesome.
  • Powerful debugging support.
  • Very smart code completion.
  • Support for a huge list of languages.
  • Multi-panel, side-by-side editing.
  • Always-on IntelliSense.
  • Peek information, such as Peek Definition.
  • Command Palette.




Choice of editor is a personal preference. If you like a lightweight, cross-platform IDE-like environment, you might enjoy VS Code. Give it a try - you will probably love it.

Tuesday, November 8, 2016

Angular Directives with D3

It's been exactly two years since I started learning Angular, and it's sad that I haven't written even a single blog post about it. I have finally decided to start a series on the topic. AngularJS is a JavaScript MVC framework that provides two-way data binding, integrates with web services, and helps build web components. There are plenty of blogs and tutorials covering the basics.

The product I am currently working on is a data visualization tool built on AngularJS, with many visualizations integrated using D3.js.

In this blog post, I will describe how to build a directive using D3.js and Angular.

Directives are a very powerful feature of AngularJS. A directive wires up easily with a controller and HTML, and handles the DOM manipulation.

Building a decomposition force-directed D3 directive:


 App.directive('forceGraph', function() {  
   return {  
     restrict: 'EA',  
     transclude: true,  
     scope: {  
       chartData: '='  
     },  
     controller: 'hierarchySummaryCtrl',  
     link: function(scope, elem, attrs) {  
       var svg;  
       elem.bind("mouseover", function(event) {  
         scope.svg = svg;  
         console.log("hierarchy svg", scope.svg);  
         scope.$apply();  
       });  
       scope.$watch('chartData', function(newValue, oldValue) {  
         if (newValue) {  
           scope.draw(newValue.data,newValue.id);  
         }  
       });  
       scope.draw = function(rootData,divID) {  
         var width = 400,  
           height = 320,  
           root;  
         var force = d3.layout.force()  
           .linkDistance(80)  
           .charge(-120)  
           .gravity(.05)  
           .size([width, height])  
           .on("tick", tick);  
         var divid = "#" + divID;  
         d3.select(divid).selectAll("*").remove();  
         svg = d3.select(divid)  
           .append("svg").attr("viewBox", "0 0 400 400")  
           .attr("width", '100%')  
           .attr("height", '100%');  
         var link = svg.selectAll(".link"),  
           node = svg.selectAll(".node");  
         root = rootData;  
         update();  
         console.log(svg);  
         scope.setSvg(svg[0][0].innerHTML);        
         function update() {  
           var nodes = flatten(root),  
             links = d3.layout.tree().links(nodes);  
           force.nodes(nodes)  
             .links(links)  
             .start();  
           // Update links.  
           link = link.data(links, function(d) {  
             return d.target.id;  
           });  
           link.exit().remove();  
           link.enter().insert("line", ".node")  
             .attr("class", "link");  
           // Update nodes.  
           node = node.data(nodes, function(d) {  
             return d.id;  
           });  
           node.exit().remove();  
           var nodeEnter = node.enter().append("g")  
             .attr("class", "node")  
             .on("click", click)  
             .call(force.drag);  
           nodeEnter.append("circle")  
             .attr("r", function(d) {  
               return Math.sqrt(d.size) / 5 || 4.5;  
             });  
           nodeEnter.append("text")  
             .attr("dy", ".25em")  
             .text(function(d) {  
               return d.name + ", Count: " + d.size;  
             });  
           node.select("circle")  
             .style("fill", color);  
         }  
         function tick() {  
           link.attr("x1", function(d) {  
               return d.source.x;  
             })  
             .attr("y1", function(d) {  
               return d.source.y;  
             })  
             .attr("x2", function(d) {  
               return d.target.x;  
             })  
             .attr("y2", function(d) {  
               return d.target.y;  
             });  
           node.attr("transform", function(d) {  
             return "translate(" + d.x + "," + d.y + ")";  
           });  
         }  
         function color(d) {  
           return d._children ? "#FFEB3B" // collapsed package  
             :  
             d.children ? "#F44336" // expanded package  
             :  
             "#D32F2F"; // leaf node  
         }  
         // Toggle children on click.  
         function click(d) {  
           if (d3.event.defaultPrevented) return; // ignore drag  
           if (d.children) {  
             d._children = d.children;  
             d.children = null;  
           } else {  
             d.children = d._children;  
             d._children = null;  
           }  
           update();  
         }  
         // Returns a list of all nodes under the root.  
         function flatten(root) {  
           var nodes = [],  
             i = 0;  
           function recurse(node) {  
             if (node.children) node.children.forEach(recurse);  
             if (!node.id) node.id = ++i;  
             nodes.push(node);  
           }  
           recurse(root);  
           return nodes;  
         }  
       };  
     }  
   };  
 });  
My repository with the sample

Thursday, September 1, 2016

Digging into BigData with Google's BigQuery

Well, I was one of the speakers at the Colombo Big Data Meetup held yesterday, where I spoke about Google's BigQuery. I have decided to write a blog post on the same topic, so that you can benefit from it if you are a Big Data fan.

What is Big Data?


There are so many definitions of Big Data, so let me explain what it really means. In the near future, every object on this earth will be generating data, including our bodies. We are exposed to so much information every day. In a vast ocean of data, a complete picture of where we live, where we go and what we say is being recorded and stored forever. More data allows us to see new, better and different things. Data has changed in recent times from stationary and static to fluid and dynamic. We rely a lot on data, and it is a major part of any business.

We live in a very exciting world today: a world where technology is advancing at a staggering pace, where data is exploding and tons of data are being generated. Ten years ago we were measuring data in megabytes; today we are talking about data in petabytes, and in a few years we may reach the zettabyte era - almost the end of the English alphabet. Does that mean the end of Big Data? No. If you have ever shared a photo, a post or a tweet on any social media, you are one of the people generating this data, and you are doing it very rapidly.



More than 100 thousand tweets and more than 7 million Facebook posts will have been generated in the 60 seconds before you finish reading this paragraph. Data is being generated faster than you could ever have imagined. Big Data and analytics have exploded recently, but there is a barrier: it takes a lot of money, resources and time to set up the infrastructure, and it needs skilled people to make it all happen. Google addresses these problems with BigQuery. BigQuery, one of the products of the Google Cloud Platform, allows us to easily work with Big Data. It is Google's fully managed data analysis service in the cloud. It enables super fast analysis and lets you easily store and analyse Big Data on Google's infrastructure.

Let's get familiar with the components (a short Python sketch for exploring them follows this list):
  • Projects are the top-level item inside the Google Cloud Platform. A project contains users, authentication and billing information, and is where datasets live.
  • Datasets are really containers for tables. Access control cannot be applied to individual tables, so it is done through projects and datasets. A project contains datasets, and datasets contain tables.
  • Tables are where the data lives.
  • Jobs are asynchronous processes that run in the background to load data, export data and execute large queries.
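As a rough illustration of that hierarchy, here is how you might walk a project's datasets and tables with the google-cloud-bigquery Python client. The client library, the credentials setup and the project ID are my assumptions; the meetup demo itself used the BigQuery web UI.

# Minimal sketch, assuming the google-cloud-bigquery library and credentials are configured.
from google.cloud import bigquery

client = bigquery.Client(project="my-sample-project")   # hypothetical project ID

for dataset in client.list_datasets():          # datasets live inside the project
    print("Dataset:", dataset.dataset_id)
    for table in client.list_tables(dataset):   # tables live inside datasets
        print("  Table:", table.table_id)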




Let's see how to use the Google Cloud Platform for a Big Data solution. The architecture is divided into two workflows: a data workflow and a visualization workflow. We need to get our source data into BigQuery using an ETL tool and pipe it through Google Cloud Storage: extract it from the source and denormalize it, because BigQuery prefers fewer joins. We can use Hadoop clusters to do much of the pre-processing and transformation. Once the data is in BigQuery, it is all about visualization. Common use cases include log analysis, which is used to analyse application and user behaviour in order to improve a system, and retail forecasting - the more data a business has, the more accurately it can predict product sales for the next month, which allows it to plan better. Let's see how we can use BigQuery to analyse lots of data in a very short time: Google handles the infrastructure, and we can simply focus on getting our data in and analysing it.



Google handles Big Data every second of every day to provide services like Search, YouTube, Gmail and Google Docs. Can you imagine how Google handles this kind of Big Data in its daily operations? How do they do it?

As an example, let's consider the following SQL query, which counts the Wikipedia® content titles that include numeric characters:

select count(*) from publicdata:samples.wikipedia
where REGEXP_MATCH(title, '[0-9]+') AND wp_namespace = 0;

Notice the following:
• This "wikipedia" table holds all the change history records for Wikipedia's article content and consists of about 314 million rows - that's 35.7GB.

• The expression REGEXP_MATCH(title, '[0-9]+') executes a regular expression match on the title of each change history record, to extract the rows whose titles include numeric characters (e.g. "United States presidential election, 2015").
• Most importantly, note that there was no index or any pre-aggregated values prepared in advance for this table. (A sketch for running this query from Python follows below.)
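For readers who prefer code to the web UI, something like the sketch below would run the same query from Python with the google-cloud-bigquery client. The client library, the credentials setup and the legacy-SQL flag are my assumptions; the talk itself demonstrated the query in the BigQuery UI.

# Minimal sketch, assuming google-cloud-bigquery is installed and credentials are configured.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
select count(*) from publicdata:samples.wikipedia
where REGEXP_MATCH(title, '[0-9]+') AND wp_namespace = 0
"""

# The table reference above uses the older project:dataset.table form, so run it as legacy SQL.
job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
for row in client.query(sql, job_config=job_config).result():
    print(row)   # a single row containing the count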

Dremel can execute even a complex regular expression text match on a huge logging table consisting of about 35 billion rows and 20 TB in merely tens of seconds. This is the power of Dremel: it has super high scalability, and most of the time it returns results within seconds or tens of seconds no matter how big the queried data set is.

Two core technologies give Dremel this performance:

1. Columnar storage. Data is stored in a columnar fashion, which makes it possible to achieve very high compression ratios and scan throughput.
2. Tree architecture, which is used for dispatching queries and aggregating results across thousands of machines in a few seconds.

Columnar Storage
Dremel stores data in columnar storage, meaning it separates a record into column values and stores each value on a different storage volume, whereas traditional databases normally store the whole record on one volume.

• Traffic minimization. Only the column values required by each query are scanned and transferred during query execution. For example, a query "SELECT top(title) FROM foo" would access the title column values only. In the case of the Wikipedia table example, the query scans only 9.13GB out of 35.7GB.
• Higher compression ratio. One study reports that columnar storage can achieve a compression ratio of about 1:10, whereas ordinary row-based storage compresses at roughly 1:3. Because each column holds similar values, especially if the cardinality of the column (the variation of possible values) is low, it is easier to gain higher compression ratios than with row-based storage. Columnar storage has the disadvantage of not working efficiently when updating existing records; in the case of Dremel, it simply doesn't support any update operations. A toy sketch of the row versus column layouts follows below.
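To make the row-versus-column distinction concrete, here is a toy Python sketch. The records and the layout are purely illustrative and are not how Dremel stores data internally.

# Toy illustration of row-oriented vs. column-oriented layouts.
records = [
    {"title": "Colombo", "wp_namespace": 0, "views": 120},
    {"title": "Kandy",   "wp_namespace": 0, "views": 80},
    {"title": "Galle",   "wp_namespace": 0, "views": 95},
]

# Row-oriented: each record is kept together, so every query touches every field.
row_store = [(r["title"], r["wp_namespace"], r["views"]) for r in records]

# Column-oriented: each column is kept together, so a query scans only the columns it needs,
# and runs of similar values (like wp_namespace) compress very well.
column_store = {
    "title":        [r["title"] for r in records],
    "wp_namespace": [r["wp_namespace"] for r in records],
    "views":        [r["views"] for r in records],
}

# A query over only the title column never touches the other columns in the columnar layout:
print(len(column_store["title"]))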

Tree Architecture
One of the challenges Google had in designing Dremel was how to dispatch queries and collect results across tens of thousands of machines in a matter of seconds. The challenge was resolved by using the Tree architecture. The architecture forms a massively parallel distributed tree for pushing down a query to the tree and then aggregating the results from the leaves at a blazingly fast speed.


The tree architecture also enables multiple queries to run at once within the tree, which lets different users share the same hardware. You might have heard of Hadoop's MapReduce mechanism, so what is the difference between MapReduce and BigQuery?


BigQuery can be integrated into applications in many ways. The following integration points are supported:

REST API (SDKs)
  1. Google Spreadsheets
  2. Web applications


Interfaces for querying:
  1. Command line tool
  2. BigQuery web UI


Connectors for Excel

Tools for a Big Data solution:
As shown in the architecture above, the following tools can be used for managing data ingestion and visualization.

Tableau, BIME and DigIn for analysing and creating visualizations for various insights; Talend and SQLstream for ingesting data into BigQuery from various data sources.



Nothing comes free: since Google handles the infrastructure, there is a bit of a cost involved, and the pricing is as below.

Once you have decided to use BigQuery, there are certain things you need to know in order to optimize queries and keep costs down (a dry-run sketch follows this list):

  • Do not use queries that contain SELECT *, which scans the entire table and therefore results in a high cost.
  • Since BigQuery stores values in nested fields, it is usually better to use repeated fields.
  • Store data in multiple tables where possible, since heavy JOINs are not recommended.
  • BigQuery also supports tooling such as ebq (encrypted BigQuery) for encrypting data, and a dry run option for checking how many resources a query would actually consume, which makes life a lot easier for developers and data analysts.
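As an illustration of the last two points, the sketch below uses the Python client's dry run option to compare the bytes a SELECT * would scan against a query that selects only the column it needs. The client library and the public table chosen here are my assumptions; the ebq tool is separate and not shown.

# Minimal sketch, assuming google-cloud-bigquery is installed and credentials are configured.
from google.cloud import bigquery

client = bigquery.Client()
dry_run = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

wasteful = "SELECT * FROM `bigquery-public-data.samples.wikipedia`"
frugal   = "SELECT title FROM `bigquery-public-data.samples.wikipedia`"

for sql in (wasteful, frugal):
    job = client.query(sql, job_config=dry_run)   # nothing executes, cost is only estimated
    print(sql)
    print("  would process", job.total_bytes_processed, "bytes")

Running the narrower query should report only a fraction of the bytes, which is exactly why avoiding SELECT * lowers the bill.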

I will be writing two separate blog posts in the coming days on how to integrate with BigQuery and how to ingest data into BigQuery.

You can find the slides of the presentation here







Sunday, May 1, 2016

Business Intelligence Solution for an Insurance Industry

Business Intelligence and its impact:

Hello! I am back, after what I realize was my first extended blog break in four years. This time it is something technical, very much relevant to the product I am currently working on. First of all, what is Business Intelligence? The processes, tools and infrastructure for generating insights from raw data are collectively called Business Intelligence.

Every organization generates data as part of its operations. Every organization also has access to some form of external data. But in order to analyze and take informed decisions, the data has to be processed and turned into information, and the information has to be presented in a meaningful way to be able to identify patterns or to see key performance indicators (KPIs), and thereby generate insights out of the information.          

This blog post mainly describes the importance of BI in the insurance industry, since for my final project I had to pick a domain where a BI solution can be applied and come up with a solution built from a collection of tools.

BI Solution for an Insurance Industry:

With rising globalization and growth, Business Intelligence has become important for many firms. BI solutions help these firms transform into dynamic enterprises through actionable intelligence. One important sector in the modern world is insurance. In terms of technology, insurance companies are generally not at the forefront, and their systems lag behind. I have taken this domain for my project, along with some open source and commercial Business Intelligence tools that could help this sector.

The organization I picked was HNB Assurance PLC, a leading Sri Lankan insurance corporation that provides life insurance solutions for Sri Lankan citizens.

The following are the analyses that I performed on the sample data I collected (a small sketch of the first one follows the list):

  • Classification of customer profiles using a decision tree, plotted on geo maps
  • Sentiment analysis of the company's Facebook page
  • Various insights into their insurance claims
  • Sales forecasting dashboard
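For the classification item, a scikit-learn decision tree along these lines is a reasonable starting point. The feature names, values and labels here are invented for illustration; the actual project used the collected sample data inside the BI tool rather than this exact code.

# Toy customer-profile classifier; feature values and segment labels are made up for illustration.
from sklearn.tree import DecisionTreeClassifier

# Features: [age, annual_income_in_lakhs, number_of_policies]
X = [[25, 4, 1], [40, 12, 3], [33, 7, 2], [55, 20, 4], [29, 5, 1]]
y = ["basic", "premium", "standard", "premium", "basic"]   # hypothetical customer segments

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

print(clf.predict([[38, 10, 2]]))   # predict the segment for a new customer profile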

The BI tool I used:

DigIn 

DigIn is an end-to-end analytics platform that lets you easily visualize your structured and unstructured data in one place. It also has on-demand data ingestion capabilities, and its in-memory caching allows anyone to access data from anywhere, at any time, from any device.

Conclusion: 

Use of a Business Intelligence and analytics tool is vital for any insurance company wanting to succeed in an increasingly competitive industry. The ability to turn large volumes of raw data into actionable insights represents a significant value proposition for these businesses. These insights can be priceless in terms of the opportunities they can unearth across the business, with the help of social media analytics. I hope the BI tool I have suggested for this organization turns out to be a wonder, not a blunder.

You can find the slides I used for my presentation here