Monday, 7 December 2015

Cassandra Installations on Windows Machine


Before installing Cassandra, install Java on Windows and set its environment variables (such as JAVA_HOME) as well.

Step 1)
Download the Cassandra tar file to any location on your Windows machine.
Use the download link on the Apache Cassandra download page to get the tar file.
If you want the newest version of Cassandra, click the latest release; otherwise pick the version you need from the Cassandra archives (see the section "Previous and Archived Cassandra Server Releases" on the same page).
Click the tar file and it will download to your Windows machine.

Step 2)
Go to the location of the downloaded tar file and extract it (using WinZip or 7-Zip).
Copy the extracted folder to any drive. In my case I am placing it in 'D:\<location>'.
Here <location> means your extracted Cassandra folder.

Step 3)
Set the Cassandra home path in the environment variables (see the screenshot below for where to set it).

CASSANDRA_HOME=D:\<location of Cassandra>
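
A quick way to confirm that the variable is set correctly is a small script like the one below. This is only a sketch in Python: the helper name check_cassandra_home is made up for illustration, and it assumes the extracted folder contains bin\cassandra.bat, as the Windows packages of Cassandra normally do.

```python
import os
from pathlib import Path


def check_cassandra_home(env=None):
    """Return the CASSANDRA_HOME path if it looks like a Cassandra install.

    Hypothetical helper: it only checks that the variable is set and that
    bin/cassandra.bat exists under it (an assumption about the layout).
    """
    env = os.environ if env is None else env
    home = env.get("CASSANDRA_HOME")
    if not home:
        raise RuntimeError("CASSANDRA_HOME is not set")
    launcher = Path(home) / "bin" / "cassandra.bat"
    if not launcher.is_file():
        raise RuntimeError(f"{launcher} not found; check CASSANDRA_HOME")
    return Path(home)
```

Calling check_cassandra_home() from a new command prompt either returns the install folder or raises an error that says what is missing. Remember to open a new prompt after changing environment variables, because already open prompts keep the old values.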

Step 4) 
We have to modify two settings in the conf/cassandra.yaml file of the Cassandra installation.
Open cassandra.yaml and search for these settings (commitlog_directory and data_file_directories):
              commitlog_directory: /var/lib/cassandra/commitlog
              Change the line to
              commitlog_directory: D:/<location of Cassandra>/commitlog
              Create the commitlog folder in the specified location (as mentioned above).
              data_file_directories:
                  - /var/lib/cassandra/data
              Change the entry to
              data_file_directories:
                  - D:/<location of Cassandra>/data
Create the data folder in the specified location (as mentioned above).
See the screenshot below for a better understanding.
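
If you prefer not to edit the file by hand, the same change can be scripted. The sketch below is in Python and makes several assumptions: the settings are named commitlog_directory and data_file_directories, they are not commented out in your copy of cassandra.yaml, and data_file_directories is a YAML list whose first entry should be replaced. The function names and the example folder D:/cassandra are hypothetical.

```python
from pathlib import Path


def point_cassandra_at(yaml_text, base_dir):
    """Return yaml_text with the commit log and first data directory moved under base_dir."""
    base = base_dir.rstrip("/")
    out = []
    in_data_dirs = False  # True while we are inside the data_file_directories list
    for line in yaml_text.splitlines():
        if line.startswith("commitlog_directory:"):
            line = f"commitlog_directory: {base}/commitlog"
            in_data_dirs = False
        elif line.startswith("data_file_directories:"):
            in_data_dirs = True
        elif in_data_dirs and line.lstrip().startswith("- "):
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}- {base}/data"
            in_data_dirs = False  # only the first list entry is rewritten
        elif line.strip():
            in_data_dirs = False  # any other non-empty line ends the list
        out.append(line)
    return "\n".join(out) + "\n"


def create_storage_dirs(base_dir):
    """Create the commitlog and data folders that the settings point to."""
    for name in ("commitlog", "data"):
        Path(base_dir, name).mkdir(parents=True, exist_ok=True)
```

A typical use would be to read conf/cassandra.yaml, pass its text through point_cassandra_at(text, "D:/cassandra"), write the result back, and then call create_storage_dirs("D:/cassandra"). Keep a backup copy of the original file before overwriting it.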





Step 5)
Once the four steps above are done:
Open a Windows command prompt -> switch to the bin folder under the Cassandra location -> start the Cassandra instance by running cassandra.bat
Then, in another terminal, run
cassandra-cli.bat to interact with Cassandra (cassandra-cli ships only with releases before 3.0; on newer releases use cqlsh instead)
See the screenshot for a better understanding.

If everything runs without errors, the installation was done properly.

Thursday, 3 December 2015

Cassandra Introduction

Cassandra:
 Apache Cassandra is a column-oriented NoSQL database for processing large amounts of data spread across multiple clusters and nodes. Cassandra can handle unstructured data, and it stores data as key-value pairs. It has some unique features compared to other data models.
 Features
·         Highly available service
·         No single point of failure
·         Linear scalability of performance
·         Easy data distribution across multiple data centers

Some differences in key features: RDBMS vs Cassandra
Feature | RDBMS | Cassandra
Type of data | Deals only with structured data | Deals with unstructured data
Schema | Fixed schema | Flexible schema that can be designed according to the data
Relationships | Through joins and foreign keys between tables | Represented through collections
Data storage | In tables, by rows and columns | In nested key-value pairs
Data model | Database -> tables | Keyspaces -> column families
Row representation | A row is an individual record in a table | A row is the unit of replication
Column representation | A column represents an attribute of a relation | A column is the unit of storage
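
To picture the "nested key-value pairs" idea, here is a rough sketch in Python that models keyspace -> column family -> row key -> column as nested dictionaries. This is only an analogy for the data model, not how Cassandra stores data on disk, and all names and values in it are hypothetical.

```python
# Hypothetical data: keyspace -> column family -> row key -> column name -> value
cluster = {
    "sample_keyspace": {
        "users": {
            "user1": {"name": "Asha", "city": "Pune"},
            "user2": {"name": "Ravi"},  # rows need not share the same columns
        }
    }
}


def read_column(data, keyspace, column_family, row_key, column):
    """Look up one column value; return None if any level is missing."""
    try:
        return data[keyspace][column_family][row_key][column]
    except KeyError:
        return None
```

Here read_column(cluster, "sample_keyspace", "users", "user1", "city") finds a value, while the same lookup for "user2" returns None because that row simply has no city column. This is the flexible schema from the table above in miniature.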

The top 5 most common use cases are: 
1. Internet of Things
Cassandra is a perfect fit for scaling time-series data from users, devices, and sensors.

2. Personalization
Use Cassandra to ingest and analyze data for custom, fast, low-cost, scalable user experiences.
3. Messaging
Cassandra’s original Facebook use case; storing, managing, and analyzing messages requires sophisticated systems and massive scale.
4. Fraud detection
Staying a step ahead of fraud has become best solved at the database level. Apache Cassandra lets you analyze patterns quickly, accurately, and effectively.
5. Playlist
Product catalogs; movie ratings; you name it. Storing a collection of user selected items has massive performance and availability demands. 


In the next article we are going to see how to work with keyspaces in Cassandra.

Cassandra Operations Part 1 (Working with Keyspaces)

Cassandra operations:
In this article we are going to create, alter and drop keyspaces in Cassandra. Keyspaces are like databases in Cassandra. Inside a keyspace we can create tables and then load data into them.
We are going to cover the following operations in this article: creating a keyspace, altering a keyspace and dropping a keyspace.

First we go to cqlsh mode and connect to the local Cassandra cluster as shown below.

Go to the Cassandra installation location, change to the bin folder and type ./cqlsh
See the screenshot below for a better understanding.



Creation of Keyspace

 Before creating a keyspace, we first check which keyspaces are present in Cassandra. For this we can use the DESC command (DESC KEYSPACES) to get the list of keyspaces.


From the above screenshot we can observe these steps:
1)      The first command lists the keyspaces present in Cassandra. As we can see, there are 5 keyspaces present as of now.
2)      Creation of the keyspace Sample_Cassandra, as shown in step 2, with the replication option.
In the replication option we have to mention the class name (the replication strategy) and the replication factor.
3)      Checking whether Sample_Cassandra was created, using the DESC command once again.
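
Since the screenshot may not be visible, here is a sketch of what the statements from these steps look like. It is written as a small Python helper that builds the CQL text to paste into cqlsh. The helper name is made up, and the strategy SimpleStrategy and replication factor 1 are assumptions, since the exact values from the screenshot are not repeated in the text.

```python
def create_keyspace_cql(name, strategy="SimpleStrategy", replication_factor=1):
    """Build a CQL CREATE KEYSPACE statement with the given replication options."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return (
        f"CREATE KEYSPACE {name} WITH replication = "
        f"{{'class': '{strategy}', 'replication_factor': {replication_factor}}};"
    )


# Steps 1 to 3 from the list above, as text to run in cqlsh
print("DESC KEYSPACES;")
print(create_keyspace_cql("Sample_Cassandra"))
print("DESC KEYSPACES;")
```

Note that CQL treats unquoted names as case-insensitive, so a keyspace created as Sample_Cassandra is listed as sample_cassandra by DESC.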



Altering Keyspace:

We can alter an existing keyspace with the ALTER command.
In this example we are altering the Sample_Cassandra keyspace that we created in the step above.

In the above screenshot we are altering the keyspace, setting its replication factor to 3.
We can check the modified schema by looking at the keyspace information, as mentioned below.
Cassandra's query language (CQL) is similar to SQL, so the commands look familiar.

To see the keyspace information we can execute the query below.
Query: SELECT * FROM system.schema_keyspaces;
(In Cassandra 3.0 and later the same information is in system_schema.keyspaces instead.)
From the screenshot below we can find the schema details of the keyspaces present in Cassandra.

In the above screenshot we can check the schemas of the keyspaces present in Cassandra. In total it returned 6 rows, each representing one keyspace.
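
The alter and verify steps can be written out as text too. The Python sketch below builds both statements. The function names are hypothetical and SimpleStrategy is an assumed replication class. The version check reflects that the keyspace listing lives in system.schema_keyspaces before Cassandra 3.0 and in system_schema.keyspaces from 3.0 onwards.

```python
def alter_replication_cql(name, replication_factor, strategy="SimpleStrategy"):
    """Build an ALTER KEYSPACE statement that changes the replication options."""
    return (
        f"ALTER KEYSPACE {name} WITH replication = "
        f"{{'class': '{strategy}', 'replication_factor': {replication_factor}}};"
    )


def keyspace_listing_cql(major_version):
    """Pick the schema table that lists keyspaces for a Cassandra major version."""
    if major_version >= 3:
        table = "system_schema.keyspaces"
    else:
        table = "system.schema_keyspaces"
    return f"SELECT * FROM {table};"


print(alter_replication_cql("Sample_Cassandra", 3))
print(keyspace_listing_cql(2))
```

On a single-node local install a replication factor of 3 is accepted by the ALTER statement, but there is only one node to hold the data, so treat the value here as a demonstration rather than a recommendation.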

Dropping Keyspace:

We can drop a keyspace present in Cassandra using the DROP command.

From the above screenshot we can observe the following:
1)      Dropping the keyspace Sample_Cassandra
2)      Checking whether the keyspace was dropped, using the DESC command



In the next article we will see table operations in Cassandra.

Thursday, 6 August 2015

Social Media Analytics using R + Hadoop

Social Media Analytics Using R + Hadoop (RHadoop):
This article is about an idea for doing analytics using RHadoop. In domains like biomedical research, analysis for educational institutions and statistical computing, we use R to find patterns, run predictive analysis and gain more insights from the data. If the data is limited and its usage is nominal, we can do those analyses with R alone. But think of scenarios where the data is going to be huge, on the order of petabytes.
Here I am plotting a diagram that shows how to integrate R with Hadoop for social media analytics.



                                          Fig (1):    R Hadoop with Social Media analytics

RHadoop Setup and Installation:-
--> Set up R on your system (R 3.1.3, the latest version at the time of writing) with the required packages that we work with. Check this for installation:
Refer --> "http://cran.r-project.org/bin/windows/base/"

--> Set up a Hadoop system as a single-node or multi-node cluster.
Refer --> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

RHadoop Use Cases

--> RHadoop has two kinds of use cases: one with streamed data (like social media sites and news feeds from different sources) and one with data that resides in standard traditional or NoSQL databases (like MongoDB).

For social media analytics using RHadoop we have the following setup:
--> Hadoop setup with R running on it
--> APIs to connect with different social media such as LinkedIn, Facebook and Twitter.
--> Packages that must be loaded in R (ROAuth, twitteR, RLinkedin, RCurl)

The key use case for streaming data looks like:
R <------> Twitter: fetching tweets, then slicing and dicing the fetched data.
R <------> LinkedIn: connecting with LinkedIn, getting data, then slicing and dicing it.
In a similar way we can work with Facebook and Instagram.
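
To make the "slice and dice" step a bit more concrete, here is a minimal sketch of the map and reduce phases that a Hadoop job could run over fetched tweets. It is written in plain Python rather than R, it does not use the RHadoop packages or any social media API, and the sample tweets and function names are hypothetical.

```python
from collections import Counter
from itertools import chain


def map_hashtags(tweet):
    """Map step: emit (hashtag, 1) for every hashtag in one tweet."""
    return [(word.lower(), 1) for word in tweet.split() if word.startswith("#")]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each hashtag."""
    totals = Counter()
    for tag, count in pairs:
        totals[tag] += count
    return dict(totals)


# Made-up sample data standing in for tweets fetched from the API
tweets = ["Learning #Hadoop with #R", "#hadoop jobs scale out", "No tags here"]
counts = reduce_counts(chain.from_iterable(map_hashtags(t) for t in tweets))
print(counts)
```

On a real cluster the map step would run in parallel on different blocks of the tweet data and the reduce step would combine the partial counts, which is the part that standalone R cannot do for very large data.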

The second use case looks like:
R <-----> MongoDB: fetching the documents, applying logic to the fetched documents and
performing the analytics.

As of now, standalone R has no support for parallel distributed processing.
But some distributions of R come with parallel processing support.


Wednesday, 5 August 2015

Big Data Core Indicators and Key Aspects

Big Data Core Indicators :



As we all talk about big data, the core indicators that come into the picture are the four V's:
Volume, Velocity, Variety and Veracity.
These V's are going to decide big data and its future. Technically, big data comes into the picture whenever an organization or company has to deal with any of these V's.



Key Aspects of Big data Platform

1. Integration -- The point is to have one platform to manage all of the data. Big data has to be bigger than just one technology.
2. Analytics -- A very important point. We see big data as a viable place to analyze and store data. The sophistication and accuracy of the analytics matter.
3. Visualization -- Need to bring big data to the users.
4. Development -- Need sophisticated development tools for the engines and across them to enable the market to develop analytic applications.
5. Workload optimization -- Improvements upon open source for efficient processing and storage.
6. Security and Governance -- Sensitive data needs to be protected, and retention policies need to be determined.


As technology advances day by day, the amount of data that business requirements deal with is also increasing. So big data analytics provides better and enhanced solutions to business problems in different industry verticals.

Big Data At Glance


The big data ecosystem can be confusing. The popularity of “big data” as industry buzzword has created a broad category. As Hadoop steamrolls through the industry, solutions from the business intelligence and data warehousing fields are also attracting the big data label. To confuse matters, Hadoop-based solutions such as Hive are at the same time evolving toward being a competitive data warehousing solution.
Understanding the nature of your big data problem is a helpful first step in evaluating potential solutions. Let’s remind ourselves of Big Data.

“Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it."

Big data problems vary in how heavily they weigh in on the axes of volume, velocity and variability. Predominantly structured yet large data, for example, may be most suited to an analytical database approach.                                                  



This survey makes the assumption that a data warehousing solution alone is not the answer to your problems, and concentrates on analyzing the commercial Hadoop ecosystem. We’ll focus on the solutions that incorporate storage and data processing, excluding those products which only sit above those layers, such as the visualization or analytical workbench software.

Getting started with Hadoop doesn’t require a large investment as the software is open source, and is also available instantly through the Amazon Web Services cloud. But for production environments, support, professional services and training are often required.

Monday, 26 January 2015

Pentaho with Real Time Data Analytics

Pentaho with Real Time Data Analytics :

The importance of big data is well recognized today, with implementations across every size and type of business. What has become apparent is that the real value of big data is not in the data itself, but in combining it with other relevant data from existing internal and external systems and sources. The need to blend data to derive maximum value will only escalate as new types and sources of data and information continue to emerge.

With Pentaho BI Analytics, you can easily create architected, blended views across both the traditional Call Detail Records in the warehouse and the network data. Just-in-time, architected blending delivers accurate big data analytics based on blended data. You can connect to, combine, and even transform data from any of the multiple data stores in your hybrid data ecosystem into blended views, then query the data directly via that view using the full spectrum of analytics in the Pentaho Analytics platform, including predictive analytics.


Examining a typical big data analytics workflow helps identify where many of these potential problems may occur, where special skill sets are required and where delays are introduced.

Common steps in the big data analytics workflow include data ingestion, manipulation, access, modeling and visualization.