
Thursday, 3 December 2015

Cassandra Introduction

Cassandra:
Apache Cassandra is a column-oriented NoSQL database for processing large amounts of data spread across multiple clusters and nodes. Cassandra handles unstructured data, storing it as nested key-value pairs, and it has some unique features compared to other data models.
Features:
· Highly available service
· No single point of failure
· Linearly scalable performance
· Easy data distribution across multiple data centers

Key feature differences: RDBMS vs Cassandra

Feature               | RDBMS                                        | Cassandra
----------------------|----------------------------------------------|--------------------------------------------------
Type of data          | Deals only with structured data              | Deals with unstructured data
Schema                | Fixed schema                                 | Flexible schema, designed according to the data
Relationships         | Through joins and foreign keys between tables| Represented through collections
Data storage          | Tables of rows and columns                   | Nested key-value pairs
Data model            | Database -> tables                           | Keyspace -> column families
Row representation    | A row is an individual record in a table     | A row is the unit of replication
Column representation | A column is an attribute of a relation       | A column is the unit of storage
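
To make the "nested key-value pairs" row above concrete, here is an illustrative sketch in plain R (the language used elsewhere on this blog); the keyspace, column family, row key, and values are all made up for the example:

# Illustrative only: picturing Cassandra's nested key-value layout
# with plain R lists (keyspace -> column family -> row key -> columns).
keyspace <- list(
  users = list(                          # column family
    "user:1001" = list(                  # row key
      name  = "ravi",
      email = "ravi@example.com",
      city  = "hyderabad"
    )
  )
)

# Fetch one column of one row, much like a Cassandra lookup by
# (column family, row key, column name):
keyspace$users[["user:1001"]]$email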

The top 5 most common use cases are:
1. Internet of Things
Cassandra is a perfect fit for scaling time-series data from users, devices, and sensors.
2. Personalization
Use Cassandra to ingest and analyze user data to deliver custom, fast, low-cost, scalable user experiences.
3. Messaging
Cassandra's original Facebook use case; storing, managing, and analyzing messages requires sophisticated systems and massive scale.
4. Fraud detection
Staying a step ahead of fraud is now best solved at the database level. Apache Cassandra lets you analyze patterns quickly, accurately, and effectively.
5. Playlists
Product catalogs, movie ratings, you name it: storing a collection of user-selected items has massive performance and availability demands.


In the next article we will see how to work with keyspaces in Cassandra.

Thursday, 6 August 2015

Social Media Analytics using R + Hadoop

Social Media Analytics Using R + Hadoop (RHadoop):
This article is about doing analytics with RHadoop. In domains like biomedical research, research and analysis for educational institutions, and statistical computing, we use R to find different patterns, run predictive analysis, and draw more insights from the data. If the data is limited and its volume is nominal, we can do those analyses with R alone. But think of scenarios where the data grows huge, into petabytes.
The diagram below shows how R combines with Hadoop for social media analytics.



                                          Fig 1: RHadoop with social media analytics

RHadoop Setup and Installation:
--> Set up R on your system (R 3.1.3 at the time of writing) with the required packages.
Refer --> http://cran.r-project.org/bin/windows/base/

--> Set up Hadoop as a single-node or multi-node cluster.
Refer --> http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

RHadoop Use Cases

--> Coming to the use cases of RHadoop, it shows up in two ways: one with streamed data (like social media sites and news feeds from different sources), and one with data that resides in standard traditional or NoSQL databases (like MongoDB).

For social media analytics using RHadoop we have the following setup:
--> A Hadoop setup with R running on it
--> APIs to connect with different social media such as LinkedIn, Facebook, and Twitter
--> Packages to be loaded in R: ROAuth, twitteR, RLinkedin, RCurl (installed first, as sketched below)
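
As a minimal sketch, installing and loading these looks like the following (the LinkedIn client is not on CRAN under this exact name everywhere, so it may need a separate install):

# Install once, then load, the packages used in the use cases below.
install.packages(c("ROAuth", "twitteR", "RCurl"))
library(ROAuth)
library(twitteR)
library(RCurl)
# The LinkedIn client (RLinkedin) may need a separate install
# from CRAN or GitHub, depending on your R version.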

A key use case for streaming data looks like:
R <------> Twitter: fetch tweets, then slice and dice the fetched data (see the sketch below).
R <------> LinkedIn: connect with LinkedIn, get data, and slice and dice it.
We can do the same with Facebook and Instagram.
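
A minimal sketch of the R <------> Twitter flow with the twitteR package; the API keys and the search query are placeholders you must supply from your own Twitter app:

# Authenticate against the Twitter API (keys below are placeholders).
library(twitteR)
setup_twitter_oauth(
  consumer_key    = "YOUR_CONSUMER_KEY",
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET"
)

# Fetch recent tweets for a sample query and convert to a data frame.
tweets <- searchTwitter("#bigdata", n = 100, lang = "en")
df <- twListToDF(tweets)

# Slice and dice: most active users in the sample.
head(sort(table(df$screenName), decreasing = TRUE))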

The second use case looks like:
R <------> MongoDB: fetch the documents, apply logic on the fetched documents, and perform the analytics (a sketch follows below).
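
A minimal sketch of the R <------> MongoDB flow using the mongolite package (one of several MongoDB clients for R); the database, collection, and field names here are hypothetical:

# Connect to a local MongoDB and pull documents into a data frame.
library(mongolite)
m <- mongo(collection = "tweets", db = "social",
           url = "mongodb://localhost:27017")

# Fetch only English-language documents (field names are hypothetical).
docs <- m$find('{"lang": "en"}')

# Apply some logic to the fetched documents:
# average retweet count per user.
aggregate(retweetCount ~ screenName, data = docs, FUN = mean)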

As of now, standalone R has no support for distributing work in parallel across a cluster. But some distributions, such as the RHadoop packages (rmr2), add that parallelism by running R map and reduce functions inside Hadoop MapReduce.
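
For completeness, a minimal sketch of that parallel path using the rmr2 package from the RHadoop project, assuming a working Hadoop cluster with rmr2 installed; it counts words with plain R map and reduce functions executed as a MapReduce job:

library(rmr2)

# Put a few sample lines into HDFS for the demo.
lines <- to.dfs(c("big data with r", "r and hadoop", "big data"))

# Word count: the map and reduce functions are plain R, but rmr2
# runs them in parallel across the cluster as a MapReduce job.
wc <- mapreduce(
  input  = lines,
  map    = function(k, v) keyval(unlist(strsplit(v, " ")), 1),
  reduce = function(word, counts) keyval(word, sum(counts))
)

from.dfs(wc)   # pull the (word, count) pairs back into R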


Wednesday, 5 August 2015

Big Data at a Glance

Big Data at a Glance

The big data ecosystem can be confusing. The popularity of “big data” as an industry buzzword has created a broad category. As Hadoop steamrolls through the industry, solutions from the business intelligence and data warehousing fields are also attracting the big data label. To confuse matters, Hadoop-based solutions such as Hive are at the same time evolving toward being competitive data warehousing solutions.
Understanding the nature of your big data problem is a helpful first step in evaluating potential solutions. Let’s remind ourselves what big data is.

“Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it."

Big data problems vary in how heavily they weigh in on the axes of volume, velocity and variability. Predominantly structured yet large data, for example, may be most suited to an analytical database approach.                                                  



This survey makes the assumption that a data warehousing solution alone is not the answer to your problems, and concentrates on analyzing the commercial Hadoop ecosystem. We’ll focus on the solutions that incorporate storage and data processing, excluding those products which only sit above those layers, such as the visualization or analytical workbench software.

Getting started with Hadoop doesn’t require a large investment as the software is open source, and is also available instantly through the Amazon Web Services cloud. But for production environments, support, professional services and training are often required.