Saturday, January 26, 2013

Jumpstart to Big Data and Cloud

These guidelines and approaches are aimed at first-time users of Big Data (the Hadoop ecosystem) and the cloud. "Cloud" here refers to public cloud services such as Amazon Web Services (AWS).

Big Data - Go with the Hadoop distribution

Start with a Hadoop distribution such as Cloudera or Hortonworks. These distributions come with all the required dependencies and packages. You also get an admin console to check and manage your Hadoop cluster. Avoid setting up a Hadoop cluster on your own unless absolutely necessary. For development on a Windows machine, install a VM player and run the Cloudera distribution on top of it.

Big Data - Use MapReduce Programs only when necessary

Do your development in Pig or Hive for your Big Data MapReduce problems. You can handle most Big Data use cases with these tools. Don't write your own MapReduce code unless you have a very specific use case that these tools cannot handle.

Big Data - Use Cloud to scale

Get your code running on a local distribution (cluster) such as Cloudera first. The local cluster can be of any size; use it to develop and test your code. When you need to scale, look at cloud options such as Amazon EMR. The important point to remember is that your Big Data code for data crunching, recommendations, etc. will not change much when you run it on a cloud service like EMR. If you read the documentation and follow the steps, you can set up your code in the cloud quickly.

Big Data - More handy tools

There are many handy tools and connectors available in this space. Always search for existing options before you start your own development. For example, as part of your Big Data problem you may need to export data to a database. You can use a tool such as Sqoop for this instead of writing your own.

Big Data - Understand the Use Cases

Assuming you have a large data set, understand all the use cases where Big Data fits in your application.

I have listed some use cases here for your understanding:

1) Counting and Grouping

Example: Grouping users based on some criteria, counting the most visited pages, etc.

2) Filter the data set using some parameters.

Example: Get the list of users who accessed your app via iPhone.

3) Process and filter the data, and combine it with other data sources like an RDBMS or MongoDB

Example: Get the list of users who accessed your app via iPhone and fetch those users' profile details from MongoDB.
Build the message with these details and put an entry into the email delivery table of some datastore.

4) Analytics, Recommendation algorithm

Example: Finding related items and recommending them to users, as Amazon and other sites do. You can build Big Data analytics solutions using Hadoop.

5) Data Pipeline ETL

You can crunch a large data set, transform it, and store it in a data store for analytics.
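The counting, grouping, and filtering use cases above can be sketched in plain Python as a map step followed by a reduce step. This is only a small in-memory illustration of the idea, not Hadoop code; the sample records and the field names (`user`, `page`, `device`) are made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical access-log records; the field names are assumptions for this sketch.
LOGS = [
    {"user": "asha", "page": "/home", "device": "iphone"},
    {"user": "ben", "page": "/home", "device": "android"},
    {"user": "asha", "page": "/cart", "device": "iphone"},
    {"user": "chen", "page": "/home", "device": "iphone"},
]


def map_page(record):
    """Map step: emit a (page, 1) pair for each record."""
    return record["page"], 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def users_by_device(records):
    """Grouping: collect the set of users seen on each device."""
    groups = defaultdict(set)
    for record in records:
        groups[record["device"]].add(record["user"])
    return groups


page_counts = reduce_counts(map_page(r) for r in LOGS)
most_visited = page_counts.most_common(1)[0]
iphone_users = sorted(users_by_device(LOGS)["iphone"])

print(most_visited)  # ('/home', 3)
print(iphone_users)  # ['asha', 'chen']
```

In Pig or Hive the same logic would typically be a grouping with a count plus a filter on the device column, which is why those tools cover most of these cases without hand-written MapReduce.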

Analytics, Machine Learning problems are not entirely dependent on Hadoop/Big Data

Machine learning libraries like Mahout have specific distributed algorithms that can leverage Hadoop/Big Data to process huge data sets, but these problems can be solved without Big Data/Hadoop as well. So learn and understand the algorithms and check which one suits your need, without worrying about the Hadoop part at first.
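As a rough illustration that a recommendation algorithm can be understood without Hadoop, here is a minimal item co-occurrence sketch in plain Python. It is a simple stand-in technique chosen for this example, not Mahout's implementation, and the baskets and item names are made up.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase baskets; the item names are assumptions for this sketch.
BASKETS = [
    ["book", "lamp", "pen"],
    ["book", "pen"],
    ["lamp", "desk"],
    ["book", "pen", "desk"],
]


def cooccurrence(baskets):
    """Count how often each pair of items appears in the same basket."""
    counts = defaultdict(lambda: defaultdict(int))
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def related_items(counts, item, top_n=2):
    """Return the items most often seen together with `item`."""
    neighbours = counts.get(item, {})
    # Highest count first; ties are broken alphabetically for a stable order.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


counts = cooccurrence(BASKETS)
print(related_items(counts, "book"))  # ['pen', 'desk']
```

Once the data no longer fits on one machine, the same counting step is the part that a distributed library would spread across a Hadoop cluster.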

Cloud - Just start using it

Cloud providers like AWS have abstracted away the underlying virtualization and made it easy for end users. Read the documentation and best practices, and you can start using the cloud without worrying much about the underlying implementation. A lot of information is available in this space; don't keep consuming it without making progress on your product or project goal. You will learn more and gain better insight as you start using it.

Cloud - More offerings apart from Scaling and Cost

Cloud services like Amazon and Rackspace are not only useful for scaling your app or reducing cost. Explore their ecosystems, check the various services they provide (storage, cache, CDN, etc.), and leverage them in your app.

Cloud - Things Fail. Don't investigate. Have a fail-over strategy

If an instance or virtual machine fails, you can hardly ever restore it. Have a good failover strategy (load balancers, backups, etc.).
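A minimal sketch of the failover idea at the application level: try each backend in turn and move on when one fails, instead of trying to repair it. The backend functions here are hypothetical stand-ins for instances behind a load balancer; a real setup would rely on the provider's load balancer and health checks.

```python
class AllBackendsFailed(Exception):
    """Raised when no backend could serve the request."""


def call_with_failover(backends, request):
    """Try each backend in order and return the first successful response."""
    errors = []
    for backend in backends:
        try:
            return backend(request)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise AllBackendsFailed(errors)


# Hypothetical backends standing in for two instances.
def failed_instance(request):
    raise ConnectionError("instance is gone")


def healthy_instance(request):
    return "handled " + request


result = call_with_failover([failed_instance, healthy_instance], "GET /home")
print(result)  # handled GET /home
```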

Cloud - Application Abstraction Layer

Have an abstraction layer in your application wherever possible when you use cloud services like cache, CDN, etc. All your code should interact with this abstraction layer, and the real implementation should be hidden behind it. That way only minimal changes are needed if you switch to a different cloud provider in the future.
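One way to picture such an abstraction layer is sketched below for a cache. The names `Cache`, `InMemoryCache`, and `ProfileService` are hypothetical and chosen only for this example; a provider-specific class would implement the same interface and could be swapped in without touching the application code.

```python
from abc import ABC, abstractmethod


class Cache(ABC):
    """Abstraction layer: the only cache API the application sees."""

    @abstractmethod
    def get(self, key):
        """Return the cached value, or None if the key is absent."""

    @abstractmethod
    def put(self, key, value):
        """Store a value under the key."""


class InMemoryCache(Cache):
    """Local implementation, handy for development and tests."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


class ProfileService:
    """Application code: depends only on the Cache interface."""

    def __init__(self, cache, load_profile):
        self._cache = cache
        self._load_profile = load_profile

    def profile(self, user_id):
        cached = self._cache.get(user_id)
        if cached is not None:
            return cached
        profile = self._load_profile(user_id)
        self._cache.put(user_id, profile)
        return profile


loads = []  # records every trip to the (hypothetical) database


def load_from_database(user_id):
    loads.append(user_id)
    return {"id": user_id, "name": "user-" + str(user_id)}


service = ProfileService(InMemoryCache(), load_from_database)
first = service.profile(7)
second = service.profile(7)  # served from the cache, no second load
print(first == second, len(loads))  # True 1
```

Switching providers then means writing one new `Cache` subclass and changing the line that constructs `ProfileService`.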

Cloud - Collect Metrics and Analyze

Collect the essential metrics and analyze how your application performs in the cloud. Amazon has an excellent service, CloudWatch, for this. Even if you follow best practices, it is better to measure how your application actually performs in the cloud and do the necessary tuning.
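Before wiring up a monitoring service, it can help to see what "collecting a metric" means in code. The sketch below is a small in-process collector written for this example; the `Metrics` class is hypothetical and does not call CloudWatch, whose client API is not shown here.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Hypothetical in-process metrics collector."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, name, value):
        """Store one sample for the named metric."""
        self._samples[name].append(value)

    @contextmanager
    def timed(self, name):
        """Record how long the wrapped block took, in seconds."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(name, time.perf_counter() - start)

    def summary(self, name):
        """Return count, average, and maximum for the named metric."""
        values = self._samples[name]
        if not values:
            return {"count": 0}
        return {
            "count": len(values),
            "avg": sum(values) / len(values),
            "max": max(values),
        }


metrics = Metrics()
for latency_ms in (10, 20, 30):
    metrics.record("request_latency_ms", latency_ms)

with metrics.timed("report_seconds"):
    sum(range(1000))  # stand-in for real work

print(metrics.summary("request_latency_ms"))
```

Summaries like these are what you would compare before and after a tuning change.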

