Monday, April 17, 2017

Infrastructure of Data Science and Analytics

Data Science apparently is now becoming a trend.  It basically a collection of methods and paradigms in querying, analysing, processing and visualizing big data.  Oil companies, business intelligence and marketing, social media, Artificial Intelligence, etc. want that.

Definition
Big Data: Extremely large data sets that may be analyzed computationally to revel patterns, trends, and associations, especially relating to human behavior an interactions.

  • Cluster: a collection of computers working together in parallel (distributed processing) and in the same local network and using the similar hardware
  • Grid: Similar to grid, but computers are geographically spread out and use more heterogeneous hardware.  
  • Hadoop: a Java-based programming framework that supports the processing and storage of extremely large data sets by using MapReduce framework.  It's open source under Apache license.
  • MapReduce, a core component of the Apache's Hadoop Software Framework.  Originated from Google Research paper with the same name.
  • Hiv
  • Pig: a high-level platform for creating programs that run on Apache Hadoop.  It consists of high-level language called Pig Latin abstracting underlying Java on Hadoop.
  • SQL
  • Full-stack
  • MIKE2.0
  • Groovy
  • Tez