What is Big Data, and what are the key terms under it?
Big data, in simple terms, refers to the non-traditional strategies used to handle data sets that are too large or complex for conventional tools.
Main characteristics:
1. Volume - Scale of the data
2. Velocity - Real-time speed at which data arrives & must be processed
3. Variety - Formats (structured, semi-structured, unstructured)
4. Veracity - Accuracy (complexity in processing)
5. Variability - Variation in quality (additional resources needed to improve the data)
6. Value - Reducing complexity to surface value
Life cycle:
Ingesting data: Taking raw data and adding it to the system.
a) Data ingestion tools - Apache Sqoop can take data from relational databases
b) Projects - Apache Flume & Apache Chukwa are designed to aggregate & import application & server logs
c) Queuing systems - Apache Kafka can act as an interface between data generators & the big data system (see the sketch after this list)
d) Ingestion frameworks - Apache Gobblin can aggregate & normalize the output of the above tools at the end of the ingestion pipeline
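For item (c), here is a minimal sketch of pushing events onto a queue with the kafka-python client; the broker address, topic name, and event fields are assumptions for illustration, not part of any particular setup.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Assumed broker address and topic name; adjust for your cluster.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A data generator (e.g. a web app) pushes events onto the queue;
# the big data system consumes them from the "events" topic downstream.
producer.send("events", {"user_id": 42, "action": "page_view"})
producer.flush()
```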
Persisting data: Persistent storage keeps data available even when the power is off, as a hard disk does. The sheer volume, the availability requirements & the distributed compute make this storage more complex than usual.
Hence the need to leverage a distributed file system like Apache Hadoop's HDFS, which allows large quantities of data to be written across multiple nodes in a cluster. Other distributed file systems include Ceph & GlusterFS.
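A minimal sketch of writing a file to HDFS from Python via pyarrow; the namenode host, port, and file path are assumptions, and the client machine needs Hadoop's native libhdfs library configured.

```python
from pyarrow import fs  # pip install pyarrow; requires libhdfs on the client

# Assumed namenode address; adjust for your cluster.
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# Write a small file; HDFS splits files into blocks and replicates
# those blocks across multiple nodes in the cluster.
with hdfs.open_output_stream("/data/raw/events.jsonl") as f:
    f.write(b'{"user_id": 42, "action": "page_view"}\n')
```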
Computing and analyzing data: This layer depends on the type of insights required. Data is often processed repeatedly, either multiple times by a single tool or by a number of tools to surface different insights.
Batch processing is one such method: the data set is split into pieces, the pieces are handed to different machines, and the outputs are then calculated and assembled into the final result. These steps are referred to as splitting, mapping, shuffling, reducing & assembling, or collectively as map-reduce. This is the strategy Apache Hadoop's MapReduce uses for very large data sets.
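To make the split/map/shuffle/reduce steps concrete, here is a toy word count in plain Python; it only illustrates the idea, whereas Hadoop MapReduce runs the same steps in parallel across many machines.

```python
from collections import defaultdict

documents = ["big data needs big tools", "data tools for big data"]

# Split: each document is a piece that would be handed to a different worker.
# Map: emit a (word, 1) pair for every word in a piece.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the pairs by key so all counts for a word end up together.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce & assemble: combine each group's values into the final result.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)  # {'big': 3, 'data': 3, 'needs': 1, 'tools': 2, 'for': 1}
```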
Other workloads require more real-time processing. These work in memory, avoiding writes back to disk along the way. Apache Storm, Apache Flink, and Apache Spark provide different ways of achieving real-time processing, a strategy suited to working on small pieces of data at a time.
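As a sketch of the real-time approach, here is a minimal PySpark Structured Streaming word count; the socket source, host, and port (e.g. fed by `nc -lk 9999`) are assumptions for illustration only.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

# Read an unbounded stream of text lines from a local socket (assumed source).
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each line into words and keep a running count, updated in memory
# as new data arrives instead of re-reading everything from disk.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the updated counts to the console after each micro-batch.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```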
Another way of computing is to use tools that plug into the above frameworks and provide an additional interface, e.g.:
Apache Hive: Data warehouse for Hadoop
Apache Pig: High-level querying interface
Apache Drill, Impala, Spark SQL & Presto: SQL-like interactions can be done using these projects (see the sketch after this list)
Apache SystemML, Mahout, Spark's MLlib: For machine learning projects
R & Python: Analytics programming
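As an example of the SQL-like interfaces mentioned in the list above, here is a small Spark SQL sketch in PySpark; the table name and log records are made up for illustration, and in practice the data would come from HDFS, Hive, or another store.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-logs").getOrCreate()

# Hypothetical log records; real data would be loaded from the storage layer.
logs = spark.createDataFrame(
    [("2024-01-01", "ERROR"), ("2024-01-01", "INFO"), ("2024-01-02", "ERROR")],
    ["day", "level"],
)
logs.createOrReplaceTempView("logs")

# Query the processing framework with plain SQL instead of low-level code.
spark.sql(
    "SELECT day, COUNT(*) AS errors FROM logs WHERE level = 'ERROR' GROUP BY day"
).show()
```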
Visualizing the results:
Prometheus: Processes data streams as a time-series database and visualizes them.
Elastic stack: Logstash for data collection, Elasticsearch for indexing data & Kibana for visualization.
Silk: Apache Solr for indexing & a Kibana fork called Banana for visualization.
Another visualization technology used for interactive data science work is the data "notebook", especially in formats that are conducive to sharing, presenting & collaborating. Examples of such interfaces are Jupyter Notebook & Apache Zeppelin.