Wednesday, January 29, 2020

Breaking - Amazon Web Services


What all comes under Amazon Web Services?

The cloud is becoming more & more critical for most businesses, as it provides flexible, cost-effective, on-demand storage and compute. The cloud computing service models give businesses command and control over their data.

Types of service models:

1. Infrastructure as a Service: IaaS lets users provision virtual machines, storage & servers
2. Platform as a Service: PaaS lets users develop & host apps on the provider's platform
3. Software as a Service: SaaS lets users access applications across all their devices
4. Network as a Service: NaaS lets users access network infrastructure directly & securely. It helps them deploy custom routing protocols and manage & maintain network resources.
5. Identity as a Service: IDaaS manages a user's digital identity, which can be used during electronic transactions.

AWS has a huge presence in the cloud business today. It has two main products:

1. EC2: Virtual machine service 
2. S3: Amazon storage service (Simple Storage Service)
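
Below is a minimal sketch of both products using boto3, the AWS SDK for Python; the bucket name, file name and credentials setup are assumptions for illustration.

```python
# A minimal sketch using boto3 (the AWS SDK for Python). Credentials are
# assumed to be configured already (e.g. via the AWS CLI); the bucket and
# file names are hypothetical.
import boto3

# EC2: list the IDs and types of currently running virtual machines
ec2 = boto3.client("ec2")
response = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)
for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["InstanceType"])

# S3: upload a local file into a bucket
s3 = boto3.client("s3")
s3.upload_file("report.csv", "my-example-bucket", "reports/report.csv")
```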

Amazon Redshift: Cloud-based Data Warehouse. 

Amazon Glacier: Low-cost cold storage for infrequently accessed data, in exchange for long retrieval times.

Amazon Elastic Block Store: Stores persistent data in block-level volumes that remain available even when the attached EC2 instance is shut down.

Amazon Elastic Compute Cloud (EC2): Virtual servers known as instances for running applications.

Database Management: Relational Database Service (RDS) for users to migrate and back up data.

Data Migration: Migration Hub & Snowball bring data from servers, databases & applications into the AWS cloud.

Networking: A secluded segment of the AWS cloud (Amazon VPC) and load balancing of traffic (Elastic Load Balancing).

Security: IAM - Identity & Access Management - to create and control custom policies for multiple accounts.
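
As a sketch of what a custom policy looks like, the snippet below creates a read-only S3 policy with boto3; the policy name and bucket ARN are hypothetical.

```python
# A hedged sketch of creating a custom IAM policy with boto3; the policy
# name and bucket ARN are made-up examples.
import json
import boto3

iam = boto3.client("iam")

# Allow read-only access to a single (hypothetical) S3 bucket
policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-example-bucket",
            "arn:aws:s3:::my-example-bucket/*",
        ],
    }],
}

iam.create_policy(
    PolicyName="ExampleS3ReadOnly",
    PolicyDocument=json.dumps(policy_document),
)
```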

Amazon Messaging Services:

1. SQS: Simple Queue Service
2. SNS: Simple Notification Service
3. SES: Simple Email Service
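
A short SQS sketch with boto3 (the queue name is a placeholder) shows the basic send/receive loop:

```python
# A minimal SQS sketch with boto3; the queue name is hypothetical.
import boto3

sqs = boto3.client("sqs")

# Create a queue and send a message to it
queue_url = sqs.create_queue(QueueName="example-queue")["QueueUrl"]
sqs.send_message(QueueUrl=queue_url, MessageBody="hello from SQS")

# Receive the message, process it, then delete it from the queue
response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
for message in response.get("Messages", []):
    print(message["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```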

Amazon Development tools: Command Line Interface (CLI) & SDKs


Sunday, January 26, 2020

Breaking - Clustered Computing


What is Clustered Computing in Big data?


To handle big data, individual computers are often inadequate in terms of storage and compute requirements. A node is a single computer, and a cluster is a group of nodes working together.

Clustering helps combine the resources of smaller machines:

1. Resource pooling: Storage, CPU & Memory
2. High availability: Varying levels of fault tolerance, so a single failure doesn't halt work
3. Easy scalability: Add machines horizontally

Manual & Automatic cluster examples: Veritas, Linux natic clusters, IBM AIX bases clusters.

One also needs software to manage cluster membership, share resources & schedule work on individual nodes, such as Hadoop's YARN or Apache Mesos.
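
As a hedged sketch of how a processing framework requests resources from such a cluster manager, the PySpark snippet below asks YARN for executors; it assumes a configured Hadoop/YARN environment, and the resource numbers are arbitrary.

```python
# A sketch of submitting work to a YARN-managed cluster via PySpark.
# Assumes a configured Hadoop/YARN environment; for a local test,
# replace "yarn" with "local[*]".
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-demo")
    .master("yarn")                            # let YARN schedule the work
    .config("spark.executor.instances", "4")   # ask for 4 worker containers
    .config("spark.executor.memory", "2g")     # pooled memory per executor
    .getOrCreate()
)

# The same code runs unchanged whether the cluster has 3 nodes or 300
total = spark.sparkContext.parallelize(range(1_000_000)).sum()
print(total)
spark.stop()
```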

The cluster acts as a foundation layer for other processing software.  

Monday, January 20, 2020

Breaking - Big Data


What is Big data, and what are the key terms under it?

Big data, in simple terms, refers to the non-traditional strategies needed to handle large data sets.

Main characteristics: 

1. Volume- Scale
2. Velocity- Real-time 
3. Variety- Formats (Structure)
4. Veracity- Accuracy (Complexity in processing)
5. Variability- In quality (Additional resources needed to improve data)
6. Value- Reduce complexity to show value

Life cycle:

Ingesting data: Taking raw data and adding it to the system

a) Data ingestion tools- Apache Sqoop can take data from relational databases
b) Projects- Apache Flume & Chukwa are designed to aggregate & import application & server logs
c) Queuing systems- Apache Kafka can act as an interface between data generators & the big data system
d) Ingestion frameworks- Apache Gobblin can aggregate & normalize the output of the above tools at the end of the ingestion pipeline
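
As an illustration of the queuing step (c), the sketch below pushes an event into Kafka with the kafka-python package; the broker address, topic and event fields are placeholders.

```python
# A minimal sketch of Kafka as the interface between a data generator
# and the big data system (kafka-python package); the broker address
# and topic name are hypothetical.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# The data generator pushes events; downstream tools consume them at
# their own pace.
producer.send("clickstream", {"user": "u42", "page": "/home"})
producer.flush()
```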

Persisting data: Persistent storage keeps data available even when the power is off, like a hard disk. The volume of data, the availability requirements & the distributed compute make this storage complex.

Hence the need to leverage a distributed file system like Apache Hadoop's HDFS, which allows large quantities of data to be written across multiple nodes in a cluster. Other distributed file systems are Ceph & GlusterFS.
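
A hedged sketch of writing into HDFS from Python follows, using pyarrow; the namenode host, port and path are placeholders, and a configured Hadoop client (libhdfs) is assumed.

```python
# A sketch of writing a record into HDFS with pyarrow; host, port and
# path are hypothetical, and libhdfs must be available locally.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# HDFS transparently splits the file into blocks and replicates them
# across the cluster's data nodes.
with hdfs.open_output_stream("/data/raw/events.jsonl") as stream:
    stream.write(b'{"user": "u42", "page": "/home"}\n')
```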


Computing and analyzing data: This layer depends on the type of insights required. Data is often processed repeatedly, either multiple times by a single tool or by a number of tools to bring different insights.

Batch processing is one such method: the data is broken into pieces, each piece is handed to a different machine, and the outputs are then combined into a final result. These steps are referred to as splitting, mapping, shuffling, reducing & assembling, or collectively as map-reduce. This is the strategy used by Apache Hadoop's MapReduce for very large data sets.
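
A toy, single-machine illustration of those steps in plain Python follows; real frameworks run the same shape of computation across many nodes.

```python
# Word count, the classic map-reduce example, on one machine:
# split -> map -> shuffle -> reduce -> assemble.
from collections import defaultdict
from functools import reduce

documents = ["big data big clusters", "data pipelines move big data"]

# Map: emit a (word, 1) pair for every word in every split
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce & assemble: combine each group into a final count
word_counts = {w: reduce(lambda a, b: a + b, c) for w, c in groups.items()}
print(word_counts)  # {'big': 3, 'data': 3, 'clusters': 1, ...}
```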

Other workloads require more real-time processing. Data is processed in memory, avoiding writes back to disk. Apache Storm, Apache Flink, and Apache Spark provide different ways of achieving real-time processing, a strategy better suited to smaller pieces of data.
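
As a hedged sketch of this approach, the Spark Structured Streaming snippet below counts words as lines arrive over a socket; the socket source is a stand-in for a real stream such as Kafka (for a local test, run `nc -lk 9999` first).

```python
# A sketch of real-time, in-memory word counting with Spark Structured
# Streaming; the localhost socket source is a placeholder for a real
# stream such as Kafka.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Read lines from a socket and count words as they arrive
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print running counts to the console after every micro-batch
query = (counts.writeStream.outputMode("complete")
         .format("console").start())
query.awaitTermination()
```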

There are also tools that plug into the above frameworks and provide additional interfaces, for example:

Apache Hive: Data warehouse for Hadoop
Apache Pig: High-level querying interface
Apache Drill, Impala, Spark SQL & Presto: SQL-like interactions can be done using these projects
Apache SystemML, Mahout & Spark's MLlib: For machine learning projects
R & Python: Analytics programming

Visualizing: Recognizing changes over time is often more important than the values themselves. Visualizing data helps spot trends. Real-time processing is frequently used to visualize application & server metrics.

Prometheus: Processes data streams as a time-series database and visualizes them.
Elastic stack: Logstash for data collection, Elasticsearch for indexing & Kibana for visualization.
Silk: Apache Solr for indexing & a Kibana fork called Banana for visualization.

Another visualization technology used for interactive data science work is the data "notebook", especially in formats conducive to sharing, presenting & collaborating. Examples of such interfaces are Jupyter Notebook & Apache Zeppelin.

