Monday, August 31, 2020

Breaking Ads Data Hub

What is Ads Data Hub?

The cookie is crumbling - it's the most talked-about topic in the ad-tech industry in 2020.

Advertisers need a way to measure view-through conversions - ads that were viewed but not clicked - and the user journey from there on.

Earlier, the browser could be identified using the cookie, but post-2021 this capability will be gone. Previously, you could upload raw log data for each click, view, or event, including the UserID, from DCM to Google BigQuery.

Ads Data Hub solves this by running queries across Campaign Manager data, website sessions & online purchases from Google Analytics, and paid orders from the CRM.

How does it do it? Ads Data Hub links two BigQuery projects — your own and Google’s.


The Google-owned project stores log data from Campaign Manager, Display & Video 360, YouTube, and Google Ads. You can't get this information elsewhere because of GDPR rules.

The other project stores all your marketing data (online and offline) uploaded to BigQuery from Google Analytics, CRM, or other sources. 

Ads Data Hub, as an API, allows you to query data from both projects together without moving user-level data between them; only aggregated results come back.
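For a feel of what this enables, here is a purely conceptual sketch (not the actual ADH mechanics or API): it joins illustrative user-level ad logs with first-party CRM data, then returns only aggregated rows that clear a minimum-user threshold, similar in spirit to ADH's aggregation requirements. All table and column names are invented for illustration.

    # Conceptual sketch only: join user-level ad logs with first-party CRM data,
    # but return only aggregated rows above a minimum-user threshold.
    import pandas as pd

    ad_impressions = pd.DataFrame({          # stands in for Google's project
        "user_id": [1, 2, 3, 4, 5, 6],
        "campaign": ["summer", "summer", "summer", "winter", "winter", "winter"],
    })
    crm_orders = pd.DataFrame({              # stands in for your own project
        "user_id": [1, 2, 4, 5, 6],
        "revenue": [120.0, 80.0, 60.0, 90.0, 30.0],
    })

    joined = ad_impressions.merge(crm_orders, on="user_id", how="left")
    aggregated = (joined.groupby("campaign")
                        .agg(users=("user_id", "nunique"),
                             revenue=("revenue", "sum"))
                        .reset_index())

    MIN_USERS_PER_ROW = 3                    # illustrative privacy threshold
    print(aggregated[aggregated["users"] >= MIN_USERS_PER_ROW])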

Tuesday, March 10, 2020

Breaking - Cloud computing

What is Cloud computing?

Anything that involves delivering hosted services over the internet - applications or data that live in a remote location. It refers to manipulating, configuring, and accessing hardware & software resources remotely.

Why would you need it?

It allows flexibility of operations, saves costs on infrastructure, and lets you scale resources up & down quickly.

The underlying concepts for Cloud computing:

Deployment models (the type of access to the cloud):

Private cloud- Services & systems accessible only within an organization, hence more secure.
Public cloud- Accessible to the general public, hence less secure. (Google, Amazon & Microsoft)
Community cloud- Accessible to a group of organizations.
Hybrid cloud- A mix of private & public, where critical services run on the private cloud & others on the public cloud.

Service models - IaaS, PaaS, SaaS, XaaS, IaC.

XaaS - Anything as a Service, e.g. Network, Database, Business, Identity, or Strategy as a Service.

Infrastructure as Code (IaC)- Defining and provisioning infrastructure through code or machine-readable definition files rather than manual setup. Cloud providers also offer language-specific SDKs containing APIs that let you incorporate the connectivity & functionality of a wide range of services into your code without the difficulty of writing those functions yourself.
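As a minimal infrastructure-as-code sketch (assuming the AWS CDK for Python, i.e. aws-cdk-lib and constructs, is installed; the bucket and stack names are arbitrary):

    # Minimal IaC sketch using the AWS CDK for Python. Running `cdk deploy`
    # against this app would create the declared S3 bucket; names are arbitrary.
    from aws_cdk import App, Stack
    from aws_cdk import aws_s3 as s3
    from constructs import Construct

    class StorageStack(Stack):
        def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
            super().__init__(scope, construct_id, **kwargs)
            # The infrastructure (an S3 bucket) is declared in code,
            # not clicked together in a console.
            s3.Bucket(self, "RawDataBucket", versioned=True)

    app = App()
    StorageStack(app, "StorageStack")
    app.synth()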

Technologies at play at the back-end:

Virtualization: The abstraction of resources - making a single physical resource function as multiple resources (server, OS, app, or storage), OR making multiple resources function as a single resource (storage or servers).

It is a technology that allows you to optimize computing resources like networking, memory, storage, CPU, etc.

Server virtualization: One physical machine is divided into many virtual servers. The core concept is the hypervisor, or virtual machine monitor, which provides a layer that intercepts OS calls to hardware, thereby providing virtualized CPU & memory for the guests running on top of it.


Pros & Cons of Server Virtualization:

Service-Oriented Architecture 

It helps applications be used as services regardless of the vendor, product, or technology, making it possible to exchange data between applications from different vendors without additional programming.

Grid Computing

This refers to distributed computing, in which a group of computers from multiple locations are connected to achieve a common objective. It basically breaks complex tasks into smaller pieces.

Utility Computing

A pay-per-use model offering computational resources as a metered service.

Key terms in Cloud computing: 

Storage- Keeps multiple replicas of the data
Server- Handles resource sharing and the allocation/de-allocation of compute
Network- Connects cloud services over the internet. Customers can customize the network route & protocol.
Deployment software- Helps deploy & integrate applications on the cloud
Management software- Helps maintain & configure the infrastructure

Friday, February 21, 2020

Breaking - Marketing Automation

What is Marketing Automation?

A combination of tools & thinking to personalize the journey of prospects and make them happy customers.

Why do you need it?

To identify the right leads, nurture them, move them faster through the funnel, generate revenue & deliver better ROI on campaigns.

Email marketing & landing page development

Sending emails and getting people to land on designated pages using a CTA to help them convert is the essence of this practice. 

Lead Management (Capturing, Scoring & Nurturing)

Lead Capturing: Leads can come from various places like websites, social media posts, email lists, campaigns, etc. These leads should ideally land in the marketing automation tool automatically with the help of an API.

Lead Scoring: Assigning values to leads based on predetermined rules about behavior (on the website, on social) or demographics; predictive scoring is also possible by connecting the CRM system.
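A minimal sketch of rule-based lead scoring; the rules, field names, point values, and thresholds below are made-up assumptions, not any particular platform's defaults.

    # Toy rule-based lead scoring; weights and thresholds are illustrative only.
    SCORING_RULES = {
        "visited_pricing_page": 20,
        "opened_last_email": 10,
        "job_title_is_decision_maker": 25,
        "company_size_over_200": 15,
    }

    def score_lead(lead: dict) -> int:
        """Sum the points for every rule the lead satisfies."""
        return sum(points for rule, points in SCORING_RULES.items() if lead.get(rule))

    lead = {"visited_pricing_page": True, "opened_last_email": True}
    print(score_lead(lead))  # 30 -> keep nurturing; e.g. >= 50 might go to sales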

Lead Nurturing: Keeping leads engaged through personalized campaigns.

CRM Integration

Most businesses need to align sales and marketing, so CRM integration has become a must for marketing automation. Platforms like Salesforce, Microsoft Dynamics, Oracle NetSuite & SugarCRM are available as connectors nine times out of ten.

The sales team enters leads, and the marketing automation platform picks those leads up automatically so they can be acted upon.

Apps Marketplaces

Using third-party apps helps free up time for other marketing activities - apps for content creation, benchmarking, SEO rankings, etc. Sometimes these connections are made through an API, which might incur a cost per call.

Dynamic content creation 

The ability to create, send & measure personalized email campaigns. Some platforms have better personalization capabilities for content creation on websites or in emails. Platforms also differ between static & dynamic content, the latter adjusting as people interact with it. Progressive profiling is used to capture additional information from a prospect each time they fill in a form.

Email & message deliverability monitoring, and a dedicated IP for sending, are also useful features in marketing automation platforms.

Account-Based Marketing

B2B selling requires marketers to reach out to multiple stakeholders in an organization, which makes ABM essential for a complex high-value sale. Hence enhanced account nurturing and predictive scoring come in handy in such cases. 

Mobile Marketing

Since mobile is now at the heart of most campaigns, many marketing automation tools offer responsive templates for email, landing pages & webforms. There are also email testing tools that help preview emails across devices and clients.

In-app marketing, push notifications & geo-fencing are increasingly deployed to win share of mind with prospects.

Predictive Analytics

Most marketing platforms offer standard analytics like clickstream data & campaign responses. Many now offer predictive models that surface trends and insights to enhance the customer experience. Advanced platforms also recommend the next best action - a product recommendation or website content - based on the prospect's interaction with brand assets like the website, CRM, etc.

Social Integration

Almost every platform will have some method of publishing & monitoring on social networks from within. Some have advanced features to monitor the social behavior of leads and score them accordingly.

How to measure the success of marketing automation

It is wise to benchmark your data before getting started on the journey to email automation. Also, be mindful of the synergies between email and other campaigns when measuring performance.

Some of the most popular platforms:

Salesforce Pardot
Hubspot
Marketo
Eloqua
Mailchimp





Sunday, February 16, 2020

Breaking MMM vs MTA vs CCA


Marketing Mix Modelling vs Multi-Touch Attribution vs Cross Channel Attribution 

Why do you need MMM or MTA?

Every marketer needs to know where to spend their budget to get maximum returns. There are different methods/models in analytics and data science to help the marketer make sense of their spend versus returns.

Let's try and break them one by one.

Marketing Mix Modelling (Media Mix Modelling)

Analyses mostly offline data like TV, print & radio and gives a high-level picture of what is working and where to invest. It takes a holistic view of the market environment - price, seasonality, weather, etc. - and of the effect of these factors on the performance of any marketing campaign. This is mostly done a couple of times a year at most.

Multi-Touch Attribution

MTA takes more of a ground-up approach as it looks at user journeys and tries to ascertain their conversion paths. Every touchpoint in the user journey is assigned a specific weight and feeds into the MTA analysis. It is more of a real-time digital play when compared to MMM which is mostly historical in nature.
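As a sketch of how touchpoint weights might be assigned, here is a simple position-based (U-shaped) attribution rule; the 40/20/40 split is just one common convention, not the only algorithm, and the channel names are illustrative.

    # Position-based (U-shaped) attribution: 40% to first touch, 40% to last,
    # and the remaining 20% split across the middle touches. Illustrative only.
    def position_based_credit(path):
        if not path:
            return {}
        if len(path) == 1:
            return {path[0]: 1.0}
        if len(path) == 2:
            return {path[0]: 0.5, path[1]: 0.5}
        credit = {channel: 0.0 for channel in path}  # repeated channels share a key
        credit[path[0]] += 0.4
        credit[path[-1]] += 0.4
        middle_share = 0.2 / (len(path) - 2)
        for channel in path[1:-1]:
            credit[channel] += middle_share
        return credit

    print(position_based_credit(["display", "paid_search", "email", "direct"]))
    # {'display': 0.4, 'paid_search': 0.1, 'email': 0.1, 'direct': 0.4}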

Let's look at the Pros & Cons of both approaches:

MMM - Pros

1. Macro-level & strategic in nature.
2. Financial implications of spend (ROI)
3. Longer time frame & context for the brand

MMM- Cons

1. Not real-time hence delays optimization
2. High level hence difficult to go deeper into specifics of user journeys

MTA - Pros
1. Quick response is possible
2. Helps understand which channels work best together & what does not

MTA - Cons
1. Misses offline channels, as most of the emphasis is on digital
2. Baseline conversions which happen without any marketing are missed
3. Better than last-click attribution but still restricted to one algorithm

Due to this disconnect, MMM & MTA are run at different times to gauge overall marketing.

Cross-Channel Attribution   

If we use a common key, like time, we can stitch together both MTA & MMM, using advanced analytics to allocate proportional credit to each touchpoint - online or offline - that leads to the desired customer action.




Friday, February 14, 2020

Breaking - Datalake & Data Warehouse


What is a Data Lake & how is it different from a Data Warehouse?

Enterprises have loads of data which is often very varied and difficult to make sense of. A Data lake keeps that data in its purest form which can be used for different purposes. 

Traditionally Data warehouses have helped us use the data that we have but in the age of Big Data, this model doesn't serve us well. 

Data Warehouse


We take data from structured data sources, do some ETL and structuring, and based on a predefined data model create data marts for reporting, OLAP cubes for slicing & dicing, and visualization.

This process needs the understanding of the data that is coming in (Source, Type, Anomalies, Cardinality) along with the business requirement to make it work. 

It also assumes that the business understands its requirements. So most of the time goes into understanding the what, where and how of the data rather than on actual analysis.



Data Lake

Here the data sources can be structured or unstructured. We extract and load all types of data into raw data storage. This is persistent storage that can hold data at scale (volume).

Components of a data lake:


  1. A sandbox environment: For understanding & exploring data, creating prototypes & use cases.
  2. A batch processing engine: For converting raw data into structured data used for reporting.
  3. Real-time processing: Handles streaming data & processes it as it arrives.
  4. Cataloging & curating: Assesses the value of data based on its source, quality & lineage. This helps in deciding which data set to use for a particular analysis and provides metadata.

Lambda architecture: The hybrid approach of batch & real-time processing is called Lambda. The batch layer is the slow layer, whereas the speed layer takes care of fast incoming data. Once data has passed through the speed layer, it goes on for batch processing.

This makes room for more data to come into the speed layer. The batch layer & the speed layer are combined when queried for reporting.
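A conceptual sketch of that query-time merge (the data structures and numbers are invented; a real system would build the batch view with Hadoop or Spark and the speed view with a stream processor such as Storm or Spark Streaming):

    # Conceptual Lambda architecture: a precomputed batch view plus a small
    # real-time (speed) view, merged at query time. All numbers are invented.
    batch_view = {"2020-02-14": 1200, "2020-02-13": 980}  # slow, complete recomputation
    speed_view = {"2020-02-14": 35}                        # fast, recent events only

    def query(day: str) -> int:
        """Serve a metric by combining the batch and speed layers."""
        return batch_view.get(day, 0) + speed_view.get(day, 0)

    print(query("2020-02-14"))  # 1235: batch result plus real-time increment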


Glossary

ETL: Extract, Transform, Load (ETL) first extracts the data from a pool of data sources, which are typically transactional databases. The data is held in a temporary staging database. Transformation operations are then performed, to structure and convert the data into a suitable form for the target data warehouse system. The structured data is then loaded into the warehouse, ready for analysis.

Persistent storage: Persistent storage means making data available even when power is off. Like a hard disk. The volume, requirement of availability & distributed compute makes it complex storage. 

Batch processing: Typically uses Hadoop MapReduce.

Immutable data: That cannot be changed once stored.

Master data set: This is where all batch process data is stored. This data is immutable.

Atomic data: Data that cannot be broken down further - not calculated metrics like revenue.

Timestamp: Event-time information recorded in logs, used for organizing data.

Wednesday, January 29, 2020

Breaking - Amazon Web Services


What all comes under Amazon Web Services?

The cloud is becoming more & more critical for most businesses as it provides flexible, cost-effective, on-demand storage. The cloud computing service and deployment models give command and control over data.

Types of service models:

1. Infrastructure as a Service: IaaS lets users use Virtual machines, storage & servers 
2. Platform as a Service: PaaS lets users develop & host apps on the platform 
3. Software as a Service: SaaS lets users access these applications across all devices
4. Network as a Service: NaaS allows users to access network infrastructure directly & securely, deploy custom routing protocols, and manage & maintain network resources.
5. Identity as a Service: IDaaS manages a user's digital identity, which can be used during electronic transactions.

AWS has a huge size and presence in the cloud business today. It has two main products:

1. EC2: Virtual machine service 
2. S3: Amazon's storage system (Simple Storage Service)
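A small sketch of working with these two products through the boto3 SDK (assuming AWS credentials and a default region are already configured locally):

    # Minimal boto3 sketch: list S3 buckets and running EC2 instances.
    # Assumes AWS credentials/region are already configured locally.
    import boto3

    s3 = boto3.client("s3")
    for bucket in s3.list_buckets()["Buckets"]:
        print("S3 bucket:", bucket["Name"])

    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    for reservation in reservations:
        for instance in reservation["Instances"]:
            print("EC2 instance:", instance["InstanceId"], instance["InstanceType"])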

Amazon Redshift: Cloud-based Data Warehouse. 

Amazon Glacier: Low-cost cold storage for infrequently accessed data, with long retrieval times.

Amazon Elastic Block Store: Stores persistent data in block-level storage that remains available even when the EC2 instance is shut down.

Amazon Elastic Compute Cloud (EC2): Virtual servers, known as instances, for running applications.

Database Management: Relational database service (RDS) for users to migrate and backup data.

Data Migration: Brings data from servers, databases & applications into the AWS cloud, e.g. Migration Hub & Snowball.

Networking: Isolated sections of the AWS cloud and load balancing of traffic.

Security: IAM - Identity & Access management to create and control custom policies for multiple accounts.

Amazon Messaging Services: SQS (Simple Queue Service), SNS (Simple Notification Service), SES (Simple Email Service).

Amazon Development tools: Command Line Interface & SDKs.


Sunday, January 26, 2020

Breaking - Clustered Computing


What is Clustered Computing in Big data?


To handle big data, individual computers are often inadequate when it comes to storage and compute requirements. A node is a single computer, and a cluster is a bunch of them working together.

Clustering helps in combining resources of smaller machines:

1. Resource pooling: Storage, CPU & Memory
2. High availability: Varying levels of fault tolerance to prevent failures
3. Easy scalability: Add machines horizontally

Manual & automatic cluster examples: Veritas, native Linux clusters, IBM AIX-based clusters.

One also needs solutions to manage cluster membership, resource sharing & the scheduling of work on individual nodes, such as Hadoop's YARN or Apache Mesos.

The cluster acts as a foundation layer for other processing software.  

Monday, January 20, 2020

Breaking Big Data


What is Big data and the key terms under it? 

Big data in simple terms is non-traditional strategies to handle large data sets.

Main characteristics: 

1. Volume- Scale
2. Velocity- Real-time 
3. Variety- Formats (Structure)
4. Veracity- Accuracy (Complexity in processing)
5. Variability- In quality (Additional resources needed to improve data)
6. Value- Reduce complexity to show value

Life cycle:

Ingesting data: Taking raw data and adding it to the system

a) Data ingestion tools- Apache Sqoop can take data from relational databases
b) Projects- Apache Flume & Chukwa are designed to aggregate & import application & server logs
c) Queuing systems- Apache Kafka can act as an interface between data generators & the big data system
d) Ingestion frameworks- Apache Gobblin can aggregate & normalize the output of the above at the end of the ingestion pipeline

Persisting data: Persistent storage means making data available even when power is off. Like a hard disk. The volume, requirement of availability & distributed compute makes it complex storage. 

Hence the need to leverage a distributed file system like Apache Hadoop's HDFS, which allows large quantities of data to be written across multiple nodes in a cluster. Other distributed file systems are Ceph & GlusterFS.


Computing and analyzing data: This layer depends on the type of insights required. Data is often processed repeatedly, either multiple times by a single tool or by a number of tools to bring different insights.

Batch processing is one of the methods: the data is split into batches or pieces that are handed to different machines, and the outputs are then used to calculate and assemble the final result. These steps are referred to as splitting, mapping, shuffling, reducing & assembling, or collectively as MapReduce. This is the strategy used by Apache Hadoop's MapReduce program for very large data sets.
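As a toy illustration of those stages (plain Python standing in for work that Hadoop would distribute across many machines), here is a word count broken into split, map, shuffle, and reduce steps:

    # Toy word count showing the MapReduce stages; Hadoop would run the map and
    # reduce steps in parallel across cluster nodes.
    from collections import defaultdict

    documents = ["big data needs big clusters", "clusters of many nodes"]

    # Split: each document is a chunk handed to a (notional) worker.
    # Map: emit (word, 1) pairs.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group values by key.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: aggregate each group; Assemble: collect the final result.
    word_counts = {word: sum(counts) for word, counts in groups.items()}
    print(word_counts)  # {'big': 2, 'data': 1, 'needs': 1, 'clusters': 2, ...}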

Other workloads require more real-time processing. These work in-memory, avoiding writes back to disk. Apache Storm, Apache Flink, and Apache Spark provide different ways of achieving real-time processing, a strategy suited to the smaller pieces of data that arrive continuously.

There are other ways of computing, using tools that plug into the frameworks above and provide an additional interface. For example:

Apache Hive: Data warehouse for Hadoop
Apache Pig: High-level querying interface
Apache Drill, Impala, Spark SQL & Presto: SQL-like interactions can be done using these projects
Apache SystemML, Mahout, Spark's MLlib: For machine learning projects
R & Python: Analytics programming

Visualizing: Recognizing changes over time is often more important than the data values themselves, and visualizing data helps spot trends. Real-time processing is needed to visualize application & server metrics.

Prometheus: Processing data stream as a time-series database and visualizing it.
Elastic stack: Logstash for data collection, Elasticsearch for indexing data & Kibana for visualization.
Silk: Apache Solr for indexing & Kibana fork called Banana for visualization. 

Another visualization technology used for interactive data science work is the data "notebook", a format that is conducive to sharing, presenting & collaborating. Examples of such interfaces are Jupyter Notebook & Apache Zeppelin.

