Sunday, March 18, 2018

Microsoft SQLServer 2012 Native Sequence In The Definition Of ODI Sequence

How To Use a Microsoft SQLServer 2012 Native Sequence
In The Definition Of ODI Sequence
For a native sequence to work, the ODI configuration needs to be adjusted so that the drop-down list displayed when you create the ODI Sequence shows the native sequences that exist in the Microsoft SQL Server Logical Schema being used:
1. Update the "Microsoft SQL Server" technology so that it allows the usage of native sequences:

a) Start ODI Studio and navigate to Topology Manager > Physical Architecture > Technologies. Edit the "Microsoft SQL Server" technology. On the "Definition" tab, check the "Supports native sequences" box, and set the following masks (see the runtime example after step 2 below):

  • Local Sequence Mask: NEXT VALUE FOR %SCHEMA.%OBJECT
  • Remote Sequence Mask: NEXT VALUE FOR %DSERVER.%SCHEMA.%OBJECT



b) On the "SQL" tab, in the "Specific Queries" area, go to the "Reverse Engineer Native Sequences" tab, and set the query to:
select SEQUENCE_NAME as SEQUENCE_NAME 
from INFORMATION_SCHEMA.SEQUENCES
where SEQUENCE_CATALOG= :CATALOG 
and SEQUENCE_SCHEMA = :SCHEMA
order by SEQUENCE_NAME
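
To sanity-check this configuration outside ODI, you can create a test sequence and run the same query with the bind variables replaced by literals. The sequence, catalog and schema names below are made-up examples, not part of the actual setup:

-- Create a hypothetical native sequence in the dbo schema of an example database
CREATE SEQUENCE dbo.SEQ_CUSTOMER_ID
    START WITH 1
    INCREMENT BY 1;

-- The reverse-engineering query, with :CATALOG and :SCHEMA replaced by example literals
select SEQUENCE_NAME as SEQUENCE_NAME
from INFORMATION_SCHEMA.SEQUENCES
where SEQUENCE_CATALOG = 'MyDatabase'   -- example catalog (database) name
  and SEQUENCE_SCHEMA  = 'dbo'          -- example schema name
order by SEQUENCE_NAME
-- Expected result: SEQ_CUSTOMER_ID listed among the native sequences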

2. Use the native sequence in the definition of an ODI Sequence:

a) Navigate to Designer and create your ODI Sequence (either as a Project Sequence, from the "Projects" > "Sequences" navigator, or as a Global Sequence, from the "Global Objects" > "Global Sequences" navigator). On the "Definition" tab, select the "Native sequence" radio button and choose the desired Microsoft SQL Server Logical Schema. Set the "Native sequence name", or choose the desired one from the lookup window.

Usage of Microsoft SQL Server Native Sequence
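
As a rough illustration of what the sequence masks above produce at runtime (the table, column and sequence names here are invented for the example), the SQL that ODI generates against the target would look something like this:

-- Local Sequence Mask "NEXT VALUE FOR %SCHEMA.%OBJECT" resolved to a concrete sequence
-- Hypothetical target and source tables, for illustration only
INSERT INTO dbo.TRG_CUSTOMER (CUSTOMER_ID, CUSTOMER_NAME)
SELECT NEXT VALUE FOR dbo.SEQ_CUSTOMER_ID,   -- the resolved Local Sequence Mask
       SRC.CUSTOMER_NAME
FROM   dbo.SRC_CUSTOMER AS SRC;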

Saturday, March 17, 2018

The Big Data Reference Integration Architecture


From today onwards I will start writing about my learning and experiments with next-gen data, its integration, and data visualization.

What is Big Data?

Global digital content created will increase some 30 times over the next ten years, to 35 zettabytes. Big data is a popular, but poorly defined, marketing buzzword. One way of looking at big data is that it represents the large and rapidly growing volume of information that is mostly untapped by existing analytical applications and data warehousing systems. Examples of this data include high-volume sensor data and social networking information from web sites such as Facebook and Twitter. Organizations are interested in capturing and analyzing this data because it can add significant value to the decision-making process. Such processing, however, may involve complex workloads that push the boundaries of what is possible using traditional data warehousing and data management techniques and technologies.

In addition, Big Data has popularized two foundational storage and processing technologies: Apache Hadoop and NoSQL databases.

Big Data has also been defined as any data which satisfies the characteristics of the five “V”s: Volume, Velocity, Variety, Veracity and Value. These become a reasonable test to determine whether you should add Big Data to your information architecture.

Volume. The amount of data. While volume indicates more data, it is the granular nature of the data that is unique. Big Data requires processing high volumes of low-density data, that is, data of unknown value, such as Twitter data feeds, clicks on a web page, network traffic, sensor-enabled equipment capturing data at high speed, and many more. It is the task of Big Data to convert low-density data into high-density data, that is, data that has value. For some companies, this might be tens of terabytes; for others it may be hundreds of petabytes.
Velocity. The fast rate at which data is received and perhaps acted upon. The highest-velocity data normally streams directly into memory rather than being written to disk. Some Internet of Things (IoT) applications have health and safety ramifications that require real-time evaluation and action. Other internet-enabled smart products operate in real time or near real time. As an example, consumer e-commerce applications seek to combine mobile device location and personal preferences to make time-sensitive offers. Operationally, mobile applications have large user populations, increased network traffic, and the expectation of immediate response.
Variety. New unstructured data types. Unstructured and semi-structured data types, such as text, audio, and video, require additional processing to derive both meaning and the supporting metadata. Once understood, unstructured data has many of the same requirements as structured data, such as summarization, lineage, auditability, and privacy. Further complexity arises when data from a known source changes without notice. Frequent or real-time schema changes are an enormous burden for both transactional and analytical environments.
Veracity. The messiness or trustworthiness of the data. With many forms of big data, quality and accuracy are less controllable, for example Twitter posts with hashtags, abbreviations, typos, and colloquial speech. Big data and analytics technology now allows us to work with these types of data, and the volumes often make up for the lack of quality or accuracy. But all the volumes of fast-moving data of different variety and veracity have to be turned into value, which is why value is the one V of big data that matters the most.
Value. Our ability to turn data into value. It is important that businesses make a case for any attempt to collect and leverage big data. It is easy to fall into the buzz trap and embark on big data initiatives without a clear understanding of the business value they will bring.


Big Data Reference Architecture Overview


Traditional Information Architecture Capabilities
In the illustration, you see two data sources that use integration (ELT/ETL/Change Data Capture) techniques to transfer data into a DBMS data warehouse or operational data store, which then offers a wide variety of analytical capabilities to reveal insights from the data. Some of these analytic capabilities include dashboards, reporting, EPM/BI applications, summary and statistical queries, semantic interpretation of textual data, and visualization tools for high-density data. In addition, some organizations have applied oversight and standardization across projects, and perhaps have matured the information architecture capability by managing it at the enterprise level.

The key information architecture principles include treating data as an asset through a value, cost, and risk lens, and ensuring the timeliness, quality, and accuracy of data. The enterprise architecture oversight responsibility is to establish and maintain a balanced governance approach, including using a center of excellence for standards management and training.


When Big Data is Added - Transformed Data Architecture:




The defining processing capabilities of a big data architecture are to meet the five V's requirements. Unique distributed (multi-node) parallel processing architectures have been created to parse these large data sets. There are differing technology strategies for real-time and batch processing storage requirements. For real time, key-value data stores, such as NoSQL databases, allow for high-performance, index-based retrieval. For batch processing, a technique known as "MapReduce" filters data according to a specific data discovery strategy. After the filtered data is discovered, it can be analyzed directly, loaded into other unstructured or semi-structured databases, sent to mobile devices, or merged into the traditional data warehousing environment and correlated to structured data.

Reference Architecture for Big Data:



Big Data Architecture Patterns:

In addition to the new components, new architectures are emerging to efficiently accommodate new storage, access, processing, and analytical requirements. First is the idea that specialized data stores, fit for purpose, are able to store and optimize processing for the new types of data.

A polyglot strategy suggests that big-data-oriented architectures will deploy multiple types of data stores. Keep in mind that a polyglot strategy does add some complexity in management, governance, security, and skills.

Second, we can parallelize our data foundation for both speed and size; this is crucial for next-generation data services and analytics that can scale to any latency and size requirements. With a Lambda-based architecture we are now able to address the fast data that might be needed in an Internet of Things architecture.

Third, data pipelines allow us to treat data events in moving time windows at variable latencies; in the long run this will change how we do ETL for most use cases.

Many new analytic capabilities are available that derive meaning from new, unique data types, as well as finding straightforward statistical relevance across large distributions. Analytical throughput also impacts the transformation, integration, and storage architectures, for example real-time and near-real-time events, ad hoc visual exploration, and multi-stage statistical models. Nevertheless, it is common after MapReduce processing to move the "reduction result" into the data warehouse and/or a dedicated analytical environment in order to leverage existing investments and skills in business intelligence reporting, statistical, semantic, and correlation capabilities. Dedicated analytical environments, also known as Discovery Labs or sandboxes, are architected to be rapidly provisioned and deprovisioned as needs dictate.

One of the obstacles observed in enterprise Hadoop adoption is the lack of integration with the existing BI ecosystem. As a result, the analysis is not available to the typical business user or executive. When traditional BI and big data ecosystems are separate, they fail to deliver the value-added analysis that is expected. Independent Big Data projects also run the risk of redundant investments, which is especially problematic if there is a shortage of knowledgeable staff.
 

Wednesday, March 14, 2018

Schema-less DB for a Universal Stage using MongoDB & Alteryx

Hey All

It’s been a long time since I blogged. Today I am here with the build and demonstration of a universal STG (staging) layer for different data sources using a schema-less DB, MongoDB, which is a lifesaver in the data integration area.

Problem:

In a conventional data integration process, a relational DB schema is used as the staging layer for the GW PC XML data feed. A STG layer is always subject to changes in data structure or business requirements in an agile execution. It can also have 'n' number of sources when future compatibility is considered. Given these scenarios, accommodating changes flexibly and handling a wide variety of data is very complex if the STG layer is a relational database. The lead time for real-time analytics is also high.


What We are trying to implement here:

The XML feed is converted into JSON and loaded into a schema-less DB. Since MongoDB is a document DB, the degree of flexibility to change is very high and the schema migration process becomes very simple.

For this POC, the below two scenarios are implemented:

1) Loaded the STG_PLCY_PRD, STG_PRTY_AGMT_ROLE and STG_PRSN tables in the STG layer (this schema is a relational STG layer with no constraints; a rough sketch of one such table is shown below)

2) Loaded the AGMT, PRSN and PRTY_AGMT tables in the ODS layer directly from MongoDB (a sample star schema in the core ODS layer)
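
For reference, here is a minimal sketch of what one of the constraint-free staging tables could look like. The column names and data types are assumptions for this POC, not the actual Guidewire feed structure:

-- Hypothetical structure of a staging table with no constraints
CREATE TABLE dbo.STG_PRSN (
    PC_ID      INT          NULL,   -- assumed PolicyCenter identifier column
    FRST_NAME  VARCHAR(100) NULL,
    LST_NAME   VARCHAR(100) NULL,
    LOAD_DTTM  DATETIME     NULL    -- assumed load timestamp
);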

Technology Used:

MongoDB, Robomongo, XML to JSON converter, Alteryx, SQL Server

Detailed Solution Steps with Screenshots:


STEP 1:


  • Download and install MongoDB and Robomongo on your local machine.
  • Create a folder named "data" in the C: drive (MongoDB's default data directory is under C:\data).
STEP 2:
  • In a command prompt, change to the "mongod.exe" file location and press Enter.


  STEP 3:
  • start mongod.exe



STEP 4:


  • Open Robomongo.exe file

Step 5:


  • Set new connections and save.



     
  • Then click connect






Step 6: Import JSON files into MongoDB


  • In a command prompt, change to the MongoDB bin folder (where mongod.exe lives) and run mongoimport, passing the JSON file location:
mongoimport --db startup_log --collection test --file <path to the JSON file>






  Step 7:


  • In the Robomongo page, go to the startup_log database > Collections.
  • There we can view the imported data.



Step 8:


  • In the Robomongo page we can view the imported documents.


Step 9: Load MongoDB data into the STG DB/ODS using Alteryx


Set up the source DB connection from MongoDB
  • Launch Alteryx and select 'Connectors' from the tool palette.
  • Drag 'MongoDB Input' onto the workflow.
  • Set the server name to 'localhost',
       Database: 'startup_log'
       Collection: 'test'




Set up the target DB connection to the STG/ODS relational DB


  • Select 'In/Out' from the tool palette and drag 'Output Data' onto the workflow.
  • Double-click the Output Data object and configure the target DB connection.


  • Set the file location and select the 'Update, Insert if New' output option.
  • Do the custom data mapping from source to target.


  • Save and run the workflow.


  • For this POC the two scenarios were tried out:
1) STG_PLCY_PRD, STG_PRTY_AGMT_ROLE and STG_PRSN were loaded in the STG layer (this schema is a relational STG layer with no constraints)

2) AGMT, PRSN and PRTY_AGMT were loaded in the ODS layer directly from MongoDB (a sample star schema in the ODS)

Step 11: Viewing results in the DB (SQL Server)


AGMT:
Updated PRD_END and PRD_STRT
Inserted new ID: pc: 79


PRSN:


Updated FRST_NAME and LST_NAME for pc: 1078 and pc: 1079
Inserted new ID: pc: 1080
PRTY_AGMT:


Updated SBTYP for pc: 166
Inserted new ID: pc: 168
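
As a hedged example, the checks above could be reproduced with queries like the following against the ODS tables. The identifier column name PC_ID is an assumption for this POC:

-- New AGMT row and the updated PRD_STRT / PRD_END values
SELECT PC_ID, PRD_STRT, PRD_END FROM dbo.AGMT WHERE PC_ID = 79;

-- Updated names for pc 1078 / 1079 and the newly inserted pc 1080
SELECT PC_ID, FRST_NAME, LST_NAME FROM dbo.PRSN WHERE PC_ID IN (1078, 1079, 1080);

-- Updated SBTYP for pc 166 and the newly inserted pc 168
SELECT PC_ID, SBTYP FROM dbo.PRTY_AGMT WHERE PC_ID IN (166, 168);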