Thursday, July 19, 2018

Preparation for a successful Data Lake in the cloud


A data lake is a conceptual data architecture that is not tied to any specific technology. The technical implementation can therefore vary from technology to technology, which means different types of storage can be utilized, which in turn translates into varying features.
The pillars of a data lake also include scalable and durable storage of data, mechanisms to collect and organise that data, and tools to process and analyze the data and share the findings.
From an architectural point of view, a well-developed cloud-based data lake must be capable of serving many corporate audiences, including IT application, infrastructure, and operations teams, data scientists, and even line-of-business groups.
If you are planning a successful data lake, you should consider cloud service providers, which allow organizations to avoid the cost and hassle of managing an on-premises data center by moving storage, compute, and networking to hosted solutions. Cloud services also offer many other advantages, such as ease of provisioning, elasticity, scalability, and reduced administration.
Apart from this, we should consider the following:
Type of storage - The choice of data lake storage matters for any organisation because it is directly linked to cost and effort.
The most common data lake implementations utilize:
HDFS (Hadoop Distributed File System)
Proprietary distributed file systems with HDFS compatibility (ex: Azure Data Lake Store)
Object storage (ex: Azure Blob Storage or Amazon S3)
The following options for a data lake are less commonly used due to greatly reduced flexibility:
Relational databases (ex: SQL Server, Azure SQL Database, Azure SQL Data Warehouse)
NoSQL databases (ex: Azure Cosmos DB)
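Whichever storage option is chosen, a consistent layout for raw data pays off later: a common convention is to partition object-store keys by source system and ingestion date. A minimal sketch in Python (the `raw_zone_key` helper and its zone/partition names are illustrative assumptions, not part of any vendor API):

```python
from datetime import date

def raw_zone_key(source: str, dataset: str, ingest_date: date, filename: str) -> str:
    """Build an object-store key for the raw zone of a data lake.

    Partitioning by source system and ingestion date keeps raw data
    organized and cheap to prune when querying.
    """
    return (
        f"raw/{source}/{dataset}/"
        f"year={ingest_date.year:04d}/month={ingest_date.month:02d}/day={ingest_date.day:02d}/"
        f"{filename}"
    )

key = raw_zone_key("crm", "contacts", date(2018, 7, 19), "contacts_001.json")
print(key)  # raw/crm/contacts/year=2018/month=07/day=19/contacts_001.json
```

The same key scheme works unchanged on Azure Blob Storage, Amazon S3, or HDFS, since all three expose a path-like namespace.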

Security capabilities - We have already stated that a data lake is not based on any specific technology, so the implementation of security, privacy, and governance differs from technology to technology.
For example, a service such as Azure Data Lake Store implements hierarchical security based on access control lists, whereas Azure Blob Storage implements key-based security. These capabilities are continually evolving in the cloud, so be sure to verify them on a frequent basis.
On the other hand, AWS has a number of ready-to-roll services here, including AWS Identity and Access Management (IAM) for roles and AWS Key Management Service (KMS) to create and control the encryption keys used to encrypt our data.
Data management services – Data is the most important asset of any organisation and is used across different platforms. The data lake analogy was conceived to bring a common, visual understanding of the benefits of distributed computing systems able to handle multiple types of data, in their native formats, with a high degree of flexibility and scalability.

With the right data captured from a variety of sources, we should be able to expose that information to data professionals and business decision makers without an oppressive amount of red tape or bureaucracy from IT.
For example, AWS is introducing AWS Glue as an ETL engine to easily understand data sources, prepare the data, and load it reliably into data stores. On the Azure side, developers who manage information in Azure Data Lake (ADL) can use Data Lake Explorer within ADL Tools for Visual Studio Code to get a better and quicker grasp of their cloud-based big data environments.
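One way to picture what a catalog service such as AWS Glue provides is a registry mapping dataset names to their location, format, and schema. A toy, in-memory sketch (the `register` and `lookup` helpers are invented for illustration; Glue's real API differs):

```python
# A toy catalog: dataset name -> metadata that a crawler would normally infer.
catalog = {}

def register(name, location, fmt, columns):
    """Record where a dataset lives and what shape it has."""
    catalog[name] = {"location": location, "format": fmt, "columns": columns}

def lookup(name):
    """Return the recorded metadata for a dataset."""
    return catalog[name]

register("web_clicks", "s3://lake/raw/web_clicks/", "json", ["user_id", "url", "ts"])
print(lookup("web_clicks")["format"])  # json
```

Tools and analysts can then discover datasets by name instead of hunting through raw storage paths.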
Data Efficiency and Business Execution - One of the most powerful features of cloud-based deployments is elasticity, which refers to scaling resources up or down depending on demand. Data lakes should be made easily accessible to a wide range of users across any line of business or function, so that business users can use this internal data efficiently to perform core activities more effectively.
Disaster recovery - The most critical data from a disaster recovery standpoint is our raw data. The ability to recover our data after a damaging weather event, system error, or human error is crucial.
Azure Data Lake Store provides locally-redundant storage (LRS). Hence, the data in our Azure Data Lake Store account is resilient to transient hardware failures within a region through automated replicas. This ensures durability and high availability, meeting the Azure Data Lake Store SLA.
AWS offers all the tools and capabilities we need to transfer data into the cloud and build comprehensive backup and restore solutions that are compatible with our IT environment.
Please visit us to learn more on -
  1. Collaboration of OLTP and OLAP systems
  2. Major differences between OLTP and OLAP
  3. Data Warehouse - Introduction
  4. Data Warehouse - Multidimensional Cube
  5. Data Warehouse - Multidimensional Cube Types
  6. Data Warehouse - Architecture and Multidimensional Model
  7. Data Warehouse - Dimension tables.
  8. Data Warehouse - Fact tables.
  9. Data Warehouse - Conceptual Modeling.
  10. Data Warehouse - Star schema.
  11. Data Warehouse - Snowflake schema.
  12. Data Warehouse - Fact constellations
  13. Data Warehouse - OLAP Servers.
  14. Preparation for a successful Data Lake in the cloud
  15. Why does cloud make Data Lakes Better?

Wednesday, July 18, 2018

Why does cloud make Data Lakes better?

A data lake is a conceptual data architecture that is not tied to any specific technology. The technical implementation can therefore vary from technology to technology, which means different types of storage can be utilized, which in turn translates into varying features.
A key point about the data lake is that it is not going to replace a company’s existing investment in its data warehouse and data marts; in fact, they complement each other very nicely. With a modern data architecture, organizations can continue to leverage their existing investments, begin collecting data they have been ignoring or discarding, and ultimately enable analysts to obtain insights faster. Employing cloud technologies shifts costs to a subscription-based model, which requires much less up-front investment in both cost and effort.

Most organizations are enthusiastically considering the cloud for functions like Hadoop, Spark, databases, data warehouses, and analytics applications, so it makes sense for them to build their data lake in the cloud as well. Some of the key benefits include:
  1. Pervasive security - A cloud service provider incorporates all the aggregated knowledge and best practices of thousands of organizations, learning from each customer’s requirements.
  2. Performance and scalability - Cloud providers offer practically infinite resources for scale-out performance, and a wide selection of configurations for memory, processors, and storage.
  3. Reliability and availability - Cloud providers have developed many layers of redundancy throughout the entire technology stack, and perfected processes to avoid any interruption of service, even spanning geographic zones.
  4. Economics - Cloud providers enjoy massive economies of scale, and can offer resources and management of the same data for far less than most businesses could do on their own.
  5. Integration - Cloud providers have worked hard to offer and link together a wide range of services around analytics and applications, and made these often “one-click” compatible.
  6. Agility - Cloud users are unhampered by the burdens of procurement and management of resources that face a typical enterprise, and can adapt quickly to changing demands and enhancements. 
Advantages of a Cloud Data Lake – It is well established that a data lake is a powerful architectural approach to finding insights in untapped data, one that brings new agility to the business. The ability to harness more data, from more sources, in less time directly leads to a smarter organization making better business decisions, faster. The newfound capabilities to collect, store, process, analyze, and visualize high volumes of a wide variety of data drive value in many ways. Some of the advantages of a cloud data lake are given below –
  • Better security and availability than you could guarantee on-premises
  • Faster time to value for new projects
  • Data sources and applications already cloud-based
  • Faster time to deploy for new projects
  • More frequent feature/functionality updates
  • More elasticity (i.e., scaling up and down resources)
  • Geographic coverage
  • Avoid systems integration effort and risk of building infrastructure/platform
  • Pay-as-you-go (i.e., OpEx vs. CapEx)
A basic premise of the data lake is adaptability to a wide range of analytics and analytics-oriented applications and users, and AWS clearly has an enormous range of services to match any need. Many engines are available for specific analytics and data platform functions, and the additional enterprise needs are covered by services and utilities for security, access control, and compliance.

Tuesday, July 17, 2018

Data Lake Vs Data Warehouse


We know that data is a key business asset for any organisation; it must always be kept secure and accessible to business users whenever required.
In the current era, two approaches are very popular for storing data for business insights, so we are going to differentiate them in technical terms.

One is the Data Warehouse, a highly structured store of data that requires a significant amount of discovery, planning, data modeling, and development work before the data becomes available for analysis by business users.

The second is the Data Lake, a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed. We can say that a data lake is a more organic store of data, kept without regard for the perceived value or structure of the data.

Data Warehouses compared to Data Lakes - Depending on the business requirements, a typical organization will require both a data warehouse and a data lake, as they serve different needs and use cases.
| Characteristics | Data Warehouse | Data Lake |
| --- | --- | --- |
| Type of data stored | Structured data (most often in columns and rows in a relational database) from transactional systems, operational databases, and line-of-business applications | Any structure and any format, including structured, semi-structured, and unstructured data from IoT devices, web sites, mobile apps, social media, and corporate applications |
| Best way to ingest data | Batch processes | Streaming, micro-batch, or batch processes |
| Schema | Designed prior to the DW implementation (schema-on-write) | Structure of the data defined at the time of analysis (schema-on-read) |
| Typical load pattern | ETL (Extract, Transform, then Load) | ELT (Extract, Load, and Transform at the time the data is needed) |
| Price/Performance | Fastest query results using higher-cost storage | Query results getting faster using low-cost storage |
| Data Quality | Highly curated data that serves as the central version of the truth | Any data that may or may not be curated (i.e., raw data) |
| Users | Business analysts | Data scientists, data developers, and business analysts (using curated data) |
| Analytics pattern | Determine structure, acquire data, then analyze it, iterating back to change structure as needed; batch reporting, BI, and visualizations | Acquire data, analyze it, then iterate to determine its final structured form; machine learning, predictive analytics, data discovery and profiling |
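The ETL-versus-ELT distinction can be made concrete in a few lines: the lake loads records as-is and applies the transformation only when the data is read. A minimal sketch (the record format and helper name are invented for illustration):

```python
import json

# ELT, lake style: raw text records arrive from a source system.
raw_events = ['{"id": 1, "amount": "10.5"}', '{"id": 2, "amount": "3.0"}']

# Step 1: "load" the raw records into the lake untouched (here, just a list).
lake = list(raw_events)

# Step 2: transform at read time, casting amount to a number (schema-on-read).
def read_amounts(stored):
    return [float(json.loads(rec)["amount"]) for rec in stored]

print(sum(read_amounts(lake)))  # 13.5
```

In a warehouse-style ETL flow, the casting and validation would instead happen before anything is written, and only the cleaned result would be stored.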
During the development of a traditional data warehouse, we should expect to spend a considerable amount of time analyzing data sources, understanding business processes, profiling data, and modeling data.
In contrast, the default expectation for a data lake is to acquire all of the data and retain all of the data.

Monday, July 16, 2018

Benefits and capabilities of Data Lake

We know that data is a key business asset for any organisation; it must always be kept secure and accessible to business users whenever required.
A Data Lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed. We can say that a data lake is a more organic store of data, kept without regard for the perceived value or structure of the data.
The data lake is essential for any organization that wants to take full advantage of its data. The data lake arose because new types of data needed to be captured and exploited by the enterprise. As this data became increasingly available, early adopters discovered that they could extract insight through new applications built to serve the business.
It supports the following capabilities:
  • To capture and store raw data at scale for a low cost (e.g., a Hadoop-based data lake)
  • To store many types of data in the same repository – a data lake stores data as-is and supports structured, semi-structured, and unstructured data
  • To perform transformations on the data
  • To define the structure of the data at the time it is used, referred to as schema-on-read
  • To perform new types of data processing
  • To perform single-subject analytics based on particular use cases
  • To act as a catch-all for anything that does not fit into the traditional data warehouse architecture
  • To be accessed by users without deep technical and/or data analytics skills
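The schema-on-read capability above can be illustrated in a few lines: records of varying shape are stored untouched, and a structure is imposed only when they are read. A sketch (the field names are invented for illustration):

```python
import json

# Records arrive with inconsistent shapes; no schema is enforced on write.
stored = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b"}',                            # missing field
    '{"user": "c", "clicks": 7, "geo": "FR"}',  # extra field
]

def apply_schema(record: str) -> dict:
    """Impose the structure we need only at read time (schema-on-read)."""
    data = json.loads(record)
    return {"user": data["user"], "clicks": data.get("clicks", 0)}

rows = [apply_schema(r) for r in stored]
print(sum(row["clicks"] for row in rows))  # 10
```

A different analysis tomorrow could apply a different schema (say, keeping `geo`) to the same stored records, which is exactly the flexibility a warehouse's schema-on-write gives up.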
Salient points of a Data Lake - Some of the salient points are given below:
  1. A Data Lake stores data in 'near exact original format' and by itself does not provide data integration
  2. Data Lakes need to bring ALL data (including the relevant relational data)
  3. A Data Lake becomes meaningful ONLY when it becomes a Data Integration Platform
  4. A Data Integration Platform (a meaningful Data Lake) requires four major components: an ingestion layer, a multi-model NoSQL database for data persistence, transformation code (to cleanse, massage, and harmonize data), and a Hadoop cluster (to generate batch and real-time analytics)
  5. The goal of this architecture is to use 'the right technology solution for the right problem'
  6. This architecture utilises the foundational data management principle of ELT, not ETL. In fact, the T is continuous (T to the power of n): transformation (change) is continuous in every aspect of any thriving business, and Data Integration Platforms (meaningful Data Lakes) need to support that.
  7. So the process is as follows:
  • 1) Ingest ALL data
  • 2) Persist in a scalable multi-model NoSQL database  - RawDB
  • 3) Transform the data continuously - CleanDB
  • 4) Transport 'clean data' to Hadoop to generate Analytics
  • 5) Persist the 'Analytics' back in the NoSQL database - AnalyticsDB
  • 6) Expose the databases using REST endpoints
  • 7) Consume the data via applications
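The seven steps above can be sketched with in-memory stand-ins for the three databases (RawDB, CleanDB, AnalyticsDB); the helper functions and record fields are invented for illustration:

```python
# Minimal in-memory stand-ins for RawDB, CleanDB, and AnalyticsDB.
raw_db, clean_db, analytics_db = [], [], {}

def ingest(records):
    """1) Ingest ALL data into RawDB, as-is."""
    raw_db.extend(records)

def transform():
    """3) Transform continuously: cleanse and harmonize into CleanDB."""
    clean_db.clear()
    for rec in raw_db:
        name = rec.get("name", "").strip().title()
        if name:  # drop records we cannot cleanse
            clean_db.append({"name": name, "spend": float(rec.get("spend", 0))})

def analyze():
    """4-5) Generate analytics and persist them back into AnalyticsDB."""
    analytics_db["total_spend"] = sum(r["spend"] for r in clean_db)

ingest([{"name": " alice ", "spend": "10"}, {"name": "BOB", "spend": 5}, {"spend": 1}])
transform()
analyze()
print(analytics_db["total_spend"])  # 15.0
```

Steps 6 and 7 (REST endpoints and consuming applications) would then read from `clean_db` and `analytics_db` rather than from the raw store.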

Friday, June 22, 2018

Microsoft PowerShell - Which version of Windows

PowerShell is quite a good language which is already widely used on Windows, both by system administrators and by Microsoft for delivering management tools. PowerShell has gone open source and may actually become popular on Linux and macOS, where it is now available.


To find which version of Windows you are running, enter the following commands in PowerShell:

--- The command below returns the caption of the currently running Windows
wmic os get caption

--- The command below returns the architecture of the currently running Windows
wmic os get osarchitecture
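For scripting the same check across platforms, Python's standard `platform` module offers a rough equivalent (a sketch; unlike `wmic`, it also works on Linux and macOS, though the level of detail differs per OS):

```python
import platform

# Rough cross-platform analogue of the wmic queries above:
# on Windows, system() returns "Windows", machine() the architecture
# (e.g. "AMD64"), and version() the OS build details.
print(platform.system())
print(platform.machine())
print(platform.version())
```

On non-Windows systems the same three calls return the local kernel's name, architecture, and version string instead.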


You can see the following output after running the above commands in PowerShell -


Outside of its uses for systems administration, it also happens to be incredibly useful for penetration testers needing a good platform for post exploitation.

Wednesday, June 20, 2018

Artificial intelligence with SQL Server

Data is the business asset of any organisation, and it must be audited and protected. To stay ahead, it has become urgent for every organization to choose a few good predictive data models and validate them using test data before figuring out an operationalization plan for deploying the model to production so that applications can consume it.
It is true that data and artificial intelligence are growing together, and we have to agree that database platforms were previously used only for fundamental operations on data, in the form of queries or CRUD operations, along with some basic computation routines. With built-in R and Python support in the SQL Server 2017 release, SQL Server is in a unique position to fuel innovations that database professionals and developers can co-create with the data science and AI communities. The possibilities are endless.


In the current era, machine learning is a very fast-growing field because it is used for searching the web, placing ads, credit scoring, stock trading, and many other applications. With the help of artificial intelligence (AI), we are looking at an incredibly exciting time for marketing and customer experience, with huge benefits for the consumer and faster-than-real-time customer service.
AI services will transform how we all interact with media; by understanding our needs ahead of the game, they will change our lives. By entering into conversation directly with a company, and receiving a directly personalised service in return, we will feel that we are really being taken care of as individuals.

Artificial intelligence relies on huge volumes of data coming from heterogeneous sources, and no one denies that data movement is very costly for any organisation. By doing data science and AI where the data resides, there are many benefits. These include being able to take advantage of the enterprise-grade performance, scale, security, and reliability that you have come to expect from SQL Server over the years. The most important benefit is that we can eliminate the need for expensive data movement.
By encapsulating machine learning and AI models in SQL Server stored procedures, SQL Server can serve AI together with the data. There are other advantages to using stored procedures for operationalizing machine learning and AI (ML/AI) as well.
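To make the idea concrete, the sketch below trains a trivial linear model in pure Python and serializes it to bytes, roughly as one would persist a model for a stored procedure to deserialize and score with later (a simplified illustration, not SQL Server's actual API; in practice the bytes would live in a VARBINARY column and scoring would run via the in-database R/Python runtime):

```python
import pickle

# Fit a trivial model y = a*x + b from two points (pure Python, standing in
# for the R/Python models SQL Server 2017 can run in-database).
xs, ys = [1.0, 3.0], [5.0, 9.0]
a = (ys[1] - ys[0]) / (xs[1] - xs[0])
b = ys[0] - a * xs[0]
model = {"a": a, "b": b}

# Persist the trained model as bytes, as you would into a database column.
blob = pickle.dumps(model)

def score(model_blob: bytes, x: float) -> float:
    """Deserialize the persisted model and score one input with it."""
    m = pickle.loads(model_blob)
    return m["a"] * x + m["b"]

print(score(blob, 2.0))  # 7.0
```

The point is that training happens once, while `score` can run repeatedly next to the data, which is exactly the "serve AI with the data" pattern described above.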
That is why the Microsoft development team has, arguably, built the most complete Machine Intelligence (MI) technology suite in the market. The list of Microsoft’s MI technologies includes advanced platforms such as Azure ML and R Server, artificial intelligence APIs such as Microsoft Cognitive Services, data visualisation tools such as Power BI, and even vertical solutions included in the Cortana Intelligence Suite.
In addition, if a data science project involves working with spatial, temporal, unstructured, or semi-structured data, we can leverage SQL Server capabilities that let us do this efficiently, which takes important steps toward bringing new MI capabilities to the traditional database platform.
Most companies already possess reams of data that is not being used; they need to put it to work. Data, analysed by MI/AI, can be used to develop products and services based on patterns and trends of customer behaviour and preferences. In future, we are going to see MI/AI algorithms becoming as common as data access operations in database servers. Microsoft is in a unique position to lead this new trend but we should expect similar moves by competitors such as Oracle, IBM or newcomers such as MongoDB or Couchbase.

For businesses that are small or failing in the current environment where big players take all, AI/MI could provide a real advantage over the competition – these companies need to engage with it now, and fast, to make the best of this advantage.
Conclusion
SQL Server supports Python and R which will allow developers to implement MI/AI models that natively process data stored in SQL Server databases. Those MI/AI models can be directly persisted in the underlying database servers and scaled as part of SQL Server clusters. More importantly, developers will have access to these capabilities using the familiar SQL Server tool set.

Monday, June 18, 2018

What is Data Engineering

Data engineering ensures that all the right data (internal/external, structured/unstructured) is identified, sourced, cleaned, analyzed, and modelled, and that decisions are implemented, without losing granularity and value as the data travels this path.
Data engineering helps businesses by building robust capabilities to deal with the volume, velocity, reliability, and variety of data, and makes this data available for business users to consume, both as traditional marts and warehouses and as new-age big data ecosystems.
Data engineering deals with data: data lakes, clouds, pipelines, and platforms. The Data Warehouse is the base of a BI (Business Intelligence) project, and ETL (Extract, Transform and Load) is the base of the Data Warehouse.

Data Approaches: There are many data engineering approaches; the following are helpful for understanding the different techniques -
1. Implement Data Lakes/ Data Warehouses/ Data Marts: Help lay or enlarge the enterprise data foundation so a range of analytics solutions can be built on top
2. Develop Data Pipelines: Facilitate production grade end-to-end pipeline of data-to-value that takes data solutions from sandbox environments, and rolls them out to end users
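A production-grade data pipeline of the kind described in point 2 can be sketched as a list of stages, each a function from records to records (the `dedupe` and `enrich` stages are invented examples):

```python
def dedupe(records):
    """Drop records whose id has already been seen, keeping the first."""
    seen, out = set(), []
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append(r)
    return out

def enrich(records):
    """Add a derived field that downstream consumers need."""
    return [dict(r, flagged=r["value"] > 100) for r in records]

def run_pipeline(records, stages):
    """Pass the records through each stage in order."""
    for stage in stages:
        records = stage(records)
    return records

data = [{"id": 1, "value": 50}, {"id": 1, "value": 50}, {"id": 2, "value": 150}]
result = run_pipeline(data, [dedupe, enrich])
print([r["flagged"] for r in result])  # [False, True]
```

Keeping each stage a plain function makes the pipeline easy to test in a sandbox before rolling it out to end users, which is the point of the approach above.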
