Thursday, November 2, 2023

Data Engineering — Constructing a Standard Data Pipeline Design

One common challenge for developers and data professionals in data processing and analysis is whether to import data directly into a database or to store it as .csv files before loading. This decision is crucial because it affects data continuity, error logging, and data recovery in the event of a system failure.

The standard design pattern is to land incoming data in a blob storage layer (often called "raw"), standardise it into Parquet or another columnar format, and then load it into a database system; or, if you are using Spark with Delta Lake, build a Delta table whose location points at that storage.
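As a rough illustration, here is a minimal sketch of that pattern using PySpark with Delta Lake. It assumes a Spark session with the Delta extensions configured; the storage paths, database, and table names are hypothetical.

```python
# A minimal sketch of the raw -> Parquet/Delta pattern, assuming a Spark session
# with the Delta Lake extensions configured; all paths, database and table names
# are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-to-delta").getOrCreate()

# Hypothetical landing (raw) and standardised (curated) locations in blob storage.
raw_path = "abfss://raw@mystorage.dfs.core.windows.net/orders/2023-11-02/"
curated_path = "abfss://curated@mystorage.dfs.core.windows.net/orders/"

# 1. Read the raw files exactly as they arrived (JSON in this example).
raw_df = spark.read.json(raw_path)

# 2. Standardise into a columnar Delta/Parquet layout in the curated zone.
raw_df.write.format("delta").mode("overwrite").save(curated_path)

# 3. Register a Delta table whose LOCATION points at the curated storage path.
spark.sql("CREATE DATABASE IF NOT EXISTS curated")
spark.sql(f"CREATE TABLE IF NOT EXISTS curated.orders USING DELTA LOCATION '{curated_path}'")
```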

 

Blob storage is often preferable because it keeps things flexible and makes debugging easier, particularly if you can compress the data, which can easily save 90%+ on egress, ingress, and storage costs. However, if the volume of data coming from the API is small enough, it doesn’t really matter.
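To make the compression point concrete, here is a small sketch using only the standard library; the payload is synthetic, and the actual saving will vary with how repetitive your data is.

```python
# A rough, self-contained illustration of the compression saving using only the
# standard library; the payload is synthetic, and real savings depend on how
# repetitive your data is.
import gzip
import json

payload = json.dumps(
    [{"order_id": i, "status": "SHIPPED", "country": "AU"} for i in range(10_000)]
).encode("utf-8")
compressed = gzip.compress(payload)

print(
    f"raw: {len(payload):,} bytes, gzipped: {len(compressed):,} bytes "
    f"({100 * (1 - len(compressed) / len(payload)):.1f}% smaller)"
)
```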



It doesn’t really matter if you’re constructing a data warehouse in a relational database such as Oracle Exadata (Oracle’s engineered database platform).

 

Having said that, we typically standardise the incoming data, whether it arrives as JSON, CSV, XML, mainframe .dat files, or SAS sas7bdat files. The first step is to convert it into a columnar format like Parquet/ORC/Iceberg, which offers better compression and therefore saves space and money; the second step is to standardise the schema if you want to run ML models iteratively in the future.
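Here is a minimal sketch of that standardisation step, assuming pandas with pyarrow installed; the input file names are hypothetical.

```python
# A minimal sketch of the standardisation step, assuming pandas with pyarrow
# installed; the input file names are hypothetical.
import pandas as pd

# Read the data in whatever shape it arrives...
df_csv = pd.read_csv("raw/customers.csv")
df_json = pd.read_json("raw/orders.json", lines=True)          # newline-delimited JSON
df_sas = pd.read_sas("raw/claims.sas7bdat", format="sas7bdat")

# ...and write each data set out as compressed Parquet in the standardised zone.
for name, df in [("customers", df_csv), ("orders", df_json), ("claims", df_sas)]:
    df.to_parquet(f"standardised/{name}.parquet", compression="snappy", index=False)
```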

 

Then you have the data saved in the right format. A CSV file does not tell you what the data types are when you look at it, whereas in Parquet the data types/schema travel with the data, so you do not need to explain the schema to the target consumer. You also get the added benefit of keeping the data in an open-source format, so if you decide to shift to another platform tomorrow, the data is already available in a consumable form, eliminating the need to go back to the source.
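As a quick illustration of the schema travelling with the file, here is a sketch assuming pyarrow is installed and the hypothetical curated file from the previous step exists.

```python
# A small sketch showing that the schema travels with a Parquet file, assuming
# pyarrow is installed and the curated file from the previous step exists.
import pyarrow.parquet as pq

schema = pq.read_schema("standardised/orders.parquet")
print(schema)  # column names and types, e.g. order_id: int64, status: string

# A CSV copy of the same data carries none of this; the consumer would need
# separate documentation to know the types.
```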

 

Opinions —

Case 1: Some claim that simply importing data directly into a database system is more efficient.

Case 2: Others prefer the dependability and flexibility of storing data as .csv files in a blob location before moving it into the database system.

 

In this post, we’ll look at the advantages and disadvantages of each strategy, as well as why one could be preferable to the other.

 

Case 1: Loading Data Directly to a Database System

Loading data directly to a database has its merits. It eliminates the need for an intermediary step of saving data as .csv files, allowing for a more streamlined data processing pipeline. Here are a few reasons why some developers prefer this approach:

1. Efficiency and Performance — By bypassing the step of saving data as .csv, .json, or .parquet files, loading data directly into a database system can be faster and more efficient. Databases are designed for data storage and retrieval, making them an excellent solution for dealing with massive amounts of data. Furthermore, databases frequently have powerful indexing and query optimisation algorithms that can greatly accelerate data retrieval and analysis.

2. Real-Time Data Updates — Adding data directly to a database helps ensure that your analysis is always based on the most up-to-date information if your data source is constantly updated in real-time. This is especially handy when working with time-sensitive data or when the data must be updated often.

3. Data Integrity and Consistency — Databases enforce data integrity constraints, ensuring that the data being loaded complies with predetermined rules and standards. By loading data directly into a database, you can use these constraints to preserve consistency and correctness in your data. This is especially valuable when working with data from many sources or clients (a small constraint sketch follows this list).
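Here is a minimal sketch of that constraint enforcement using the standard-library sqlite3 module; the table and the rules are hypothetical, but the behaviour is typical of any relational database.

```python
# A minimal sketch of constraint enforcement, using the standard-library sqlite3
# module; the table and rules are hypothetical, but the behaviour is typical of
# any relational database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        amount   REAL NOT NULL CHECK (amount >= 0),
        status   TEXT NOT NULL CHECK (status IN ('NEW', 'SHIPPED', 'CANCELLED'))
    )
    """
)

conn.execute("INSERT INTO orders VALUES (1, 99.50, 'NEW')")      # accepted

try:
    conn.execute("INSERT INTO orders VALUES (2, -5.00, 'NEW')")  # violates the CHECK constraint
except sqlite3.IntegrityError as exc:
    print(f"rejected by the database: {exc}")
```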

 

Case 2: Saving Data as a Raw File Before Ingestion into the Database System

While loading data directly to a database system has its advantages, there are also compelling reasons to save data as .csv, .json, or .parquet files before loading. Let’s explore some of these reasons:

1. Data Continuity and Backup — One of the main concerns with loading data directly to a database is the potential risk of data loss in case of system failures or other unforeseen circumstances. By saving data as .csv files, you create a backup that can be easily accessed and restored if needed. This provides an additional layer of data continuity and ensures that you have a copy of your data even if the database becomes inaccessible.

2. Easy Error Logging and Data Validation — Saving data as .csv files allows for easy error logging and data validation. When data is loaded directly into a database, it can be challenging to identify and track errors that occur during the loading process. However, by first saving data as .csv files, you can easily log any errors encountered and validate the data before it is loaded into the database. This helps in identifying and resolving issues early on, improving data quality and accuracy (a small validation sketch follows this list).

3. Flexibility with Multiple Clients and Data Sources — Saving data as .csv files provides a standardized way to load data from multiple clients and data sources. Different clients may have their own data formats or requirements, and saving data as .csv files allows for seamless integration and compatibility. It provides the flexibility to transform and manipulate data before loading it into the database, ensuring that it aligns with your specific business needs.
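A rough sketch of that pre-load validation and error logging, using only the standard library; the file names and the business rules are hypothetical.

```python
# A rough sketch of pre-load validation and error logging, using only the
# standard library; the file names and the business rules are hypothetical.
import csv

def is_valid(row: dict) -> bool:
    # Hypothetical rules: every row needs an order_id and a non-negative amount.
    try:
        return bool(row["order_id"]) and float(row["amount"]) >= 0
    except (KeyError, ValueError):
        return False

good_rows, bad_rows = [], []
with open("raw/incoming.csv", newline="") as fh:
    reader = csv.DictReader(fh)
    fieldnames = reader.fieldnames or []
    for row in reader:
        (good_rows if is_valid(row) else bad_rows).append(row)

# Log the rejects to a side file so they can be inspected and replayed later.
with open("raw/incoming_rejects.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(bad_rows)

print(f"{len(good_rows)} rows ready to load, {len(bad_rows)} rows logged for review")
```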

 

As we all know, API and database rate limits, availability windows, and so on will differ. If the database is down for maintenance, for example, you can simply keep the API scraper running and backfill when the database comes back up.
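One way to picture that decoupling is the sketch below: the scraper always lands raw files on storage, and a separate backfill job replays whatever has not yet been loaded. The directory, marker-file convention, and the load_into_db callable are hypothetical.

```python
# A sketch of decoupling ingestion from the database: the scraper keeps landing
# raw files even while the database is down, and a backfill job replays anything
# not yet loaded. Directory, marker convention, and load_into_db are hypothetical.
import json
import time
from pathlib import Path

RAW_DIR = Path("raw/api_dumps")
RAW_DIR.mkdir(parents=True, exist_ok=True)

def land_raw(payload: dict) -> Path:
    """Writing to storage does not depend on the database being up."""
    path = RAW_DIR / f"{int(time.time() * 1000)}.json"
    path.write_text(json.dumps(payload))
    return path

def backfill(load_into_db) -> None:
    """Replay every raw file that has not yet been marked as loaded."""
    for path in sorted(RAW_DIR.glob("*.json")):
        marker = path.with_suffix(".loaded")
        if marker.exists():
            continue
        load_into_db(json.loads(path.read_text()))
        marker.touch()
```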

 

The Best of Both Worlds: A Hybrid Approach

Rather than choosing between loading data directly to a database or saving it as .csv files, a hybrid approach can offer the best of both worlds. This approach involves leveraging the strengths of both methods to create a robust and flexible data processing pipeline.

Here’s how it can work:

1. Extract data from the source and save it as .csv, .json, or .parquet files in a designated blob storage. This provides a backup and allows for easy error logging and data validation.

2. Standardize and transform the data into a more efficient columnar format (such as Parquet) within the blob storage. This improves data storage and retrieval performance.

3. Load the transformed data from the blob storage into the database. This ensures data continuity, enables real-time updates, and takes advantage of the database’s indexing and query optimization capabilities.

By adopting a hybrid approach, you can address the concerns of data continuity, error logging, and performance while maintaining the flexibility to handle different data sources and clients.
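Putting the three steps together, here is a compact end-to-end sketch using pandas (with pyarrow), the requests library, and sqlite3 as a stand-in warehouse; the API URL, paths, and table name are all hypothetical.

```python
# A compact end-to-end sketch of the hybrid approach, assuming pandas (with
# pyarrow) and the requests library are installed, and sqlite3 as a stand-in
# warehouse; the API URL, paths, and table name are hypothetical.
import sqlite3
from pathlib import Path

import pandas as pd
import requests

Path("raw").mkdir(exist_ok=True)
Path("curated").mkdir(exist_ok=True)

# 1. Extract from the source and land the raw payload in blob/disk storage.
records = requests.get("https://api.example.com/orders", timeout=30).json()
pd.DataFrame(records).to_json("raw/orders.json", orient="records", lines=True)

# 2. Standardise into a columnar format within storage.
df = pd.read_json("raw/orders.json", lines=True)
df.to_parquet("curated/orders.parquet", index=False)

# 3. Load the curated data into the database.
with sqlite3.connect("warehouse.db") as conn:
    pd.read_parquet("curated/orders.parquet").to_sql(
        "orders", conn, if_exists="append", index=False
    )
```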

 

Keep the data in its original form, together with the necessary metadata (what query, run when, against which API, and so on). After that, move it into organised storage. Storage is inexpensive, and you now have independent concerns that are auditable and replayable.
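A small sketch of landing the raw response together with a metadata sidecar; the endpoint, directory layout, and naming convention are hypothetical.

```python
# A small sketch of landing the raw response together with a metadata sidecar
# (what query, when, against which API); the directory layout and file-naming
# convention are hypothetical.
import json
from datetime import datetime, timezone
from pathlib import Path

def land_with_metadata(response_body: bytes, endpoint: str, query: dict, out_dir: str = "raw") -> None:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

    # The payload, untouched, plus a sidecar describing how it was captured.
    (Path(out_dir) / f"{stamp}_orders.json").write_bytes(response_body)
    (Path(out_dir) / f"{stamp}_orders.meta.json").write_text(json.dumps({
        "endpoint": endpoint,
        "query": query,
        "captured_at_utc": stamp,
    }))
```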

Aside from that, we’ve found it quite useful for reverse engineering undocumented data sets, because you can search for a known value and discover the column name right next to it. In a database, by contrast, you must already know the column name in order to search.
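A quick sketch of that trick, assuming pandas with pyarrow and a hypothetical raw extract: scan every column for a known value and report which columns contain it.

```python
# A quick sketch of the reverse-engineering trick, assuming pandas with pyarrow
# and a hypothetical raw extract: scan every column for a known value and
# report where it appears.
import pandas as pd

df = pd.read_parquet("raw/undocumented_extract.parquet")
target = "ACC-10023"  # a value you already know, e.g. an account number

matching_columns = [col for col in df.columns if df[col].astype(str).eq(target).any()]
print(f"value {target!r} appears in columns: {matching_columns}")
```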

 

Conclusion

When it comes to loading data directly into a database or saving it as .csv, .json, or .parquet files, there is no one-size-fits-all approach. The decision should be based on your specific requirements, data characteristics, and business needs. While loading data directly into a database offers efficiency and real-time updates, saving data as .csv files provides data continuity, easy error logging, and flexibility with multiple clients and data sources.

Consider the nature of your data, the frequency of updates, the risk of system failures, and the need for data validation when making this decision. In some cases, a hybrid approach that combines the benefits of both methods may be the best solution. Ultimately, the goal is to create a robust and reliable data processing pipeline that meets your organization’s unique needs.

 

To learn more, please follow us -
http://www.sql-datatools.com
To learn more, please visit our YouTube channel at -
http://www.youtube.com/c/Sql-datatools
To learn more, please visit our Instagram account at -
https://www.instagram.com/asp.mukesh/
To learn more, please visit our Twitter account at -
https://twitter.com/macxima

 
