Wednesday, May 8, 2024

Cloud Functions - How to Read PDF Files on GCS Events and Store in BigQuery

In this tutorial, you will learn how to create an event-driven Cloud Function in GCP that reads PDF files from Google Cloud Storage (GCS) and loads their contents into BigQuery.

Google Cloud Functions provide a serverless environment for running event-driven code, making them ideal for a variety of data engineering applications. They enable you to create single-purpose functions that are triggered by events such as data changes, user activities, or time-based schedules, without the need to set up or manage servers.

Google Cloud Functions are a flexible and efficient way to build event-driven apps and microservices, allowing developers to focus on writing code rather than worrying about infrastructure administration.

Real-time data processing in Google Cloud Functions means handling events from streaming data sources such as Cloud Pub/Sub, Cloud Storage, and Cloud IoT Core in near real time.



🌿Roles and Permissions -
Ensure that you have appropriate permissions for accessing GCS and BigQuery in your Cloud Function's service account.
👍Cloud Functions Invoker: to execute the Cloud Function
👍Service Account User: to interact with other Google Cloud services
👍BigQuery Data Editor: to edit data in BigQuery datasets
👍Storage Object Admin: to read, write, update, and delete objects in the GCS bucket

🌿Cloud Function Configuration & Trigger: Trigger Type, Event Type, Bucket, Retry on failure
🌿Runtime, Build, Connections and Security Settings: Memory Allocated, CPU, Timeout, Concurrency, Autoscaling
🌿Runtime Service Account
🌿Runtime Environment Variables
🌿Cloud Function Code
🎯Runtime Language and Entry Point
🎯Supportive Python Packages/Libraries
🚀PyPDF2 - PyPDF2 is a Python library for working with PDF files. It allows you to perform various operations such as reading, writing, and modifying PDF documents. 
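As a minimal sketch of how PyPDF2 fits into this pipeline (assuming PyPDF2 3.x and its PdfReader API; the function name here is illustrative), the raw bytes of a PDF can be wrapped in an in-memory file-like object and read page by page:

```python
import io


def pdf_bytes_to_pages(pdf_bytes: bytes) -> list:
    """Wrap raw PDF bytes in a file-like object and extract the text of each page."""
    from PyPDF2 import PdfReader  # third-party: pip install PyPDF2

    # io.BytesIO turns the byte string into a file-like object,
    # so no temporary file is needed on disk.
    reader = PdfReader(io.BytesIO(pdf_bytes))
    return [page.extract_text() or "" for page in reader.pages]
```

Because PdfReader accepts any file-like object, io.BytesIO lets the function avoid local disk entirely, which matters in Cloud Functions where only /tmp is writable.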
🚀google-cloud-storage - To interact with Google Cloud Storage in Python, you can use the google-cloud-storage library, which is an official client library provided by Google Cloud Platform. This library allows you to perform various operations such as uploading, downloading, and managing objects in Google Cloud Storage buckets. 
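The snippet below sketches the GCS side under the assumption that source PDFs live under a directory prefix such as `incoming/` (the prefix and function names are assumptions, not fixed by this post). The helper also covers the path and extension checks mentioned later:

```python
def is_target_pdf(blob_name: str, prefix: str = "incoming/") -> bool:
    """Verify the source directory path and the .pdf file extension."""
    return blob_name.startswith(prefix) and blob_name.lower().endswith(".pdf")


def download_pdf_bytes(bucket_name: str, blob_name: str) -> bytes:
    """Download a GCS object as raw bytes (no local temp file needed)."""
    from google.cloud import storage  # third-party: pip install google-cloud-storage

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    return blob.download_as_bytes()
```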
🚀google-cloud-bigquery - To interact with Google BigQuery in Python, you can use the google-cloud-bigquery library, which is an official client library provided by Google Cloud Platform. This library allows you to execute SQL queries, manage datasets and tables, and perform other operations on Google BigQuery.
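A hedged sketch of the BigQuery load step: the table ID, column names, and row shape below are assumptions chosen to match a simple one-row-per-page layout, not the exact schema from this post:

```python
def pdf_pages_to_rows(file_name: str, pages: list) -> list:
    """Shape per-page text into JSON rows matching the schema below."""
    return [
        {"file_name": file_name, "page_number": i + 1, "page_text": text}
        for i, text in enumerate(pages)
    ]


def load_rows_to_bigquery(rows: list, table_id: str) -> None:
    """Load JSON rows into BigQuery with an explicit schema, appending to the table."""
    from google.cloud import bigquery  # third-party: pip install google-cloud-bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("file_name", "STRING", mode="REQUIRED"),
            bigquery.SchemaField("page_number", "INTEGER", mode="REQUIRED"),
            bigquery.SchemaField("page_text", "STRING", mode="NULLABLE"),
        ],
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = client.load_table_from_json(rows, table_id, job_config=job_config)
    job.result()  # block until the load job finishes
```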
🚀Read the PDF file from the GCS bucket in byte format
🚀Convert the byte object to a file-like object
🎯Conditions
🚀Verify the source directory path
🚀Verify the file extension
🎯Reading the PDF File from GCS
🚀Download the file in byte format
🚀Call the PDF conversion method
🚀Get the PDF data as text in a JSON array list
🚀Delete the PDF file from the GCS bucket
🎯Load Data into BigQuery
🚀Initialize the BigQuery client
🚀Create a schema variable to match the BQ table
🚀Set the BigQuery job configuration
🚀Run the BigQuery load job
🌿Save, Deployment and Testing -
🎯Upload PDF files into the GCS bucket
🎯Check and verify the logs
🎯Verify the data in the BigQuery table
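The steps above can be sketched as one background Cloud Function entry point triggered by a GCS finalize event. The source prefix, target table ID, and function name are illustrative assumptions, not the exact code from this post:

```python
import io

SOURCE_PREFIX = "incoming/"  # assumed source directory inside the bucket
TABLE_ID = "my-project.my_dataset.pdf_pages"  # assumed target BigQuery table


def process_pdf_event(event, context=None):
    """Entry point for a GCS finalize trigger: read the PDF, load it, delete it."""
    bucket_name, blob_name = event["bucket"], event["name"]

    # Conditions: verify the source directory path and the file extension
    if not (blob_name.startswith(SOURCE_PREFIX) and blob_name.lower().endswith(".pdf")):
        return "skipped"

    from google.cloud import bigquery, storage
    from PyPDF2 import PdfReader

    # Download the file in byte format and convert it to a file-like object
    storage_client = storage.Client()
    blob = storage_client.bucket(bucket_name).blob(blob_name)
    reader = PdfReader(io.BytesIO(blob.download_as_bytes()))

    # Get the PDF data as text in a JSON array list (one row per page)
    rows = [
        {"file_name": blob_name, "page_number": i + 1,
         "page_text": page.extract_text() or ""}
        for i, page in enumerate(reader.pages)
    ]

    # Load the rows into BigQuery, then delete the processed PDF from the bucket
    bq_client = bigquery.Client()
    bq_client.load_table_from_json(rows, TABLE_ID).result()
    blob.delete()
    return "loaded"
```

With "Retry on failure" enabled on the trigger, deleting the source object only after the load job succeeds keeps the function safe to re-run on the same event.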


⭐To learn more, please follow us - http://www.sql-datatools.com
⭐To learn more, please visit our YouTube channel at - http://www.youtube.com/c/Sql-datatools
⭐To learn more, please visit our Instagram account at - https://www.instagram.com/asp.mukesh/
⭐To learn more, please visit our twitter account at - https://twitter.com/macxima
⭐To learn more, please visit our Medium account at - https://medium.com/@macxima
