Friday, April 22, 2022

PySpark — Read All files from nested Folders/Directories

PySpark is the Python API for Apache Spark, an analytics engine for large-scale distributed data processing and machine learning applications. In this post we will read all the CSV files stored inside nested folders/directories, using two approaches.

Note: I’m using a Jupyter Notebook for this walkthrough and assuming you have already set up PySpark on it.
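If PySpark isn’t installed yet, a quick way to get it (assuming a plain Python environment) is to install it from PyPI in a notebook cell:
#install PySpark from PyPI (run once in a notebook cell)
!pip install pyspark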
#import all the libraries of pyspark.sql
from pyspark.sql import *
#import SparkContext and SparkConf
from pyspark import SparkContext, SparkConf
#set up the configuration: the master URL and an application name
conf = SparkConf().setMaster("local").setAppName("sparkproject")
#start the Spark cluster (if already started, get it; else start it)
sc = SparkContext.getOrCreate(conf=conf)
#initialize SQLContext from the Spark context
sqlContext = SQLContext(sc)
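Side note: on recent Spark releases SQLContext is deprecated in favour of SparkSession, so an equivalent setup would look like this (a minimal sketch):
#equivalent setup on newer Spark versions, where SparkSession replaces SQLContext
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("sparkproject").getOrCreate()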
#variable to hold the main directory path
dirPath = '/content/PySparkProject/Datafiles'
#list the path of every CSV file one level below the main directory
Filelists = sc.wholeTextFiles(dirPath + "/*/*.csv").map(lambda x: x[0]).collect()
#loop over Filelists and read each file into a dataframe
for filepath in Filelists:
    print(filepath)
    #read data into a dataframe by using filepath
    df = sqlContext.read.csv(filepath, header=True)
    #show data from the dataframe
    df.show()
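If you want one dataframe instead of one per file, you could union them instead of just showing each one — a sketch, assuming every file shares the same columns:
#combine the per-file dataframes into one (assumes all files have the same columns)
from functools import reduce
dfs = [sqlContext.read.csv(p, header=True) for p in Filelists]
combined = reduce(lambda a, b: a.unionByName(b), dfs)
combined.show()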
The second approach lets Spark walk the nested directories itself via the recursiveFileLookup reader option (available since Spark 3.0), so no glob pattern or loop is needed:
#create a SparkSession from the existing Spark context
sparkSession = SparkSession(sc)
#variable to hold the main directory path
dirPath = '/content/PySparkProject/Datafiles'
#read all CSV files from the nested directories in one call
df = sparkSession.read.option("recursiveFileLookup", "true").option("header", "true").csv(dirPath)
#show data from the dataframe
df.show()
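When many files land in one dataframe like this, it can help to record which file each row came from — a small sketch using input_file_name() from pyspark.sql.functions:
#add a column recording the source file path of each row
from pyspark.sql.functions import input_file_name
df_with_src = df.withColumn("source_file", input_file_name())
df_with_src.show(truncate=False)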
