PySpark is the Python API for Apache Spark, an analytical processing engine for large-scale, powerful distributed data processing and machine learning applications.
If you work as a PySpark developer, data scientist, or data analyst, you often need to load data from a nested data directory. These nested directories are typically created by an ETL job that keeps writing data for different dates into different folders, and you then want to read those CSV files into a Spark DataFrame for further analysis. In this article, I am going to talk about loading data from nested folders.

Note: I’m using a Jupyter Notebook for this process and assuming that you have already set up PySpark on it.
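If PySpark is not installed in your notebook environment yet, one common way to add it (assuming a standard pip-based setup) is:
#install PySpark into the notebook environment (run once)
!pip install pyspark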
Step 1: Import all the necessary libraries in our code as given below —
- SparkContext is the entry gate to Apache Spark functionality. The first step of any Spark driver application is to create a SparkContext, which represents the connection to a Spark cluster and can be used to create RDDs, accumulators and broadcast variables on that cluster.
- SparkSession is the entry point to the underlying PySpark functionality and can be used to programmatically create PySpark RDDs and DataFrames.
- SQLContext can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files; it is backed by a SparkContext and wraps a SparkSession.
- SparkConf holds the configuration parameters needed to run a Spark application locally or on a cluster.
#import all the libraries of pyspark.sql
from pyspark.sql import *
#import SparkContext and SparkConf
from pyspark import SparkContext, SparkConf
Step 2: Configure the Spark application, start the Spark cluster and initialize a SQLContext for DataFrames
#setup configuration properties
#set the master URL
#set an application name
conf = SparkConf().setMaster("local").setAppName("sparkproject")
#start spark cluster
#if already started then get it else start it
sc = SparkContext.getOrCreate(conf=conf)
#initialize SQLContext from spark cluster
sqlContext = SQLContext(sc)
Method 1: Declare a variable for the directory path and build the file path list, using a * wildcard for each level of nesting as shown below:
#variable to hold the main directory path
dirPath='/content/PySparkProject/Datafiles'
#variable to store file path list from main directory
Filelists=sc.wholeTextFiles("/content/PySparkProject/Datafiles/*/*.csv").map(lambda x: x[0]).collect()
In my case, the directory structure is even more nested and complex; a deeper layout can be handled with additional wildcards, as sketched below.

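For illustration, if the CSV files sit two folders deep (say Datafiles/&lt;year&gt;/&lt;month&gt;/file.csv, a hypothetical layout), you would simply add one more * wildcard per extra level:
#hypothetical deeper layout: add one * per extra folder level
Filelists=sc.wholeTextFiles(dirPath + "/*/*/*.csv").map(lambda x: x[0]).collect()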
Read the data into a DataFrame by using a for loop:
#for loop to read each file into dataframe from Filelists
for filepath in Filelists:
    print(filepath)
    #read data into dataframe by using filepath
    df=sqlContext.read.csv(filepath, header=True)
    #show data from dataframe
    df.show()
Above, each CSV file is read into a PySpark DataFrame using sqlContext with the full file path, and the header option is set to true so the actual header columns are read from the file.
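If you also want all files in a single DataFrame, one possible approach (a sketch, assuming every CSV shares the same columns) is to union the per-file DataFrames:
#sketch: union all per-file dataframes into one (assumes identical schemas)
combined_df = None
for filepath in Filelists:
    df = sqlContext.read.csv(filepath, header=True)
    combined_df = df if combined_df is None else combined_df.unionByName(df)
#show the combined data
combined_df.show()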
Sample Output -

Method 2: Spark 3.0 provides the option recursiveFileLookup to load files from nested subfolders. It recursively loads all files from the given directory and its subfolders.
#set sparksession
sparkSession=SparkSession(sc)
#variable to hold the main directory path
dirPath='/content/PySparkProject/Datafiles'
#read files from nested directories
df= sparkSession.read.option("recursiveFileLookup","true").option("header","true").csv(dirPath)
#show data from data frame
df.show()

You can enable the recursiveFileLookup option at read time, which makes Spark read the files recursively. This improvement makes loading data from nested folders much easier. The same option is available for all the file-based connectors such as Parquet, Avro, etc.
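For example, the same option should work with the Parquet reader as well (a sketch; it assumes the directory also contains Parquet files):
#sketch: recursive lookup with a parquet source (assumes dirPath contains parquet files)
parquet_df = sparkSession.read.option("recursiveFileLookup", "true").parquet(dirPath)
parquet_df.show()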
As you can see, it is now a very easy task to read all files from nested folders or sub-directories in PySpark.