I'm trying to read multiple csv files from a specific folder. However, even with only two files in the folder, the read never finishes. Here is the code:
df = spark.read.csv("data/*.csv",sep =';', header = True)
df.printSchema()
When I run the same code for one file only, it only takes a few seconds.
df = spark.read.csv("data/file1.csv",sep =';', header = True)
df.printSchema()
I'm probably doing something wrong, but I don't know what it is. The two files have the exact same header and are similar in size.
Any idea?
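In case it helps narrow things down, spark.read.csv also accepts an explicit list of paths, so one option (assuming the two files are data/file1.csv and data/file2.csv) is to read them as a list and then one at a time, to see whether a particular file is the one that hangs:
paths = ["data/file1.csv", "data/file2.csv"]  # assumed file names
# Reading an explicit list avoids the glob and makes it easy to drop
# one file at a time to find the read that never finishes.
df = spark.read.csv(paths, sep=';', header=True)
df.printSchema()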
Hope you can help me.
I have a folder where there are several .xlsx files with similar structure (NOTE that some of the files might be bigger than 50MB). I want to combine them all together and (eventually) send them to a database. But before that, I need to improve the performance of this block of code because sometimes it takes a lot of time to process all those files.
The code in question is this:
df_list = []
for file in location:
    df_list.append(pd.read_excel(file, header=0, engine='openpyxl'))
df_concat = pd.concat(df_list)
Any suggestions?
Somewhere I read that converting Excel files to CSV might improve the performance, but should I do that before appending the files or after everything is concatenated?
And considering df_list is a list, can I do that conversion?
I've found a solution with xlsx2csv
import glob
import re
import subprocess

import pandas as pd

xlsx_path = './data/Extract/'
csv_path = './data/csv/'
list_of_xlsx = glob.glob(xlsx_path + '*.xlsx')

for xlsx in list_of_xlsx:
    # Extract the file name (group 2 of the regex)
    filename = re.search(r'(.+[\\|\/])(.+)(\.(xlsx))', xlsx).group(2)
    # Set up the call for subprocess.call()
    call = ["python", "./xlsx2csv.py", xlsx, csv_path + filename + '.csv']
    try:
        subprocess.call(call)  # On Windows use shell=True
    except Exception:
        print('Failed with {}'.format(xlsx))

outputcsv = './data/bigcsv.csv'  # filepath + filename of the output csv
listofdataframes = []
for file in glob.glob(csv_path + '*.csv'):
    df = pd.read_csv(file)
    if df.shape[1] == 24:  # make sure there are 24 columns
        listofdataframes.append(df)
    else:
        print('{} has {} columns - skipping'.format(file, df.shape[1]))

bigdataframe = pd.concat(listofdataframes).reset_index(drop=True)
bigdataframe.to_csv(outputcsv, index=False)
I tried to make this work for me but had no success. Maybe you can get it working for you? Or does anyone have any other ideas?
Reading Excel files is quite slow in pandas, as you stated; you should have a look at this answer. It basically uses a VBScript, run before the Python script, to convert the Excel files to CSV, which is much faster for the Python script to read.
To be more specific, and to answer the second part of your question: you should convert the Excel files to CSV before loading them with pandas. The read_excel call is the slow part.
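For example, here is a minimal sketch of that workflow, assuming the xlsx2csv package is installed (pip install xlsx2csv) and that the folder names below are placeholders: each workbook is converted to CSV once, and pandas then reads the much faster CSV files.
import glob
import os

import pandas as pd
from xlsx2csv import Xlsx2csv  # assumes the xlsx2csv package is available

xlsx_dir = './data/Extract/'  # placeholder input folder
csv_dir = './data/csv/'       # placeholder output folder

# Convert each workbook up front; later runs can skip straight to the CSVs.
for xlsx in glob.glob(os.path.join(xlsx_dir, '*.xlsx')):
    name = os.path.splitext(os.path.basename(xlsx))[0]
    Xlsx2csv(xlsx, outputencoding='utf-8').convert(os.path.join(csv_dir, name + '.csv'))

# Reading the converted CSVs is far faster than pd.read_excel on the originals.
df_concat = pd.concat(
    (pd.read_csv(f, header=0) for f in glob.glob(os.path.join(csv_dir, '*.csv'))),
    ignore_index=True,
)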
There are around 10k .csv files, named data0, data1 and so on in sequence, that I want to combine into a master sheet in one file, or at least a couple of sheets, using Python, because I think there is a limit of about 1,048,576 rows per Excel sheet.
import pandas as pd
import os
master_df = pd.DataFrame()
for file in os.listdir(os.getcwd()):
    if file.endswith('.csv'):
        master_df = master_df.append(pd.read_csv(file))
master_df.to_csv('master file.CSV', index=False)
A few things to note:
Please check your CSV file content first. Columns can easily end up mismatched when a text field contains the delimiter (maybe a ; in the content). You can also try changing the separator, encoding, or parser engine:
df= pd.read_csv(csvfilename,sep=';', encoding='utf-8',engine='python')
If you want to combine everything into one sheet, you can concat into one DataFrame first, then call to_excel:
df = pd.concat([df,sh_tmp],axis=0,sort=False)
Note: concat or append is a straightforward way to combine data. However, 10k files turns this into a performance concern; collect the frames in a list and concatenate once, rather than calling concat/append repeatedly, if you are facing performance issues (see the sketch after these notes).
Excel has a maximum row limit. 10k files would easily exceed it (1,048,576 rows per sheet). You might change the output to a CSV file or split it across multiple sheets/.xlsx files.
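A minimal sketch of that list-then-concat pattern, assuming the files are named data0.csv, data1.csv, ... and sit in the working directory (the glob pattern and output name are placeholders):
import glob

import pandas as pd

# Read each file once, collect the frames in a plain list,
# and concatenate a single time at the end; appending to a
# DataFrame 10k times copies the data over and over.
frames = [pd.read_csv(f) for f in sorted(glob.glob('data*.csv'))]
master_df = pd.concat(frames, ignore_index=True)

# Writing CSV sidesteps Excel's 1,048,576-row-per-sheet limit.
master_df.to_csv('master_file.csv', index=False)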
---- Update 3 ----
You can try grouping the data first (1,000,000 rows per group), then writing the groups to separate sheets one by one:
row_limit = 1000000
master_df = master_df.reset_index(drop=True)  # ensure a clean 0..n-1 index before grouping
master_df['group'] = master_df.index // row_limit
writer = pd.ExcelWriter(path_out)
for gr in range(0, master_df['group'].max() + 1):
    master_df.loc[master_df['group'] == gr].to_excel(writer, sheet_name='Sheet' + str(gr), index=False)
writer.save()
I have a large set of data that I am trying to extract from multiple Excel files that have multiple sheets, using Python, and then write that data into a new Excel file. I am new to Python and have tried various tutorials to come up with code that can help me automate the process. However, I have reached a point where I am stuck and need some guidance on how to write the data that I extract to a new Excel file. If someone could point me in the right direction, it would be greatly appreciated. See the code below:
import os
from pandas.core.frame import DataFrame
path = r"Path where all excel files are located"
os.chdir(path)
for WorkingFile in os.listdir(path):
    if os.path.isfile(WorkingFile):
        DataFrame = pd.read_excel(WorkingFile, sheet_name = None, header = 12, skipfooter = 54)
        DataFrame.to_excel(r'Empty excel file where to write all the extracted data')
When I execute the code I get the error "AttributeError: 'dict' object has no attribute 'to_excel'". I am not sure how to rectify this error; any help would be appreciated.
A little more background on what I am trying to do: I have a folder with about 50 Excel files, and each file might have multiple sheets. The data I need is located in a table that consists of one row and 14 columns and sits in the same location in each file and each sheet. I need to pull that data and compile it into a single Excel file. When I run the code above with a print statement added, it shows me the exact data I want, but when I try to write it to Excel it doesn't work.
Thanks for help in advance!
Not sure why you're importing DataFrame instead of pandas; your code looks incomplete. The code below should clear up your doubts. (It doesn't include any conditions for excluding non-Excel files, directories, etc.)
import pandas as pd
import os

path = "Dir path to excel files"  # Path
df = pd.DataFrame()  # Initialize empty df
for file in os.listdir(path):
    data = pd.read_excel(os.path.join(path, file))  # Read each file from dir (join handles the path separator)
    df = df.append(data, ignore_index=True)  # and append to empty df
# process df
df.to_excel("path/file.xlsx")
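If you do need every sheet (the sheet_name=None call in the question is what returns a dict and triggers the AttributeError), a hedged variant along these lines flattens the dict before concatenating; the header=12 and skipfooter=54 values are simply carried over from the question, and the paths are placeholders:
import os

import pandas as pd

path = "Dir path to excel files"  # placeholder path
frames = []
for file in os.listdir(path):
    # sheet_name=None returns {sheet name: DataFrame}, one entry per sheet
    sheets = pd.read_excel(os.path.join(path, file), sheet_name=None,
                           header=12, skipfooter=54)
    frames.extend(sheets.values())

pd.concat(frames, ignore_index=True).to_excel("path/file.xlsx", index=False)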
I'm having trouble dropping columns and saving the new data frame as a CSV file.
Code:
import pandas as pd
file_path = 'Downloads/editor_events.csv'
df = pd.read_csv(file_path, index_col = False, nrows= 1000)
df.to_csv(file_path, index = False)
df.to_csv(file_path)
The code executes and doesn't give any error. I've looked in my root directory but can't see any new csv file
Check for the file in the folder from which you are running the Python script. You are saving it with the same name, so you can check the modified time to confirm it was written. Also, as posted, the code does not drop any columns; it just takes 1000 rows and saves them.
First: you are saving the same file that you are reading, so you won't see any new csv files. All you are doing right now is rewriting the same file.
But since I can guess you are just showing this as a simple example of what you want to do, I'll move on to the second point:
Make sure that your path is correct. Try writing the full path, like r'C:\Users\AwesomeUser\Downloads\editor_events.csv' (note the raw string, so the backslashes are not treated as escapes) instead of just 'Downloads/editor_events.csv'.
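Putting both points together, here is a small sketch that actually drops columns and writes the result to a different file; the paths and column names below are placeholders, not your real ones:
import pandas as pd

in_path = r'C:\Users\AwesomeUser\Downloads\editor_events.csv'           # full input path
out_path = r'C:\Users\AwesomeUser\Downloads\editor_events_trimmed.csv'  # different output file

df = pd.read_csv(in_path, nrows=1000)
df = df.drop(columns=['col_to_drop_1', 'col_to_drop_2'])  # placeholder column names
df.to_csv(out_path, index=False)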
My ultimate goal is to merge the contents of a folder full of .xlsx files into one big file.
I thought the below code would suffice, but it only does the first file, and I can't figure out why it stops there. The files are small (~6 KB), so it shouldn't be a matter of waiting. If I print f_list, it shows the complete list of files. So, where am I going wrong? To be clear, there is no error returned, it just does not do the entire for loop. I feel like there should be a simple fix, but being new to Python and coding, I'm having trouble seeing it.
I'm doing this with Anaconda on Windows 8.
import pandas as pd
import glob
f_list = glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx") # creates my file list
all_data = pd.DataFrame() # creates my DataFrame
for f in f_list: # basic for loop to go through file list but doesn't
    df = pd.read_excel(f) # reads .xlsx file
    all_data = all_data.append(df) # appends file contents to DataFrame
all_data.to_excel("output.xlsx") # creates new .xlsx
Edit with new information:
After trying some of the suggested changes, I noticed the output claiming the files are empty, except for 1 of them which is slightly larger than the others. If I put them into the DataFrame, it claims the DataFrame is empty. If I put it into the dict, it claims there are no values associated. Could this have something to do with the file size? Many, if not most, of these files have 3-5 rows with 5 columns. The one it does see has 12 rows.
I strongly recommend reading the DataFrames into a dict:
sheets = {f: pd.read_excel(f) for f in f_list}
For one thing this is very easy to debug: just inspect the dict in the REPL.
Another is that you can then concat these into one DataFrame efficiently in one pass:
pd.concat(sheets.values())
Note: This is significantly faster than append, which has to allocate a temporary DataFrame at each append-call.
Another possible issue is that your glob may not be picking up all the files; check this by printing f_list.
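For completeness, here is a compact sketch of the dict-plus-concat approach, reusing the glob pattern from the question:
import glob

import pandas as pd

f_list = glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx")
print(f_list)  # confirm the glob actually finds every workbook

# One read_excel call per file, keyed by path so each frame is easy to inspect
sheets = {f: pd.read_excel(f) for f in f_list}
for f, df in sheets.items():
    print(f, df.shape)  # spot empty frames immediately

all_data = pd.concat(sheets.values(), ignore_index=True)
all_data.to_excel("output.xlsx", index=False)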