I'm trying to read some Excel files as pandas DataFrames. The problem is that they are quite large (about 2500 rows, columns up to the 'CYK' label in the Excel sheet, and there are 14 of them).
Every time I run my program, it has to import the files from Excel again. This makes the runtime grow a lot: it's currently a bit more than 15 minutes, and as of now the program doesn't even do anything significant except import the files.
I would like to be able to import the files just once, then save the dataframe objects somewhere and make my program work only on those dataframes.
Any suggestions?
This is the code I have developed so far:
import pandas as pd
import os

path = r'C:/Users/damia/Dropbox/Tesi/WIOD'
dirs = os.listdir(path)

complete_dirs = []
for f in dirs:
    complete_dirs.append(path + r"/" + f)

data = []
for el in complete_dirs:
    wiod = pd.read_excel(el, engine='pyxlsb')
    data.append(wiod)
If anyone is interested, you can find the files I'm trying to read at this link:
http://www.wiod.org/database/wiots16
You could use the to_pickle and read_pickle methods provided by pandas to serialize the dataframes and store them in files (see the pandas docs for details).
Example pickling:
data = []
pickle_paths = []
for el in complete_dirs:
    wiod = pd.read_excel(el, engine='pyxlsb')
    # here's where you store it
    pickle_loc = 'your_unique_path_to_save_this_frame'
    wiod.to_pickle(pickle_loc)
    pickle_paths.append(pickle_loc)
    data.append(wiod)
Depickling:
data = []
for el in pickle_paths:
    data.append(pd.read_pickle(el))
Another solution using to_pickle and read_pickle.
As an aside, you can read Excel files directly from URLs if you don't want to save to your drive first.
# read each file from the URL and save it to disk as a pickle
for year in range(2000, 2015):
    pd.read_excel(f"http://www.wiod.org/protected3/data16/wiot_ROW/WIOT{year}_Nov16_ROW.xlsb", engine='pyxlsb').to_pickle(f"{year}.pkl")

# read the pickle files from disk into a single dataframe
data = list()
for year in range(2000, 2015):
    data.append(pd.read_pickle(f"{year}.pkl"))
data = pd.concat(data)
Hope you can help me.
I have a folder with several .xlsx files of similar structure (note that some of the files might be bigger than 50 MB). I want to combine them all together and (eventually) send them to a database. But before that, I need to improve the performance of this block of code, because sometimes it takes a long time to process all those files.
The code in question is this:
df_list = []
for file in location:
    df_list.append(pd.read_excel(file, header=0, engine='openpyxl'))
df_concat = pd.concat(df_list)
Any suggestions?
Somewhere I read that converting Excel files to CSV might improve the performance, but should I do that before appending the files or after everything is concatenated?
And considering df_list is a list, can I do that conversion?
I've found a solution with xlsx2csv:
import glob
import re
import subprocess

import pandas as pd

xlsx_path = './data/Extract/'
csv_path = './data/csv/'
list_of_xlsx = glob.glob(xlsx_path + '*.xlsx')

for xlsx in list_of_xlsx:
    # Extract the file name in group 2 "(.+)"
    filename = re.search(r'(.+[\\|\/])(.+)(\.(xlsx))', xlsx).group(2)
    # Set up the call for subprocess.call()
    call = ["python", "./xlsx2csv.py", xlsx, csv_path + filename + '.csv']
    try:
        subprocess.call(call)  # On Windows use shell=True
    except Exception:
        print('Failed with {}'.format(xlsx))

outputcsv = './data/bigcsv.csv'  # specify filepath+filename of the output csv

listofdataframes = []
for file in glob.glob(csv_path + '*.csv'):
    df = pd.read_csv(file)
    if df.shape[1] == 24:  # make sure it has 24 columns
        listofdataframes.append(df)
    else:
        print('{} has {} columns - skipping'.format(file, df.shape[1]))

bigdataframe = pd.concat(listofdataframes).reset_index(drop=True)
bigdataframe.to_csv(outputcsv, index=False)
I tried to make this work for me but had no success. Maybe you can get it working for you? Or does anyone have any other ideas?
Reading Excel files is quite slow in pandas, as you stated; you should have a look at this answer. It basically uses a VBScript, run before the Python script, to convert the Excel files to CSV files, which are much faster for the Python script to read.
To be more specific and answer the second part of your question: you should convert the Excel files to CSV before loading them with pandas. The read_excel function is the slow part.
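As a rough sketch of that one-off conversion (the folder path and the single-sheet layout are assumptions, so adjust them to your files): read each workbook once with read_excel, dump it to CSV, and let later runs pay only the much cheaper read_csv cost.
import glob
import os

import pandas as pd

xlsx_folder = './data/Extract/'  # assumed location of the workbooks

for xlsx in glob.glob(xlsx_folder + '*.xlsx'):
    csv_file = os.path.splitext(xlsx)[0] + '.csv'
    if not os.path.exists(csv_file):  # convert only if we haven't already
        pd.read_excel(xlsx, header=0, engine='openpyxl').to_csv(csv_file, index=False)

# subsequent runs build the combined frame from the CSVs instead of the workbooks
df_concat = pd.concat(pd.read_csv(f) for f in glob.glob(xlsx_folder + '*.csv'))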
I'm currently running into a problem with the Apache NiFi ExecuteStreamCommand processor using Python. I have a script which reads a CSV, converts it into pandas DataFrames and afterwards into JSON. The script splits the CSV file into several DataFrames due to inconsistent naming of the columns. My current script looks as follows:
import pandas as pd
import sys

input = sys.stdin.readlines()
# searching for subfiles and saving them to a list called files ...

appendDataFrames = []
for dataFrame in range(len(files)):
    df = pd.DataFrame(files[dataFrame])
    # several improvements of DataFrame ...
    appendDataFrames.append(df)

output = pd.concat(appendDataFrames)
JSONOutPut = output.to_json(orient='records', date_format='iso', date_unit='s')
sys.stdout.write(JSONOutPut)
In the queue to my next processor I can now see one FlowFile as JSON (as expected).
My question is: is it possible to write each JSON to a separate FlowFile, so that my next processor is able to work on them separately? I need this because the next processor is an InferAvroSchema, and since all the JSONs have different schemas this won't work. Am I mistaken? Or is there a possible way to solve this?
The code below won't work, since it all ends up in the same FlowFile anyway and my InferAvroSchema is not able to handle that.
import pandas as pd
import sys

input = sys.stdin.readlines()
# searching for subfiles and saving them to a list called files ...

appendDataFrames = []
for dataFrame in range(len(files)):
    df = pd.DataFrame(files[dataFrame])
    # several improvements of DataFrame ...
    JSONOutPut = df.to_json(orient='records', date_format='iso', date_unit='s')
    sys.stdout.write(JSONOutPut)
Thanks in advance!
With ExecuteStreamCommand you can't split the output, because you have to write to stdout.
However, you could write a delimiter into the output and use a SplitContent processor with the same delimiter as the next processor.
I just modified my code as follows:
import pandas as pd
import sys

input = sys.stdin.readlines()
# searching for subfiles and saving them to a list called files ...

appendDataFrames = []
for dataFrame in range(len(files)):
    df = pd.DataFrame(files[dataFrame])
    # several improvements of DataFrame ...
    JSONOutPut = df.to_json(orient='records', date_format='iso', date_unit='s')
    sys.stdout.write(JSONOutPut)
    sys.stdout.write("#;#")
And added a SplitContent processor configured to split on the same "#;#" delimiter.
I have six .csv files. Their overall size is approximately 4 GB. I need to clean each one and do some data analysis tasks on it. These operations are the same for all the frames.
This is my code for reading them.
#df = pd.read_csv(r"yellow_tripdata_2018-01.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-02.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-03.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-04.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-05.csv")
df = pd.read_csv(r"yellow_tripdata_2018-06.csv")
Each time I run the kernel, I activate one of the files to be read.
I am looking for a more elegant way to do this. I thought about doing a for loop: making a list of file names and then reading them one after the other. But I don't want to merge them together, so I think another approach must exist. I have been searching for it, but it seems all the questions lead to concatenating the files at the end.
Use a for loop together with format, like this. I use this every single day:
number_of_files = 6

for i in range(1, number_of_files + 1):
    df = pd.read_csv("yellow_tripdata_2018-0{}.csv".format(i))
    # your code here; do the analysis, then the loop will return and read the next dataframe
You could use a list to hold all of the dataframes:
number_of_files = 6

dfs = []
for file_num in range(1, number_of_files + 1):
    dfs.append(pd.read_csv(f"yellow_tripdata_2018-0{file_num}.csv"))  # f-strings need Python 3.6+; on older versions use .format()
Then to get a certain dataframe use:
df1 = dfs[0]
Edit:
As you are trying to keep from loading all of these into memory, I'd resort to streaming them. Try changing the for loop to something like this:
import csv

dfs = []
for file_num in range(1, number_of_files + 1):
    # keep the handle open so the reader can stream later; close it when you're done
    f = open(f"yellow_tripdata_2018-0{file_num}.csv", 'r', newline='')
    dfs.append(csv.reader(f))
Then just use a for loop over dfs[n] or next(dfs[n]) to read each line into memory.
P.S.
You may need multi-threading to iterate through each one at the same time.
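If it comes to that, a thread pool is the simplest thing to try. This is only a sketch: clean_file is a hypothetical placeholder for your per-file cleaning, and it reads whole files with pandas for brevity rather than streaming them.
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def clean_file(path):
    # hypothetical per-file work; replace the body with your own cleaning/analysis
    df = pd.read_csv(path)
    # ... cleaning ...
    df.to_csv(path.replace(".csv", "-new.csv"), index=False)
    return path

paths = [f"yellow_tripdata_2018-{i:02d}.csv" for i in range(1, 7)]
with ThreadPoolExecutor(max_workers=3) as pool:
    for finished in pool.map(clean_file, paths):
        print(f"done with {finished}")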
Loading/Editing/Saving - using the csv module
OK, so I've done a lot of research: Python's csv module does load one line at a time, most likely because of the mode we open the file in (explained here).
If you don't want to use pandas (chunking may honestly be the answer; just implement it into #seralouk's answer if so), then yes! The approach below is, in my mind, the best one; we just need to change a couple of things.
import csv

number_of_files = 6
filename = "yellow_tripdata_2018-{}.csv"

for file_num in range(1, number_of_files + 1):
    # notice I'm opening the original file as f in mode 'r' for read only
    # and the new file as nf in mode 'a' for append
    with open(filename.format(str(file_num).zfill(2)), 'r', newline='') as f, \
         open(filename.format(str(file_num).zfill(2) + "-new"), 'a', newline='') as nf:
        # initialize the writer before looping over every line
        w = csv.writer(nf)
        for row in csv.reader(f):
            # do your "data cleaning" (THIS IS PER-LINE, REMEMBER)
            # save to file
            w.writerow(row)
Note:
You may want to consider using a DictReader and/or DictWriter; I prefer them over the regular reader/writer as I find them easier to understand (a quick sketch follows).
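A minimal sketch of the same per-line loop with DictReader/DictWriter; the file name and the column name in the comment are just illustrative, since I don't know your headers:
import csv

with open("yellow_tripdata_2018-01.csv", "r", newline="") as f, \
     open("yellow_tripdata_2018-01-new.csv", "w", newline="") as nf:
    reader = csv.DictReader(f)
    writer = csv.DictWriter(nf, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # each row is a dict keyed by the header, e.g. row["passenger_count"]
        # do your per-line cleaning on the dict, then write it out
        writer.writerow(row)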
Pandas Approach - using chunks
PLEASE READ this answer if you'd like to steer away from my csv approach and stick with pandas :) It literally seems like the same issue as yours, and the answer is what you're asking for.
Basically, pandas allows you to load a file partially as chunks, execute any alterations, then write those chunks to a new file. Below is largely from that answer, but I did some more reading up myself in the docs.
number_of_files = 6
chunksize = 500  # find the chunksize that works best for you
filename = "yellow_tripdata_2018-{}.csv"

for file_num in range(1, number_of_files + 1):
    for chunk in pd.read_csv(filename.format(str(file_num).zfill(2)), chunksize=chunksize):
        # Do your data cleaning
        # see, again we write in append mode, so the new file is built up chunk by chunk
        # (note: mode='a' repeats the header per chunk; pass header=False after the first chunk if that matters)
        chunk.to_csv(filename.format(str(file_num).zfill(2) + "-new"), mode='a', index=False)
For more info on chunking the data, see here as well; it's good reading for anyone getting headaches over these memory issues.
Use glob.glob to get all files with similar names:
import glob
files = glob.glob("yellow_tripdata_2018-0?.csv")
for f in files:
    df = pd.read_csv(f)
    # manipulate df
    df.to_csv(f, index=False)
This will match yellow_tripdata_2018-0<any one character>.csv. You can also use yellow_tripdata_2018-0*.csv to match yellow_tripdata_2018-0<anything>.csv or even yellow_tripdata_*.csv to match all csv files that start with yellow_tripdata.
Note that this also only loads one file at a time.
Use os.listdir() to make a list of files you can loop through?
import os
import pandas as pd

samplefiles = os.listdir(filepath)
for filename in samplefiles:
    df = pd.read_csv(os.path.join(filepath, filename))
where filepath is the directory containing multiple csv's?
Or a loop that changes the filename:
for i in range(1, 7):
    df = pd.read_csv(r"yellow_tripdata_2018-0%s.csv" % str(i))
# import libraries
import pandas as pd
import glob

# store the project folder path in a variable
project_folder = r"C:\file_path"

# save all file paths in a variable
all_files_paths = glob.glob(project_folder + "/*.csv")

# use a list comprehension to read each file into a list of dataframes
li = [pd.read_csv(filename, index_col=None, header=0) for filename in all_files_paths]

# concatenate the list into a single pandas dataframe
df = pd.concat(li, axis=0, ignore_index=True)
I have a 2 column CSV with download links in the first column and company symbols in the second column. For example:
http://data.com/data001.csv, BHP
http://data.com/data001.csv, TSA
I am trying to loop through the list so that Python opens each CSV via the download link and saves it separately as the company name. Therefore each file should be downloaded and saved as follows:
BHP.csv
TSA.csv
Below is the code I am using. It currently exports the entire CSV into a single row tabbed format, then loops back and does it again and again in an infinite loop.
import pandas as pd

data = pd.read_csv('download_links.csv', names=['download', 'symbol'])
file = pd.DataFrame()
cache = []

for d in data.download:
    df = pd.read_csv(d, index_col=None, header=0)
    cache.append(df)

file = pd.DataFrame(cache)

for s in data.symbol:
    file.to_csv(s + '.csv')

print("done")
Up until I convert the list 'cache' into the DataFrame 'file' to export it, the data is formatted perfectly. It's only when it gets converted to a DataFrame that the trouble starts.
I'd love some help on this one as I've been stuck on it for a few hours.
import pandas as pd

# the file has no header row, so name the columns explicitly
data = pd.read_csv('download_links.csv', names=['download', 'symbol'])
links = data.download
file_names = data.symbol

for link, file_name in zip(links, file_names):
    pd.read_csv(link).to_csv(file_name + '.csv', index=False)
Iterate over both fields in parallel:
for download, symbol in data.itertuples(index=False):
    df = pd.read_csv(download, index_col=None, header=0)
    df.to_csv('{}.csv'.format(symbol))
I have a folder with a large number of Excel workbooks. Is there a way to convert every file in this folder into a CSV file using Python's xlrd, xlutils, and XlsxWriter?
I would like the newly converted CSV files to have the extension '_convert.csv'.
OTHERWISE...
Is there a way to merge all the Excel workbooks in the folder to create one large file?
I've been searching for ways to do both, but nothing has worked...
Using pywin32, this will find all the .xlsx files in the indicated directory and open and resave them as .csv. It is relatively easy to figure out the right commands with pywin32...just record an Excel macro and perform the open/save manually, then look at the resulting macro.
import os
import glob
import win32com.client
xl = win32com.client.gencache.EnsureDispatch('Excel.Application')
for f in glob.glob('tmp/*.xlsx'):
    fullname = os.path.abspath(f)
    xl.Workbooks.Open(fullname)
    xl.ActiveWorkbook.SaveAs(Filename=fullname.replace('.xlsx', '.csv'),
                             FileFormat=win32com.client.constants.xlCSVMSDOS,
                             CreateBackup=False)
    xl.ActiveWorkbook.Close(SaveChanges=False)
I will give it a try with my library, pyexcel:
from pyexcel import Book, BookWriter
import glob
import os
for f in glob.glob("your_directory/*.xlsx"):
    fullname = os.path.abspath(f)
    converted_filename = fullname.replace(".xlsx", "_converted.csv")
    book = Book(f)
    converted_csvs = BookWriter(converted_filename)
    converted_csvs.write_book_reader(book)
    converted_csvs.close()
If you have an xlsx file with more than 2 sheets, I imagine you will get more than 2 csv files generated. The naming convention is: "file_converted_%s.csv" % your_sheet_name. The script will save all converted csv files in the same directory where you had the xlsx files.
In addition, if you want to merge all in one, it is super easy as well.
from pyexcel.cookbook import merge_all_to_a_book
import glob
merge_all_to_a_book(glob.glob("your_directory/*.xlsx"), "output.xlsx")
If you want to do more, please read the tutorial
Have a look at OpenOffice's Python library; I suspect OpenOffice can handle MS document files.
Python itself has no native support for Excel files.
Sure. Iterate over your files using something like glob and feed them into one of the modules you mention. With xlrd, you'd use open_workbook to open each file by name. That will give you back a Book object. You'll then want to have nested loops that iterate over each Sheet object in the Book, each row in the Sheet, and each Cell in the Row. If your rows aren't too wide, you can append each Cell in a Row into a Python list and then feed that list to the writerow method of a csv.writer object.
Since it's a high-level question, this answer glosses over some specifics like how to call xlrd.open_workbook and how to create a csv.writer. Hopefully googling for examples on those specific points will get you where you need to go.
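To make that description a bit more concrete, here is a rough sketch of those nested loops, not a drop-in solution. Note that current xlrd (2.x) only reads .xls files; for .xlsx you'd need an older xlrd release or a library such as openpyxl. The "_convert.csv" suffix follows the naming you asked for.
import csv
import glob
import xlrd

for path in glob.glob("*.xls"):
    book = xlrd.open_workbook(path)
    for sheet in book.sheets():
        out_name = "{}_{}_convert.csv".format(path.rsplit(".", 1)[0], sheet.name)
        with open(out_name, "w", newline="") as out:
            writer = csv.writer(out)
            for row_idx in range(sheet.nrows):
                # collect this row's cell values into a list and hand it to csv.writer
                writer.writerow([cell.value for cell in sheet.row(row_idx)])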
You can use this function to read the data from each file
import xlrd

def getXLData(Filename, min_row_len=1, get_datemode=False, sheetnum=0):
    Data = []
    book = xlrd.open_workbook(Filename)
    sheet = book.sheets()[sheetnum]
    rowcount = 0
    while rowcount < sheet.nrows:
        row = sheet.row_values(rowcount)
        if len(row) >= min_row_len:
            Data.append(row)
        rowcount += 1
    if get_datemode:
        return Data, book.datemode
    else:
        return Data
and this function to write the data after you combine the lists together
import csv

def writeCSVFile(filename, data, headers=[]):
    if headers:
        temp = [headers]
        temp.extend(data)
        data = temp
    # open in text mode with newline='' so csv.writer works under Python 3
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(data)
Keep in mind you may have to re-format the data, especially if there are dates or integers in the Excel files since they're stored as floating point numbers.
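If you do hit the date issue, xlrd ships a helper for turning those floats back into datetimes. Here is a small sketch using the datemode that getXLData can return; the file name and the date column index are just examples.
from xlrd.xldate import xldate_as_datetime

data, datemode = getXLData("example.xls", get_datemode=True)  # "example.xls" is illustrative
date_col = 0  # assume the dates live in the first column
for row in data[1:]:  # skip the header row
    row[date_col] = xldate_as_datetime(row[date_col], datemode)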
Edited to add code calling the above functions:
import glob

filelist = glob.glob("*.xls*")
alldata = []
headers = []

for filename in filelist:
    data = getXLData(filename)
    headers = data.pop(0)  # omit this line if files do not have a header row
    alldata.extend(data)

writeCSVFile("Output.csv", alldata, headers)