Hope you can help me.
I have a folder containing several .xlsx files with a similar structure (note that some of the files may be larger than 50 MB). I want to combine them all and (eventually) send them to a database. But before that, I need to improve the performance of this block of code, because sometimes it takes a long time to process all those files.
The code in question is this:
df_list = []
for file in location:
    df_list.append(pd.read_excel(file, header=0, engine='openpyxl'))
df_concat = pd.concat(df_list)
Any suggestions?
Somewhere I read that converting Excel files to CSV might improve the performance, but should I do that before appending the files or after everything is concatenated?
And considering df_list is a list, can I do that conversion?
I've found a solution with xlsx2csv
import glob
import re
import subprocess

import pandas as pd

xlsx_path = './data/Extract/'
csv_path = './data/csv/'
list_of_xlsx = glob.glob(xlsx_path + '*.xlsx')

for xlsx in list_of_xlsx:
    # Extract the file name (without path or extension) from group 2 "(.+)"
    filename = re.search(r'(.+[\\|\/])(.+)(\.(xlsx))', xlsx).group(2)
    # Set up the call for subprocess.call()
    call = ["python", "./xlsx2csv.py", xlsx, csv_path + filename + '.csv']
    try:
        subprocess.call(call)  # On Windows use shell=True
    except OSError:
        print('Failed with {}'.format(xlsx))

outputcsv = './data/bigcsv.csv'  # specify filepath+filename of output csv
listofdataframes = []
for file in glob.glob(csv_path + '*.csv'):
    df = pd.read_csv(file)
    if df.shape[1] == 24:  # make sure there are 24 columns
        listofdataframes.append(df)
    else:
        print('{} has {} columns - skipping'.format(file, df.shape[1]))

bigdataframe = pd.concat(listofdataframes).reset_index(drop=True)
bigdataframe.to_csv(outputcsv, index=False)
I tried to make this work for me but had no success. Maybe you can get it working, or does anyone have any other ideas?
Reading Excel files is quite slow in pandas, as you stated; you should have a look at this answer. It basically uses a VBScript before running the Python script to convert the Excel files to CSV files, which are much faster for the Python script to read.
To be more specific and answer the second part of your question, you should convert the Excel files to CSV before loading them with pandas. The read_excel function is the slow part.
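If you want to keep everything in one script, the xlsx2csv package also exposes a Python API (the Xlsx2csv class). A minimal sketch of the convert-then-read idea, reusing the xlsx_path/csv_path folders from your code (treat this as a sketch and adjust to your xlsx2csv version):
import glob
import os

import pandas as pd
from xlsx2csv import Xlsx2csv

xlsx_path = './data/Extract/'
csv_path = './data/csv/'

# convert every workbook to csv once; the csv files are much faster to re-read
for xlsx in glob.glob(os.path.join(xlsx_path, '*.xlsx')):
    name = os.path.splitext(os.path.basename(xlsx))[0]
    Xlsx2csv(xlsx, outputencoding='utf-8').convert(os.path.join(csv_path, name + '.csv'))

# then read the csv files and concatenate them
df_concat = pd.concat(
    (pd.read_csv(f) for f in glob.glob(os.path.join(csv_path, '*.csv'))),
    ignore_index=True,
)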
The following code converts .dat files into data frames with the use of their dictionary files in .dct format. It works well. But my problem is that I was unable to automate this process; creating a loop that takes the pairs of these files from lists is a little bit tricky, at least for me. I could really use some help with that.
try:
    from statadict import parse_stata_dict
except ImportError:
    !pip install statadict

import pandas as pd
from statadict import parse_stata_dict
dict_file = '2015_2017_FemPregSetup.dct'
data_file = '2015_2017_FemPregData.dat'
stata_dict = parse_stata_dict(dict_file)
stata_dict
nsfg = pd.read_fwf(data_file,
                   names=stata_dict.names,
                   colspecs=stata_dict.colspecs)
# nsfg is now a pandas DataFrame
These are the lists of files that I would like to convert into data frames. Every .dat file has its own dictionary file:
dat_name = ['2002FemResp.dat',
            '2002Male.dat'...

dct_name = ['2002FemResp.dct',
            '2002Male.dct'...
Assuming both lists have the same length and that you want to save each dataframe as a CSV, you could try:
c = 0
for dat, dct in zip(dat_name, dct_name):
    c += 1
    stata_dict = parse_stata_dict(dct)
    df = pd.read_fwf(dat, names=stata_dict.names, colspecs=stata_dict.colspecs)
    df.to_csv(r'path_name\file_name_{}.csv'.format(c))  # don't forget the '.csv'!
Also consider that if you are not using Windows you need to use '/' rather than '\' in your path (or you can use os.path.join() to avoid this issue).
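For reference, a minimal sketch of the same loop built with os.path.join (the 'csv_out' output folder name is just a placeholder, and dat_name/dct_name are the lists from the question):
import os

import pandas as pd
from statadict import parse_stata_dict

out_dir = 'csv_out'  # placeholder output folder
for c, (dat, dct) in enumerate(zip(dat_name, dct_name), start=1):
    stata_dict = parse_stata_dict(dct)
    df = pd.read_fwf(dat, names=stata_dict.names, colspecs=stata_dict.colspecs)
    # os.path.join picks the right separator for the current OS
    df.to_csv(os.path.join(out_dir, 'file_name_{}.csv'.format(c)))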
I have six .csv files. Their overall size is approximately 4 GB. I need to clean each one and do some data analysis tasks on each. These operations are the same for all the frames.
This is my code for reading them.
#df = pd.read_csv(r"yellow_tripdata_2018-01.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-02.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-03.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-04.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-05.csv")
df = pd.read_csv(r"yellow_tripdata_2018-06.csv")
Each time I run the kernel, I activate one of the files to be read.
I am looking for a more elegant way to do this. I thought about doing a for loop: making a list of file names and then reading them one after the other. But I don't want to merge them together, so I think another approach must exist. I have been searching for it, but it seems all the questions lead to concatenating the files at the end.
Use a for loop and str.format like this; I use this every single day:
number_of_files = 6

for i in range(1, number_of_files + 1):
    df = pd.read_csv("yellow_tripdata_2018-0{}.csv".format(i))
    # your code here: do the analysis, then the loop will come back and read the next dataframe
You could use a list to hold all of the dataframes:
number_of_files = 6
dfs = []
for file_num in range(1, number_of_files + 1):
    # I use Python 3.6, so I'm used to f-strings now. If you're using Python <3.6, use .format()
    dfs.append(pd.read_csv(f"yellow_tripdata_2018-0{file_num}.csv"))
Then to get a certain dataframe use:
df1 = dfs[0]
Edit:
As you are trying to avoid loading all of these into memory at once, I'd resort to streaming them. Try changing the for loop to something like this:
import csv

for file_num in range(1, number_of_files + 1):
    # open for reading (not 'wb'!) and keep the handle open so the reader can keep streaming lines
    f = open(f"yellow_tripdata_2018-0{file_num}.csv", 'r', newline='')
    dfs.append(csv.reader(f))
Then just use a for loop over dfs[n] or next(dfs[n]) to read each line into memory.
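As a quick usage example (assuming dfs was filled by the loop above):
header = next(dfs[0])  # the first line of the first file: the column names
for row in dfs[0]:     # each iteration parses one more line into a list of strings
    pass               # do the per-line work here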
P.S.
You may need multi-threading to iterate through each one at the same time.
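If you do try that, here is a minimal sketch of the multi-threading idea using the standard-library concurrent.futures module (the process() body and max_workers value are placeholders):
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def process(path):
    df = pd.read_csv(path)
    # ... placeholder: do the per-file cleaning / analysis here ...
    return path

paths = [f"yellow_tripdata_2018-0{i}.csv" for i in range(1, 7)]
with ThreadPoolExecutor(max_workers=3) as pool:
    for finished in pool.map(process, paths):
        print(f"finished {finished}")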
Loading/Editing/Saving - using the csv module
Ok, so I've done a lot of research: Python's csv module does load one line at a time, which is most likely down to the mode we are opening the file in (explained here).
If you don't want to use pandas (chunking may honestly be the answer - just implement that into #seralouk's answer if so), then yes! The approach below is, in my mind, the best one; we just need to change a couple of things.
import csv

number_of_files = 6
filename = "yellow_tripdata_2018-{}.csv"

for file_num in range(1, number_of_files + 1):
    # notice I'm opening the original file as f in mode 'r' for read only
    # and the new file as nf in mode 'a' for append
    with open(filename.format(str(file_num).zfill(2)), 'r', newline='') as f, \
         open(filename.format(str(file_num).zfill(2) + "-new"), 'a', newline='') as nf:
        # initialize the writer before looping over every line
        w = csv.writer(nf)
        for row in csv.reader(f):
            # do your "data cleaning" (THIS IS PER-LINE, REMEMBER)
            # save to file
            w.writerow(row)
Note:
You may want to consider using a DictReader and/or DictWriter, I'd prefer them over regular reader/writers as I find them easier to understand.
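For reference, a minimal sketch of the same per-line loop using DictReader/DictWriter (one input/output file pair shown; the output name is a placeholder and the field names come from the input header):
import csv

in_name = "yellow_tripdata_2018-01.csv"       # example input file
out_name = "yellow_tripdata_2018-01-new.csv"  # placeholder output name

with open(in_name, 'r', newline='') as f, open(out_name, 'w', newline='') as nf:
    reader = csv.DictReader(f)
    writer = csv.DictWriter(nf, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # row is a dict keyed by column name; do the per-line cleaning here
        writer.writerow(row)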
Pandas Approach - using chunks
PLEASE READ this answer if you'd like to steer away from my csv approach and stick with pandas :) It seems to be exactly the same issue as yours, and the answer there is what you're asking for.
Basically, pandas allows you to partially load a file as chunks, execute any alterations, and then write those chunks out to a new file. The code below is largely from that answer, but I did some more reading up myself in the docs.
number_of_files = 6
chunksize = 500  # find the chunksize that works best for you
filename = "yellow_tripdata_2018-{}.csv"

for file_num in range(1, number_of_files + 1):
    for i, chunk in enumerate(pd.read_csv(filename.format(str(file_num).zfill(2)), chunksize=chunksize)):
        # Do your data cleaning
        # append mode builds the new file chunk by chunk; write the header only for the first chunk
        chunk.to_csv(filename.format(str(file_num).zfill(2) + "-new"), mode='a', header=(i == 0), index=False)
For more info on chunking the data, see here as well; it's good reading for anyone getting headaches over these memory issues.
Use glob.glob to get all files with similar names:
import glob

files = glob.glob("yellow_tripdata_2018-0?.csv")
for f in files:
    df = pd.read_csv(f)
    # manipulate df
    df.to_csv(f)
This will match yellow_tripdata_2018-0<any one character>.csv. You can also use yellow_tripdata_2018-0*.csv to match yellow_tripdata_2018-0<anything>.csv or even yellow_tripdata_*.csv to match all csv files that start with yellow_tripdata.
Note that this also only loads one file at a time.
Use os.listdir() to make a list of files you can loop through?
samplefiles = os.listdir(filepath)
for filename in samplefiles:
    # os.listdir returns bare file names, so join them with the directory
    df = pd.read_csv(os.path.join(filepath, filename))
where filepath is the directory containing multiple csv's?
Or a loop that changes the filename:
for i in range(1, 7):
    df = pd.read_csv("yellow_tripdata_2018-0%s.csv" % i)
# import libraries
import pandas as pd
import glob

# store the folder path in a variable
project_folder = r"C:\file_path"

# save all file paths in a variable
all_files_paths = glob.glob(project_folder + "/*.csv")

# use a list comprehension to read every file into a list of dataframes
li = [pd.read_csv(filename, index_col=None, header=0) for filename in all_files_paths]

# convert the list into a single pandas dataframe
df = pd.concat(li, axis=0, ignore_index=True)
I want to convert a stream of JSONs (nearly 10,000) pasted in a file to a CSV file with a particular format for headers and values.
I have the following stream of JSON data:
{"shortUrlClicks":"594","longUrlClicks":"594","countries":[{"count":"125","id":"IQ"},{"count":"94","id":"US"},{"count":"56","id":"TR"},{"count":"50","id":"SA"},{"count":"29","id":"DE"},{"count":"24","id":"TN"},{"count":"20","id":"DZ"},{"count":"14","id":"EG"},{"count":"13","id":"MA"},{"count":"12","id":"PS"}],"browsers":[{"count":"350","id":"Chrome"},{"count":"100","id":"Firefox"},{"count":"46","id":"Safari"},{"count":"35","id":"Mobile"},{"count":"20","id":"Mobile Safari"},{"count":"20","id":"SamsungBrowser"},{"count":"8","id":"MSIE"},{"count":"6","id":"Opera"},{"count":"3","id":"OS;FBSV"},{"count":"2","id":"Maxthon"}],"platforms":[{"count":"227","id":"Android"},{"count":"221","id":"Windows"},{"count":"67","id":"iPhone"},{"count":"30","id":"X11"},{"count":"25","id":"Macintosh"},{"count":"8","id":"iPad"},{"count":"2","id":"Android 4.2.2"},{"count":"1","id":"Android 4.1.2"},{"count":"1","id":"Android 4.3"},{"count":"1","id":"Android 5.0.1"}],"referrers":[{"count":"340","id":"unknown"},{"count":"193","id":"t.co"},{"count":"38","id":"m.facebook.com"},{"count":"12","id":"addpost.it"},{"count":"4","id":"plus.google.com"},{"count":"3","id":"www.facebook.com"},{"count":"1","id":"goo.gl"},{"count":"1","id":"l.facebook.com"},{"count":"1","id":"lm.facebook.com"},{"count":"1","id":"plus.url.google.com"}]}
{"shortUrlClicks":"594","longUrlClicks":"594","countries":[{"count":"125","id":"IQ"},{"count":"94","id":"US"},{"count":"56","id":"TR"},{"count":"50","id":"SA"},{"count":"29","id":"DE"},{"count":"24","id":"TN"},{"count":"20","id":"DZ"},{"count":"14","id":"EG"},{"count":"13","id":"MA"},{"count":"12","id":"PS"}],"browsers":[{"count":"350","id":"Chrome"},{"count":"100","id":"Firefox"},{"count":"46","id":"Safari"},{"count":"35","id":"Mobile"},{"count":"20","id":"Mobile Safari"},{"count":"20","id":"SamsungBrowser"},{"count":"8","id":"MSIE"},{"count":"6","id":"Opera"},{"count":"3","id":"OS;FBSV"},{"count":"2","id":"Maxthon"}],"platforms":[{"count":"227","id":"Android"},{"count":"221","id":"Windows"},{"count":"67","id":"iPhone"},{"count":"30","id":"X11"},{"count":"25","id":"Macintosh"},{"count":"8","id":"iPad"},{"count":"2","id":"Android 4.2.2"},{"count":"1","id":"Android 4.1.2"},{"count":"1","id":"Android 4.3"},{"count":"1","id":"Android 5.0.1"}],"referrers":[{"count":"340","id":"unknown"},{"count":"193","id":"t.co"},{"count":"38","id":"m.facebook.com"},{"count":"12","id":"addpost.it"},{"count":"4","id":"plus.google.com"},{"count":"3","id":"www.facebook.com"},{"count":"1","id":"goo.gl"},{"count":"1","id":"l.facebook.com"},{"count":"1","id":"lm.facebook.com"},{"count":"1","id":"plus.url.google.com"}]}
... and so on.
I want to convert it into CSV form, with whatever the headers are (shortUrlClicks, longUrlClicks, etc.).
I would be thankful if you could please help me with this. Any code in Python or any other language would be useful.
You can use the json module from the Python standard library to parse the JSON, and the built-in file functions to read and write the files.
It would be something like this:
import json

# read the file: one JSON object per line
with open('file.json', 'r') as f:
    items = [json.loads(line) for line in f if line.strip()]

csv_lines = []
for row in items:
    # pick the flat columns here; the nested lists (countries, browsers, ...) need their own handling
    columns = [row.get('shortUrlClicks', ''), row.get('longUrlClicks', '')]
    # finished row: join the values with commas
    csv_lines.append(','.join(columns))

# write the csv file
with open('out.csv', 'w') as out:
    out.write('shortUrlClicks,longUrlClicks\n')
    out.write('\n'.join(csv_lines) + '\n')
PS: I didn't run this code before posting; it's just to give you an idea.
You can use Pandas to do this for you.
read your JSON file like this
df = pandas.read_json('filename.json')
write to csv
df.to_csv('filename.csv', index=False)  # set index=False if you don't need it
Example:
http://hayd.github.io/2013/pandas-json
REF:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
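One note: the data in the question is one JSON object per line (JSON Lines), so a minimal sketch, assuming the file is called filename.json, would pass lines=True:
import pandas as pd

# lines=True tells pandas each line is a separate JSON object
df = pd.read_json('filename.json', lines=True)

# the nested list columns (countries, browsers, ...) stay as Python objects;
# the flat columns such as shortUrlClicks/longUrlClicks export cleanly
df.to_csv('filename.csv', index=False)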
I am using Spark 1.3.1 (PySpark) and I have generated a table using a SQL query. I now have an object that is a DataFrame. I want to export this DataFrame object (I have called it "table") to a csv file so I can manipulate it and plot the columns. How do I export the DataFrame "table" to a csv file?
Thanks!
If the data frame fits in driver memory and you want to save it to the local file system, you can convert the Spark DataFrame to a local pandas DataFrame using the toPandas method and then simply use to_csv:
df.toPandas().to_csv('mycsv.csv')
Otherwise you can use spark-csv:
Spark 1.3
df.save('mycsv.csv', 'com.databricks.spark.csv')
Spark 1.4+
df.write.format('com.databricks.spark.csv').save('mycsv.csv')
In Spark 2.0+ you can use csv data source directly:
df.write.csv('mycsv.csv')
For Apache Spark 2+, in order to save the dataframe into a single csv file, use the following command:
query.repartition(1).write.csv("cc_out.csv", sep='|')
Here 1 indicates that I need only one partition for the csv. You can change it according to your requirements.
If you cannot use spark-csv, you can do the following:
df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("file.csv")
If you need to handle strings with line breaks or commas, that will not work. Use this:
import csv
import cStringIO

def row2csv(row):
    buffer = cStringIO.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([str(s).encode("utf-8") for s in row])
    buffer.seek(0)
    return buffer.read().strip()

df.rdd.map(row2csv).coalesce(1).saveAsTextFile("file.csv")
You need to repartition the DataFrame into a single partition, then define the format, path, and other parameters for the file in Unix file-system format, and here you go:
df.repartition(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv",header = 'true')
Read more about the repartition function
Read more about the save function
However, repartition is a costly function and toPandas() is the worst. Try using .coalesce(1) instead of .repartition(1) in the previous syntax for better performance.
Read more on repartition vs coalesce functions.
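For example, a minimal sketch of that coalesce variant, using the same hypothetical path and spark-csv syntax as above:
# coalesce(1) merges the existing partitions without the full shuffle that repartition(1) triggers
df.coalesce(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv", header='true')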
Using PySpark
The easiest way to write to csv in Spark 3.0+:
sdf.write.csv("/path/to/csv/data.csv")
This can generate multiple files, based on the number of Spark nodes you are using. In case you want to get it in a single file, use repartition:
sdf.repartition(1).write.csv("/path/to/csv/data.csv")
Using Pandas
If your data is not too large and can be held in local Python memory, then you can make use of pandas too:
sdf.toPandas().to_csv("/path/to/csv/data.csv", index=False)
Using Koalas
sdf.to_koalas().to_csv("/path/to/csv/data.csv", index=False)
How about this (in case you don't want a one-liner)?
for row in df.collect():
    d = row.asDict()
    s = "%d\t%s\t%s\n" % (d["int_column"], d["string_column"], d["string_column"])
    f.write(s)
f is an open file handle. Also, the separator is a TAB character, but it's easy to change it to whatever you want.
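For completeness, a self-contained sketch of the same idea, with a placeholder output file name and the hypothetical column names from the snippet above:
# the column names and the output path are placeholders; adapt them to your schema
with open("table_rows.tsv", "w") as f:
    for row in df.collect():
        d = row.asDict()
        f.write("%d\t%s\t%s\n" % (d["int_column"], d["string_column"], d["string_column"]))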
I am late to the party, but this will let me rename the file, move it to a desired directory, and delete the unwanted additional directory Spark made:
import shutil
import os
import glob
path = 'test_write'
#write single csv
students.repartition(1).write.csv(path)
#rename and relocate the csv
shutil.move(glob.glob(os.getcwd() + '\\' + path + '\\' + r'*.csv')[0], os.getcwd()+ '\\' + path+ '.csv')
#remove additional directory
shutil.rmtree(os.getcwd()+'\\'+path)
Try display(df) and use the download option in the results. Please note: only 1 million rows can be downloaded with this option, but it's really quick.
I used the method with pandas and it gave me horrible performance. In the end it took so long that I gave up and looked for another method.
If you are looking for a way to write to one csv instead of multiple csvs, this is what you want:
df.coalesce(1).write.csv("train_dataset_processed", header=True)
It reduced processing my dataset from over 2 hours to 2 minutes.