Pandas: Continuously write from function to csv - python

I have a function set up for Pandas that runs through a large number of rows in input.csv and inputs the results into a Series. It then writes the Series to output.csv.
However, if the process is interrupted (for example by an unexpected event) the program will terminate and all data that would have gone into the csv is lost.
Is there a way to write the data continuously to the csv, regardless of whether the function finishes for all rows?
Preferably, each time the program starts, a blank output.csv is created that is appended to while the function runs.
import pandas as pd

df = pd.read_csv("read.csv")

def crawl(a):
    #Create x, y
    return pd.Series([x, y])

df[["Column X", "Column Y"]] = df["Column A"].apply(crawl)
df.to_csv("write.csv", index=False)

This is a possible solution that will append the data to a new file as it reads the csv in chunks. If the process is interrupted the new file will contain all the information up until the interruption.
import pandas as pd

# csv file to be read in
in_csv = '/path/to/read/file.csv'

# csv to write data to
out_csv = 'path/to/write/file.csv'

# get the number of lines of the csv file to be read
number_lines = sum(1 for row in open(in_csv))

# size of chunks of data to write to the csv
chunksize = 10

# start looping through the data, writing it to a new file for each chunk
for i in range(1, number_lines, chunksize):
    df = pd.read_csv(in_csv,
                     header=None,
                     nrows=chunksize,   # number of rows to read at each loop
                     skiprows=i)        # skip rows that have already been read
    df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a',                 # append data to the csv file
              chunksize=chunksize)      # size of data to append for each loop

In the end, this is what I came up with. Thanks for helping out!
import pandas as pd

df1 = pd.read_csv("read.csv")
run = 0

def crawl(a):
    global run
    run = run + 1
    #Create x, y
    df2 = pd.DataFrame([[x, y]], columns=["X", "Y"])
    if run == 1:
        df2.to_csv("output.csv")
    if run != 1:
        df2.to_csv("output.csv", header=None, mode="a")

df1["Column A"].apply(crawl)

I would suggest this:
with open("write.csv","a") as f:
df.to_csv(f,header=False,index=False)
The argument "a" will append the new df to an existing file and the file gets closed after the with block is finished, so you should keep all of your intermediary results.

I've found a solution to a similar problem by looping over the dataframe with iterrows() and saving each row to the csv file, which in your case could look something like this:
for ix, row in df.iterrows():
    # write the result back into the dataframe (iterrows yields copies)
    df.at[ix, 'Column A'] = crawl(row['Column A'])
    # if you wish to maintain the header, write it only for the first row
    if ix == 0:
        df.iloc[ix:ix + 1].to_csv('output.csv', mode='a', index=False, sep=',', encoding='utf-8')
    else:
        df.iloc[ix:ix + 1].to_csv('output.csv', mode='a', index=False, sep=',', encoding='utf-8', header=False)

Related

I have a large CSV (53 GB) and I need to process it in chunks in parallel

I tried out the pool.map approach given in similar answers, but I ended up with 8 files of 23 GB each, which is worse.
import os
import pandas as pd
from tqdm import tqdm
from multiprocessing import Pool

# csv file name to be read in
in_csv = 'mycsv53G.csv'

# get the number of lines of the csv file to be read
number_lines = sum(1 for row in tqdm(open(in_csv, encoding='latin1'), desc='Reading number of lines....'))
print(number_lines)

# size of rows of data to write to the CSV,
# you can change the row size according to your need
rowsize = 11367260  # decided based on your CPU core count

# start looping through data, writing it to a new file for each set
def reading_csv(filename):
    for i in tqdm(range(1, number_lines, rowsize), desc='Reading CSVs...'):
        print('in reading csv')
        df = pd.read_csv(in_csv, encoding='latin1',
                         low_memory=False,
                         header=None,
                         nrows=rowsize,   # number of rows to read at each loop
                         skiprows=i)      # skip rows that have been read
        # csv to write data to a new file with indexed name. input_1.csv etc.
        out_csv = './csvlist/input' + str(i) + '.csv'
        df.to_csv(out_csv,
                  index=False,
                  header=False,
                  mode='a',               # append data to csv file
                  chunksize=rowsize)      # size of data to append for each loop

def main():
    # get a list of file names
    files = os.listdir('./csvlist')
    file_list = [filename for filename in tqdm(files) if filename.split('.')[1] == 'csv']
    # set up your pool
    with Pool(processes=8) as pool:  # or whatever your hardware can support
        print('in Pool')
        # have your pool map the file names to dataframes
        try:
            df_list = pool.map(reading_csv, file_list)
        except Exception as e:
            print(e)

if __name__ == '__main__':
    main()
The above approach took 4 hours just to split the files concurrently, and parsing every CSV will take even longer.. not sure if multiprocessing helped or not!
Currently, I read the CSV file through this code:
import pandas as pd
import datetime
import numpy as np

for chunk in pd.read_csv(filename, chunksize=10**5, encoding='latin-1', skiprows=1, header=None):
    # chunk processing
    final_df = final_df.append(agg, ignore_index=True)

final_df.to_csv('Final_Output/' + output_name, encoding='utf-8', index=False)
It takes close to 12 hours to process the large CSV at once.
What improvements can be made here?
Any suggestions?
I am willing to try out dask now.. there seem to be no other options left.
I would read the csv file line by line and feed the lines into a queue from which the processes pick their tasks. This way, you don't have to split the file first.
See this example here: https://stackoverflow.com/a/53847284/4141279
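A minimal sketch of that idea (worker logic and file names are illustrative assumptions, not the linked answer verbatim): the parent streams lines into a bounded queue and each worker appends its processed output to its own file.

import multiprocessing as mp

def worker(queue, out_csv):
    # consume lines from the queue and append results to this worker's CSV
    with open(out_csv, 'a', encoding='latin1') as out:
        while True:
            line = queue.get()
            if line is None:          # sentinel: no more work
                break
            # ... process the line here ...
            out.write(line)

def main(in_csv='mycsv53G.csv', out_prefix='processed', n_workers=8):
    queue = mp.Queue(maxsize=10000)   # bounded, so the reader cannot outrun the workers
    workers = [mp.Process(target=worker, args=(queue, '%s_%d.csv' % (out_prefix, i)))
               for i in range(n_workers)]
    for w in workers:
        w.start()
    with open(in_csv, encoding='latin1') as f:
        next(f)                       # skip the header row
        for line in f:
            queue.put(line)
    for _ in workers:
        queue.put(None)               # one sentinel per worker
    for w in workers:
        w.join()

if __name__ == '__main__':
    main()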

Write data to CSV file without adding empty rows between sets of data

My program writes data into a CSV file using the pandas' to_csv function. At first run, the CSV file is originally empty and my code wrote data in it (which is supposed to be). At the second run, (take note that I'm still using the same CSV file), my code wrote data in it again (which is good). The problem is, there is a large number of empty rows between the data from the first run and the data from the second run.
Below is my code:
# place into a file
csvFile = open(file, 'a', newline='', encoding='utf-8')
if file_empty == True:
    df.to_csv(csvFile, sep=',', columns=COLS, index=False, mode='ab', encoding='utf-8')  # header true
else:
    df.to_csv(csvFile, sep=',', columns=COLS, header=False, index=False, mode='ab', encoding='utf-8')  # header false
I used the variable file_empty in order for the program to not write column headers if there is already data present in the CSV file.
Below is the sample output from the CSV file:
The last data from the first run is on line 396 of the CSV file,
the first row of data from the second run is on line 1308 of the same CSV file.
So there are empty rows from line 397 up to line 1307. How can I remove them so that when the program is run again, there are no empty rows between them?
Here is some sample code to append the data and remove blank lines; the lines below may help you:
import pandas
conso_frame = pandas.read_csv('consofile1.csv')
df_2 = pandas.read_csv('csvfile2.csv')
# Column Names should be same
conso_frame = conso_frame.append(df_2)
print(conso_frame)
conso_frame.dropna(subset = ["Intent"], inplace=True)
print(conso_frame)
conso_frame.to_csv('consofile1.csv', index=False)

Writing single CSV header with pandas

I'm parsing data into lists and using pandas to frame and write to a CSV file. First my data is taken into a set where inv, name, and date are all lists with numerous entries. Then I use concat to concatenate each iteration through the datasets I parse, writing to a CSV file like so:
counter = True
data = {'Invention': inv, 'Inventor': name, 'Date': date}

if counter is True:
    df = pd.DataFrame(data)
    df = df[['Invention', 'Inventor', 'Date']]
else:
    df = pd.concat([df, pd.DataFrame(data)])
    df = df[['Invention', 'Inventor', 'Date']]

with open('./new.csv', 'a', encoding='utf-8') as f:
    if counter is True:
        df.to_csv(f, index=False, header=True)
    else:
        df.to_csv(f, index=False, header=False)

counter = False
The counter = True statement resides outside of my iteration loop for all the data I'm parsing so it's not overwriting every time.
So it only runs through my data once to grab the first df set, and concats thereafter. The problem is that even though counter is only True on the first round, and this works for my first if-statement for df, it does not work for writing to the file.
What happens is that the header is written over and over again, regardless of the fact that counter is only True once. When I swap in header = False for when counter is True, it never writes the header.
I think this is because the concatenation of df is holding onto the header somehow, but other than that I cannot figure out the logic error.
Is there perhaps another way I could also write a header once and only once to the same CSV file?
It's hard to tell what might be going wrong without seeing the rest of the code. I've developed some test data and logic that works; you can adapt it to fit your needs.
Please try this:
import pandas as pd

early_inventions = ['wheel', 'fire', 'bronze']
later_inventions = ['automobile', 'computer', 'rocket']

early_names = ['a', 'b', 'c']
later_names = ['z', 'y', 'x']

early_dates = ['2000-01-01', '2001-10-01', '2002-03-10']
later_dates = ['2010-01-28', '2011-10-10', '2012-12-31']

early_data = {'Invention': early_inventions,
              'Inventor': early_names,
              'Date': early_dates}
later_data = {'Invention': later_inventions,
              'Inventor': later_names,
              'Date': later_dates}

datasets = [early_data, later_data]
columns = ['Invention', 'Inventor', 'Date']

header = True
for dataset in datasets:
    df = pd.DataFrame(dataset)
    df = df[columns]
    mode = 'w' if header else 'a'
    df.to_csv('./new.csv', encoding='utf-8', mode=mode, header=header, index=False)
    header = False
Alternatively, you can concatenate all of the data in the loop and write out the dataframe at the end:
df = pd.DataFrame(columns=columns)
for dataset in datasets:
    df = pd.concat([df, pd.DataFrame(dataset)])
df = df[columns]
df.to_csv('./new.csv', encoding='utf-8', index=False)
If your code cannot be made to conform to this API, you can forego writing the header in to_csv altogether. You can detect whether the output file exists and write the header to it first if it does not:
import os

fn = './new.csv'
if not os.path.exists(fn):
    with open(fn, mode='w', encoding='utf-8') as f:
        f.write(','.join(columns) + '\n')

# Now append the dataframe without a header
df.to_csv(fn, encoding='utf-8', mode='a', header=False, index=False)
I ran into the same problem. Writing a pandas dataframe to csv works fine if the dataframe is finished and nothing beyond the usual tutorial steps is needed.
However, if our program is still producing results and we are appending them, we hit the repetitive header-writing problem.
To solve this, consider the following function:
import os
import pandas as pd

def write_data_frame_to_csv_2(results_dict, path, header_list):
    df = pd.DataFrame.from_dict(data=results_dict, orient='index')
    filename = os.path.join(path, 'results_with_header.csv')
    if os.path.isfile(filename):
        mode = 'a'
        header = 0
    else:
        mode = 'w'
        header = header_list
    with open(filename, mode=mode) as f:
        df.to_csv(f, header=header, index_label='model')
If the file does not exist, we use write mode and the header is the header list. When the file already exists, we use append mode and set header to 0, which is falsy, so no header row is written.
The function receives a simple dictionary as a parameter; in my case I used:
model = {'model_name': {'acc': 0.9,
                        'loss': 0.3,
                        'tp': 840,
                        'tn': 450}
         }
Calling the function from the IPython console several times produces the expected result:
write_data_frame_to_csv_2(model, './', header_list)
The CSV generated:
model,acc,loss,tp,tn
model_name,0.9,0.3,840,450
model_name,0.9,0.3,840,450
model_name,0.9,0.3,840,450
model_name,0.9,0.3,840,450
Let me know if it helps.
Happy coding!
Just add this check before setting the header property if you are using an index to iterate over API calls that add data to the csv file:
if i > 0:
    dataset.to_csv('file_name.csv', index=False, mode='a', header=False)
else:
    dataset.to_csv('file_name.csv', index=False, mode='a', header=True)

Save columns as csv pandas

I'm trying to save specific columns to a csv using pandas. However, there is only one line in the output file. Is there anything wrong with my code? My desired output is to save all columns where d.count() == 1 to a csv file.
import pandas as pd

results = pd.read_csv('employee.csv', sep=';', delimiter=';', low_memory=False)
results['index'] = results.groupby('Name').cumcount()
d = results.pivot(index='index', columns='Name', values='Job')

for columns in d:
    if (d[columns]).count() > 1:
        (d[columns]).dropna(how='any').to_csv('output.csv')
An alternative might be to populate a new dataframe containing what you want to save, and then save that one time.
import pandas as pd

results = pd.read_csv('employee.csv', sep=';', delimiter=';', low_memory=False)
results['index'] = results.groupby('Name').cumcount()
d = results.pivot(index='index', columns='Name', values='Job')

keepcols = []
for columns in d:
    if (d[columns]).count() > 1:
        keepcols.append(columns)

# keepcols holds column labels of the pivoted frame d, so select from d
output_df = d[keepcols]
output_df.to_csv('output.csv')
No doubt you could rationalise the above, and reduce the memory footprint by saving the output directly without first creating an object to hold it, but it helps identify what's going on in the example.
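For instance, a rough sketch of that rationalisation (assuming the same file and column setup as above), selecting the qualifying columns of d with a boolean mask and writing them in a single call:

import pandas as pd

results = pd.read_csv('employee.csv', sep=';', low_memory=False)
results['index'] = results.groupby('Name').cumcount()
d = results.pivot(index='index', columns='Name', values='Job')

# d.count() counts non-null values per column; keep columns with more than one
d.loc[:, d.count() > 1].to_csv('output.csv')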

How to merge two csv files using multiprocessing with python pandas

I want to merge two csv files on a common column using python pandas.
With a 32-bit processor, it throws a memory error after about 2 GB of memory is used.
How can I do the same with multiprocessing or any other method?
import gc
import pandas as pd
csv1_chunk = pd.read_csv('/home/subin/Desktop/a.txt',dtype=str, iterator=True, chunksize=1000)
csv1 = pd.concat(csv1_chunk, axis=1, ignore_index=True)
csv2_chunk = pd.read_csv('/home/subin/Desktop/b.txt',dtype=str, iterator=True, chunksize=1000)
csv2 = pd.concat(csv2_chunk, axis=1, ignore_index=True)
new_df = csv1[csv1["PROFILE_MSISDN"].isin(csv2["L_MSISDN"])]
new_df.to_csv("/home/subin/Desktop/apyb.txt", index=False)
gc.collect()
Please help me to fix this.
Thanks in advance.
I think you only need one column from your second file (actually, only unique elements from this column are needed), so there is no need to load the whole data frame.
import pandas as pd
csv2 = pd.read_csv('/home/subin/Desktop/b.txt', usecols=['L_MSISDN'])
unique_msidns = set(csv2['L_MSISDN'])
If this still gives a memory error, try doing this in chunks:
chunk_reader = pd.read_csv('/home/subin/Desktop/b.txt', usecols=['L_MSISDN'], chunksize=1000)
unique_msidns = set()
for chunk in chunk_reader:
    unique_msidns = unique_msidns | set(chunk['L_MSISDN'])
Now, we can deal with the first data frame.
chunk_reader = pd.read_csv('/home/subin/Desktop/a.txt', chunksize=1000)
for chunk in chunk_reader:
    bool_idx = chunk['PROFILE_MSISDN'].isin(unique_msidns)
    # *append* selected lines from every chunk to a file (mode='a')
    # col names are not written
    chunk[bool_idx].to_csv('output_file', header=False, index=False, mode='a')
If you need column names to be written into the output file, you can do it with the first chunk (I've skipped it for code clarity).
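For example, a small variation on the loop above (reusing unique_msidns from the earlier snippet) lets only the first chunk write the column names:

chunk_reader = pd.read_csv('/home/subin/Desktop/a.txt', chunksize=1000)
write_header = True
for chunk in chunk_reader:
    bool_idx = chunk['PROFILE_MSISDN'].isin(unique_msidns)
    chunk[bool_idx].to_csv('output_file', header=write_header,
                           index=False, mode='a')
    write_header = False   # only the first chunk carries the header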
I believe it's safe (and probably faster) to increase chunksize.
I didn't test this code, so be sure to double check it.
