Writing a single CSV header with pandas - Python

I'm parsing data into lists and using pandas to build dataframes and write them to a CSV file. First my data goes into a dict, where inv, name, and date are all lists with numerous entries. Then, on each iteration through the datasets I parse, I use concat to concatenate and append to a CSV file like so:
counter = True
data = {'Invention': inv, 'Inventor': name, 'Date': date}
if counter is True:
    df = pd.DataFrame(data)
    df = df[['Invention', 'Inventor', 'Date']]
else:
    df = pd.concat([df, pd.DataFrame(data)])
    df = df[['Invention', 'Inventor', 'Date']]
with open('./new.csv', 'a', encoding='utf-8') as f:
    if counter is True:
        df.to_csv(f, index=False, header=True)
    else:
        df.to_csv(f, index=False, header=False)
counter = False
The counter = True statement resides outside of my iteration loop over all the data I'm parsing, so it's not reset on every pass. This means the first branch runs only once, to build the initial df, and everything after that is concatenated onto it. The problem is that even though counter is True only on the first round, and the first if-statement behaves accordingly, the write-to-file logic does not.
What happens is that the header is written over and over again, regardless of the fact that counter is only True once. When I instead pass header=False for the counter is True case, the header is never written at all.
I suspect the concatenation of df is somehow holding onto the header, but beyond that I cannot find the logic error.
Is there perhaps another way I could also write a header once and only once to the same CSV file?

It's hard to tell what might be going wrong without seeing the rest of the code. I've developed some test data and logic that works; you can adapt it to fit your needs.
Please try this:
import pandas as pd

early_inventions = ['wheel', 'fire', 'bronze']
later_inventions = ['automobile', 'computer', 'rocket']

early_names = ['a', 'b', 'c']
later_names = ['z', 'y', 'x']

early_dates = ['2000-01-01', '2001-10-01', '2002-03-10']
later_dates = ['2010-01-28', '2011-10-10', '2012-12-31']

early_data = {'Invention': early_inventions,
              'Inventor': early_names,
              'Date': early_dates}
later_data = {'Invention': later_inventions,
              'Inventor': later_names,
              'Date': later_dates}

datasets = [early_data, later_data]
columns = ['Invention', 'Inventor', 'Date']

header = True
for dataset in datasets:
    df = pd.DataFrame(dataset)
    df = df[columns]
    mode = 'w' if header else 'a'
    df.to_csv('./new.csv', encoding='utf-8', mode=mode, header=header, index=False)
    header = False
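For reference, running this loop should leave new.csv with a single header row:

Invention,Inventor,Date
wheel,a,2000-01-01
fire,b,2001-10-01
bronze,c,2002-03-10
automobile,z,2010-01-28
computer,y,2011-10-10
rocket,x,2012-12-31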
Alternatively, you can concatenate all of the data in the loop and write out the dataframe at the end:
df = pd.DataFrame(columns=columns)
for dataset in datasets:
    df = pd.concat([df, pd.DataFrame(dataset)])
df = df[columns]
df.to_csv('./new.csv', encoding='utf-8', index=False)
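A small variant on the same idea, assuming the same datasets and columns lists as above: collect the frames in a list and concatenate once at the end, which avoids growing the dataframe repeatedly inside the loop:

frames = [pd.DataFrame(dataset)[columns] for dataset in datasets]
df = pd.concat(frames, ignore_index=True)
df.to_csv('./new.csv', encoding='utf-8', index=False)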
If your code cannot be made to conform to this API, you can forego writing the header in to_csv altogether. You can detect whether the output file exists and write the header to it first if it does not:
import os

fn = './new.csv'
if not os.path.exists(fn):
    with open(fn, mode='w', encoding='utf-8') as f:
        f.write(','.join(columns) + '\n')

# Now append the dataframe without a header
df.to_csv(fn, encoding='utf-8', mode='a', header=False, index=False)

I found the same problem. Writing a pandas dataframe to CSV works fine when the dataframe is complete and nothing beyond the tutorial case is needed.
However, if our program produces results incrementally and we keep appending them, we run into this repeated-header problem.
In order to solve this consider the following function:
import os
import pandas as pd

def write_data_frame_to_csv_2(data_dict, path, header_list):
    df = pd.DataFrame.from_dict(data=data_dict, orient='index')
    filename = os.path.join(path, 'results_with_header.csv')
    if os.path.isfile(filename):
        mode = 'a'
        header = 0
    else:
        mode = 'w'
        header = header_list
    with open(filename, mode=mode) as f:
        df.to_csv(f, header=header, index_label='model')
If the file does not exist, we use write mode and header is set to the header list. If the file already exists, we use append mode and header is set to 0, which pandas treats as False, so no header is written.
The function receives a simple dictionary as a parameter. In my case I used:
model = {'model_name': {'acc': 0.9,
                        'loss': 0.3,
                        'tp': 840,
                        'tn': 450}}
Calling the function from the IPython console several times produces the expected result (here header_list matches the metric keys):
header_list = ['acc', 'loss', 'tp', 'tn']
write_data_frame_to_csv_2(model, './', header_list)
CSV generated:
model,acc,loss,tp,tn
model_name,0.9,0.3,840,450
model_name,0.9,0.3,840,450
model_name,0.9,0.3,840,450
model_name,0.9,0.3,840,450
Let me know if it helps.
Happy coding!

Just add this check before setting the header property, if you are using an index to iterate over API calls that append data to the CSV file:
if i > 0:
    dataset.to_csv('file_name.csv', index=False, mode='a', header=False)
else:
    dataset.to_csv('file_name.csv', index=False, mode='a', header=True)
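For context, a minimal sketch of where that check sits; num_pages and fetch_page are hypothetical stand-ins for whatever your API provides:

import pandas as pd

# num_pages and fetch_page(i) are placeholders for your paginated API
for i in range(num_pages):
    dataset = pd.DataFrame(fetch_page(i))
    if i > 0:
        dataset.to_csv('file_name.csv', index=False, mode='a', header=False)
    else:
        # first write gets the header; mode='w' also resets the file on each run
        dataset.to_csv('file_name.csv', index=False, mode='w', header=True)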

Related

Python: use the arguments in enumerate as variable names for the nested loop

I would like to generate two data frames (and subsequently export them to CSV) from two CSV files. I came up with the following (incomplete) code, which focuses on dealing with a.csv. I create an empty data frame (df_a) to store rows from the iterrows iteration (df_b is missing).
The problem is that I do not know how to process b.csv without manually declaring all the empty dataframes in advance (i.e. df_a = pd.DataFrame(columns=['start', 'end']) and df_b = pd.DataFrame(columns=['start', 'end'])).
I hope I can use the arguments of enumerate (i.e. the contents of file) as variable names (i.e. something like df_file) for the data frames, instead of df_a and df_b.
list_files = ['a.csv', 'b.csv']
for i, file in enumerate(list_files):
    df = pd.read_csv(file)
    # Create empty data frame to store data for each iteration below
    df_a = pd.DataFrame(columns=['start', 'end'])
    for index, row in df.iterrows():
        var = df.loc[index, 'name']
        df_new = SomeFunction(var)
        # Append a new row to the empty data frame
        dicts = {'start': df_new['column1'], 'end': df_new['column2']}
        df_dicts = pd.DataFrame([dicts])
        df_a = pd.concat([df_a, df_dicts], ignore_index=True)
    df_a_csv = df_a.to_csv('df_a.csv')
Ideally, it could look a bit like this (note: file is used as part of the variable name df_file):
list_files = ['a.csv', 'b.csv']
for i, file in enumerate(list_files):
    df = pd.read_csv(file)
    # Create empty data frame to store data for each iteration below
    df_file = pd.DataFrame(columns=['start', 'end'])
    for index, row in df.iterrows():
        var = df.loc[index, 'name']
        df_new = SomeFunction(var)
        # Append a new row to the empty data frame
        dicts = {'start': df_new['column1'], 'end': df_new['column2']}
        df_dicts = pd.DataFrame([dicts])
        df_file = pd.concat([df_file, df_dicts], ignore_index=True)
    df_file_csv = df_file.to_csv('df_' + file + '.csv')
Different approaches are also welcome. I just need to save the dataframe outcome for each input file. Many Thanks!
SomeFunction(var) aside, can you get the result you seek without pandas for the most part?
import csv

## -----------
## mocked: stand-in for the real SomeFunction so the script runs
## -----------
def SomeFunction(var):
    return {"column1": var, "column2": var}
## -----------

list_files = ["a.csv", "b.csv"]

for file_path in list_files:
    results = []
    with open(file_path, "r", newline="") as file_in:
        for row in csv.DictReader(file_in):
            df_new = SomeFunction(row['name'])
            start, end = df_new['column1'], df_new['column2']
            results.append({"start": start, "end": end})

    with open(f"df_{file_path}", "w", newline="") as file_out:
        writer = csv.DictWriter(file_out, fieldnames=list(results[0].keys()))
        writer.writeheader()
        writer.writerows(results)
Note that you can also stream rows from the input to the output if you would rather not read them all into memory.
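A rough sketch of that streaming variant, using the same mocked SomeFunction and file names as above:

import csv

for file_path in ["a.csv", "b.csv"]:
    with open(file_path, "r", newline="") as file_in, \
         open(f"df_{file_path}", "w", newline="") as file_out:
        writer = csv.DictWriter(file_out, fieldnames=["start", "end"])
        writer.writeheader()
        # each result row is written as soon as it is computed
        for row in csv.DictReader(file_in):
            df_new = SomeFunction(row["name"])
            writer.writerow({"start": df_new["column1"], "end": df_new["column2"]})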
There are many things we could comment on, but I understand your concern is avoiding a separate loop for a and for b, given that you already list them in list_files.
If this is the issue, what about doing something like this?
# CHANGED: list only the stems of the base names; we will use them for several things
file_name_stems = ["a", "b"]
# CHANGED: we keep a dictionary for the dataframes
dataframes = {}
# CHANGED: did you really need the enumerate?
for file_stem in file_name_stems:
    filename = file_stem + ".csv"
    df = pd.read_csv(filename)
    # Create empty data frame to store data for each iteration below
    # CHANGED: let's use df_x as a generic name. Knowing your code, you will surely find better names
    df_x = pd.DataFrame(columns=['start', 'end'])
    for index, row in df.iterrows():
        var = df.loc[index, 'name']
        df_new = SomeFunction(var)
        # Append a new row to the empty data frame
        dicts = {'start': df_new['column1'], 'end': df_new['column2']}
        df_dicts = pd.DataFrame([dicts])
        df_x = pd.concat([df_x, df_dicts], ignore_index=True)
    # CHANGED: and now, we write to the file
    df_x.to_csv(f'df_{file_stem}.csv')
    # CHANGED: and save the dataframe to a dictionary in case you need it later
    dataframes[file_stem] = df_x
So, instead of listing the exact filenames, you can list the stems of their names and then compose the source filename and the output one from each stem.
Another option could be to list the source filenames and replace some part of the filename to generate the output filename:
list_files = ["a.csv", "b.csv"]
for filename in list_files:
# ...
output_file_name = filename.replace(".csv", "_df.csv")
# this produces "a_df.csv" and "b_df.csv"
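If you would rather not hard-code the extension, pathlib can extract the stem for you; a quick sketch:

from pathlib import Path

for filename in ["a.csv", "b.csv"]:
    # Path(...).stem drops the extension: "a.csv" -> "a"
    output_file_name = f"df_{Path(filename).stem}.csv"
    # this produces "df_a.csv" and "df_b.csv"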
Does any of this look to solve your problem? :)

How to turn multiple rows of dictionaries from a file into a dataframe

I have a script that I use to fire orders from a CSV file to an exchange, using a for loop.
data = pd.read_csv('orderparameters.csv')
df = pd.DataFrame(data)

for i in range(len(df)):
    order = Client.new_order(...
                             ...)
    file = open('orderData.txt', 'a')
    original_stdout = sys.stdout
    with file as f:
        sys.stdout = f
        print(order)
    file.close()
    sys.stdout = original_stdout
I put the response from the exchange in a txt file like this, one dictionary-like response per line. I want to turn the multiple responses into one single dataframe, with one row per response and one column per field (I mocked that up manually).
I tried:
data = pd.read_csv('orderData.txt', header=None)
dfData = pd.DataFrame(data)
print(dfData)
but I did not get the structure I wanted. I have also tried:
data = pd.read_csv('orderData.txt', header=None)
organised = data.apply(pd.Series)
print(organised)
but I got the same output.
I can print order['symbol'] within the loop etc.
I'm not certain whether I should be populating this dataframe within the loop, or by capturing and writing the response and processing it afterwards. Appreciate your advice.
It looks like you are getting JSON strings back. You could parse each one into a dictionary and then create a dataframe from the list of dictionaries. Perhaps try something like this (it no longer needs a file):
import json

import pandas as pd

data = pd.read_csv('orderparameters.csv')
df = pd.DataFrame(data)
response_data = []
for i in range(len(df)):
    order_json = Client.new_order(...
                                  ...)
    response_data.append(json.loads(order_json))
response_dataframe = pd.DataFrame(response_data)
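With all responses collected, a single write at the end avoids the repeated-header issue entirely (the output filename here is just an example):

response_dataframe.to_csv('orderData.csv', index=False)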
If I understand your question correctly, you can simply do the following:
import pandas as pd

orders = pd.read_csv('orderparameters.csv')
responses = pd.DataFrame(Client.new_order(...) for _ in range(len(orders)))

Python script: replacing values in CSVs

In my Python script, I'm trying to read CSV files and, if a file has a column "PROD_NAME", find a particular value within that column and replace it with another value. Currently, whenever I run the script, everything goes through the "try" clause and acts like it is working, but when I look into the file itself, the values remain unchanged. Nothing hits the "except" clause, and the command prompt prints "Replaced" for each file it supposedly changed. Any help would be appreciated. Thanks!
def worker():
    filenames = glob.glob(dest_dir + '\\*.csv')
    for filename in filenames:  # this is the loop over files
        my_file = Path(os.path.join(dest_dir, filename))
        try:
            with open(filename) as f:
                # read column headers only - to get the list of columns
                df1 = pd.read_csv(filename, skiprows=1, encoding='ISO-8859-1')
                dtypes = {}
                #print(filename, df1)
                for col in df1.columns:  # make all columns text, to avoid formatting errors
                    dtypes[col] = 'str'
                df1 = pd.read_csv(filename, dtype=dtypes, skiprows=1, encoding='ISO-8859-1')
                if 'PROD_NAME' in df1.columns:
                    df1 = df1.replace("NA_NRF", "FA_GUAR")
                    print("Replaced " + filename)
        except:
            if 'PROD_NAME' in df1.columns:
                print(filename)

worker()
Original DF:

!4  PROD_NAME  ENTRY_YEAR
*   NA_NRF     2014

The NA_NRF is supposed to change to FA_GUAR.
This should do the job:
with open(filename) as f:
    df_before = pd.read_csv(f, sep=';')

for i in df_before.columns.values:
    if i == "PROD_NAME":
        df_after = df_before.replace("NA_NRF", "FA_GUAR")
        df_after.to_csv(filename, index=False, sep=';')
    else:
        print("nothing to change")
When I added sep=';' it stopped giving me headaches about quotes...
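The column loop can also be collapsed into a single membership test; a sketch of the same idea, assuming the same separator:

df = pd.read_csv(filename, sep=';')
if 'PROD_NAME' in df.columns:
    df.replace("NA_NRF", "FA_GUAR").to_csv(filename, index=False, sep=';')
else:
    print("nothing to change")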

Mapping CSV Header using a Dictionary

I have a reference file that looks like this:
Experiment,Array,Drug
8983,Genechip,Famotidine
8878,Microarray,Dicyclomine
8988,Genechip,Etidronate
8981,Microarray,Flunarizine
I successfully created a dictionary mapping the Experiment numbers to the Drug name using the following:
import csv

reader = csv.reader(open(r'C:\Users\Troy\Documents\ExPSRef.txt'))
# Configure dictionary
result = {}
for row in reader:
    key = row[0]
    result[key] = row[2]
di = result
I want to map this dictionary onto the header of another file, whose header consists of the experiment numbers. It currently looks like this:
Gene,8988,8981,8878,8983
Vcp,0.011,-0.018,-0.032,-0.034
Ube2d2,0.034,0.225,-0.402,0.418
Becn1,0.145,-0.108,-0.421,-0.048
Lypla2,-0.146,-0.026,-0.101,-0.011
But it should look like this:
Gene,Etidronate,Flunarizine,Dicyclomine,Famotidine
Vcp,0.011,-0.018,-0.032,-0.034
Ube2d2,0.034,0.225,-0.402,0.418
Becn1,0.145,-0.108,-0.421,-0.048
Lypla2,-0.146,-0.026,-0.101,-0.011
I tried using:
import csv
import pandas as pd

reader = csv.reader(open(r'C:\Users\Troy\Documents\ExPSRef.txt'))
result = {}
for row in reader:
    key = row[0]
    result[key] = row[2]
di = result

df = pd.read_csv(r'C:\Users\Troy\Documents\ExPS2.txt')
df['row[0]'].replace(di, inplace=True)
but it returned a KeyError: 'row[0]'.
I tried the following as well, even transposing in order to merge:
import pandas as pd

df1 = pd.read_csv(r'C:\Users\Troy\Documents\ExPS2.txt').transpose()
df2 = pd.read_csv(r'C:\Users\Troy\Documents\ExPSRef.txt', delimiter=',', engine='python')
df3 = df1.merge(df2)
df4 = df3.set_index('Drug').drop(['Experiment', 'Array'], axis=1)
df4.index.name = 'Drug'
print(df4)
and this time received MergeError('No common columns to perform merge on').
Is there a simpler way to map my dictionary to the header that would work?
One thing to keep in mind is making sure that the keys of the mapper dictionary and the header values they are mapped onto have the same data type.
Here, one is a string and the other an integer. So at read time we stop pandas from inferring the dtype by setting it to str for the reference DF.
df1 = pd.read_csv(r'C:\Users\Troy\Documents\ExPS2.txt')              # Original
df2 = pd.read_csv(r'C:\Users\Troy\Documents\ExPSRef.txt', dtype=str) # Reference
Convert the columns of the original DF to their series representation, then replace the old values (the experiment numbers) with the new drug names retrieved from the reference DF.
df1.columns = df1.columns.to_series().replace(df2.set_index('Experiment').Drug)
df1
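If you already have the di dictionary from the question, DataFrame.rename does the same mapping in one call; a minimal sketch (di must map strings to strings, hence the dtype handling above):

df1 = df1.rename(columns=di)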
I used csv for the whole script. This fixes the header as you wanted and saves the result into a new file. The new filename can be replaced with the original one if that's what you prefer. This program is written for Python 3.
import csv

with open('sample.txt', 'r') as ref:
    reader = csv.reader(ref)
    # skip header line
    next(reader)
    # make dictionary
    di = dict([(row[0], row[2]) for row in reader])

data = []
with open('sample1.txt', 'r') as df:
    reader = csv.reader(df)
    header = next(reader)
    new_header = [header[0]] + [di[i] for i in header if i in di]
    data = list(reader)

# used to make a new file; can also replace with the same file name
with open('new_sample1.txt', 'w', newline='') as df_new:
    writer = csv.writer(df_new)
    writer.writerow(new_header)
    writer.writerows(data)

Pandas: Continuously write from function to csv

I have a function set up for pandas that runs through a large number of rows in input.csv and puts the results into a Series. It then writes the Series to output.csv.
However, if the process is interrupted (for example by an unexpected event), the program terminates and all data that would have gone into the CSV is lost.
Is there a way to write the data continuously to the CSV, regardless of whether the function finishes for all rows?
Preferably, each time the program starts, a blank output.csv would be created and then appended to while the function is running.
import pandas as pd

df = pd.read_csv("read.csv")

def crawl(a):
    # Create x, y
    return pd.Series([x, y])

df[["Column X", "Column Y"]] = df["Column A"].apply(crawl)
df.to_csv("write.csv", index=False)
This is a possible solution that will append the data to a new file as it reads the csv in chunks. If the process is interrupted the new file will contain all the information up until the interruption.
import pandas as pd

# csv file to be read in
in_csv = '/path/to/read/file.csv'
# csv to write data to
out_csv = 'path/to/write/file.csv'

# get the number of lines of the csv file to be read
number_lines = sum(1 for row in open(in_csv))

# size of chunks of data to write to the csv
chunksize = 10

# start looping through the data, writing it to a new file chunk by chunk
for i in range(1, number_lines, chunksize):
    df = pd.read_csv(in_csv,
                     header=None,
                     nrows=chunksize,  # number of rows to read in each loop
                     skiprows=i)       # skip rows that have already been read
    df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a',             # append data to the csv file
              chunksize=chunksize)  # size of data to append in each loop
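For what it's worth, read_csv can also do the chunking itself via the chunksize argument, which returns an iterator of dataframes; a sketch using the same in_csv and out_csv names:

for chunk in pd.read_csv(in_csv, header=None, chunksize=chunksize):
    chunk.to_csv(out_csv, index=False, header=False, mode='a')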
In the end, this is what I came up with. Thanks for helping out!
import pandas as pd

df1 = pd.read_csv("read.csv")
run = 0

def crawl(a):
    global run
    run = run + 1
    # Create x, y
    df2 = pd.DataFrame([[x, y]], columns=["X", "Y"])
    if run == 1:
        df2.to_csv("output.csv")
    if run != 1:
        df2.to_csv("output.csv", header=None, mode="a")

df1["Column A"].apply(crawl)
I would suggest this:
with open("write.csv","a") as f:
df.to_csv(f,header=False,index=False)
The argument "a" will append the new df to an existing file and the file gets closed after the with block is finished, so you should keep all of your intermediary results.
I found a solution to a similar problem by looping over the dataframe with iterrows() and saving each row to the CSV file. In your case it could be something like this:
for ix, row in df.iterrows():
    row['Column A'] = crawl(row['Column A'])
    # if you wish to maintain the header
    if ix == 0:
        df.iloc[ix: ix + 1].to_csv('output.csv', mode='a', index=False, sep=',', encoding='utf-8')
    else:
        df.iloc[ix: ix + 1].to_csv('output.csv', mode='a', index=False, sep=',', encoding='utf-8', header=False)
