I want to know if it is possible to use the pandas to_csv() function to add a dataframe to an existing csv file. The csv file has the same structure as the loaded data.
You can specify a python write mode in the pandas to_csv function. For append it is 'a'.
In your case:
df.to_csv('my_csv.csv', mode='a', header=False)
The default mode is 'w'.
If the file initially might be missing, you can make sure the header is printed at the first write using this variation:
import os
output_path = 'my_csv.csv'
df.to_csv(output_path, mode='a', header=not os.path.exists(output_path))
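As a minimal sketch of the variation above (the temporary directory and the two small frames are made up for illustration), appending twice to a path that does not exist yet should write the header only once:

```python
import os
import tempfile

import pandas as pd

# Hypothetical output location inside a throwaway directory
output_path = os.path.join(tempfile.mkdtemp(), "my_csv.csv")

first = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
second = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

for df in (first, second):
    # The header is written only while the file does not exist yet
    df.to_csv(output_path, mode="a", index=False,
              header=not os.path.exists(output_path))

with open(output_path) as f:
    lines = f.read().splitlines()
print(lines)
```

index=False is used here so that the file contains only the data columns; drop it if the existing csv also stores the index.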
You can append to a csv by opening the file in append mode:
with open('my_csv.csv', 'a') as f:
    df.to_csv(f, header=False)
If this was your csv, foo.csv:
,A,B,C
0,1,2,3
1,4,5,6
If you read that and then append, for example, df + 6:
In [1]: df = pd.read_csv('foo.csv', index_col=0)
In [2]: df
Out[2]:
   A  B  C
0  1  2  3
1  4  5  6
In [3]: df + 6
Out[3]:
    A   B   C
0   7   8   9
1  10  11  12
In [4]: with open('foo.csv', 'a') as f:
             (df + 6).to_csv(f, header=False)
foo.csv becomes:
,A,B,C
0,1,2,3
1,4,5,6
0,7,8,9
1,10,11,12
with open(filename, 'a') as f:
    df.to_csv(f, header=f.tell()==0)
This creates the file if it does not exist; otherwise it appends.
It adds the header if the file is being created (the write position is 0); otherwise it skips it.
A little helper function I use with some header checking safeguards to handle it all:
import os
import pandas as pd

def appendDFToCSV_void(df, csvFilePath, sep=","):
    # Create the file (with header) if it does not exist yet
    if not os.path.isfile(csvFilePath):
        df.to_csv(csvFilePath, mode='a', index=False, sep=sep)
        return
    # Read only the header row of the existing file
    csv_columns = pd.read_csv(csvFilePath, nrows=1, sep=sep).columns
    if len(df.columns) != len(csv_columns):
        raise Exception("Columns do not match!! Dataframe has " + str(len(df.columns)) + " columns. CSV file has " + str(len(csv_columns)) + " columns.")
    elif not (df.columns == csv_columns).all():
        raise Exception("Columns and column order of dataframe and csv file do not match!!")
    else:
        df.to_csv(csvFilePath, mode='a', index=False, sep=sep, header=False)
I initially started with pyspark dataframes and got type conversion errors (when converting to pandas dataframes and then appending to csv) because of the schema/column types in my pyspark dataframes.
I solved the problem by forcing all columns in each dataframe to be of type string and then appending to the csv as follows:
with open('testAppend.csv', 'a') as f:
    df2.toPandas().astype(str).to_csv(f, header=False)
This is how I did it in 2021
Let us say I have a csv sales.csv which has the following data in it:
sales.csv:
Order Name,Price,Qty
oil,200,2
butter,180,10
To add more rows, I can load them into a dataframe and append it to the csv like this:
import pandas
data = [
    ['matchstick', '60', '11'],
    ['cookies', '10', '120']
]
dataframe = pandas.DataFrame(data)
dataframe.to_csv("sales.csv", index=False, mode='a', header=False)
and the output will be:
Order Name,Price,Qty
oil,200,2
butter,180,10
matchstick,60,11
cookies,10,120
A bit late to the party but you can also use a context manager, if you're opening and closing your file multiple times, or logging data, statistics, etc.
from contextlib import contextmanager
import pandas as pd

@contextmanager
def open_file(path, mode):
    file_to = open(path, mode)
    try:
        yield file_to
    finally:
        # close the file even if the body of the with block raises
        file_to.close()

## later
saved_df = pd.DataFrame(data)
# open in append mode and write to the handle the context manager yields
with open_file('yourcsv.csv', 'a') as outfile:
    saved_df.to_csv(outfile, header=False)
I have got multiple csv files which look like this:
ID,Text,Value
1,"I play football",10
2,"I am hungry",12
3,"Unfortunately",I get an error",15
I am currently importing the data using the pandas read_csv() function.
df = pd.read_csv(filename, sep = ',', quotechar='"')
This works for the first two rows in my csv file, but I get an error in row 3. The reason is that the 'Text' column contains a quote character followed by a comma before the end of the column.
ParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 4
Is there a way to solve this issue?
Expected output:
ID Text Value
1 I play football 10
2 I am hungry 12
3 Unfortunately, I get an error 15
You can try to fix the CSV using re module:
import re
import pandas as pd
from io import StringIO
with open("your_file.csv", "r") as f_in:
    s = re.sub(
        r'"(.*)"',
        lambda g: '"' + g.group(1).replace('"', "\\") + '"',
        f_in.read(),
    )
df = pd.read_csv(StringIO(s), sep=r",", quotechar='"', escapechar="\\")
print(df)
Prints:
ID Text Value
0 1 I play football 10
1 2 I am hungry 12
2 3 Unfortunately,I get an error 15
One (not so flexible) approach would be to first remove all " quotes from the csv and then enclose the elements of the specific column in "" quotes (this is done to avoid misinterpreting the "," separator while parsing), like this:
import csv
# Specify the column index (0-based)
column_index = 1
# Open the input CSV file
with open('input.csv', 'r') as f:
    reader = csv.reader(f)
    # Open the output CSV file
    with open('output.csv', 'w', newline='') as g:
        writer = csv.writer(g)
        # Iterate through the rows of the input CSV file
        for row in reader:
            # Replace the " character with an empty string
            row[column_index] = row[column_index].replace('"', '')
            # Enclose the modified element in "" quotes
            row[column_index] = f'"{row[column_index]}"'
            # Write the modified row to the output CSV file
            writer.writerow(row)
This code creates a new, modified csv file.
Your problematic csv row will then look like this:
3,"Unfortunately,I get an error",15
Then you can import the data like you did: df = pd.read_csv(filename, sep = ',', quotechar='"')
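As a hedged alternative sketch, assuming the text column is the only one that can contain commas or stray quotes: splitting each raw line on its first and last comma avoids relying on a CSV parser to make sense of the malformed quoting. The helper name repair_line is invented for this example, and the sample data is the one from the question.

```python
import csv
import io

# Sample input taken from the question; the third data row has a stray quote
raw = '''ID,Text,Value
1,"I play football",10
2,"I am hungry",12
3,"Unfortunately",I get an error",15
'''

def repair_line(line):
    # Split on the first and the last comma only, so commas inside the
    # text column are left alone
    first, rest = line.split(",", 1)
    text, last = rest.rsplit(",", 1)
    # Drop every quote character from the text column
    return [first, text.replace('"', ""), last]

rows = [repair_line(line) for line in raw.splitlines() if line]

# csv.writer quotes a field again only where it contains the delimiter
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
print(buffer.getvalue())
```

The repaired text in buffer can then be passed to pd.read_csv via StringIO, or written to a new file.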
To automate this conversion for all csv files within a directory:
import csv
import glob
# Specify the column index (0-based)
column_index = 1
# Get a list of all CSV files in the current directory
csv_files = glob.glob('*.csv')
# Iterate through the CSV files
for csv_file in csv_files:
    # Open the input CSV file
    with open(csv_file, 'r') as f:
        reader = csv.reader(f)
        # Open the output CSV file
        output_file = csv_file.replace('.csv', '_new.csv')
        with open(output_file, 'w', newline='') as g:
            writer = csv.writer(g)
            # Iterate through the rows of the input CSV file
            for row in reader:
                # Replace the " character with an empty string
                row[column_index] = row[column_index].replace('"', '')
                # Enclose the modified element in "" quotes
                row[column_index] = f'"{row[column_index]}"'
                # Write the modified row to the output CSV file
                writer.writerow(row)
This gives the new csv files the same names as the old ones, but ending in "_new.csv" instead of ".csv".
A possible solution:
df = pd.read_csv(filename, sep=r'(?<=\d),|,(?=\d)', engine='python')
df = df.reset_index().set_axis(['ID', 'Text', 'Value'], axis=1)
df['Text'] = df['Text'].replace('\"', '', regex=True)
Another possible solution:
df = pd.read_csv(StringIO(text), sep='\t')
df[['ID', 'Text']] = df.iloc[:, 0].str.split(',', expand=True, n=1)
df[['Text', 'Value']] = df['Text'].str.rsplit(',', expand=True, n=1)
df = df.drop(df.columns[0], axis=1).assign(
    Text=df['Text'].replace('\"', '', regex=True))
Output:
ID Text Value
0 1 I play football 10
1 2 I am hungry 12
2 3 Unfortunately,I get an error 15
I write into a csv by this function:
def write_csv(hlavicka: Tuple[str, ...], zaznam: list, pomocne_csv: str) -> None:
    if not os.path.isfile(pomocne_csv):
        with open(pomocne_csv, "w", encoding=cfg.ENCODING, newline="") as soubor:
            writer = csv.writer(soubor, delimiter=cfg.DELIMITER)
            writer.writerow(hlavicka)
    with open(pomocne_csv, "a", encoding=cfg.ENCODING, newline="") as soubor:
        writer = csv.writer(soubor, delimiter=cfg.DELIMITER)
        writer.writerows([zaznam])
However, when I open the csv in MS Office, I see that long numbers are in the scientific notation. For example 102043292003060000 is displayed as 1.02E+17. Of course, I put 102043292003060000 into my write_csv() function.
The problem is that when I read the csv using:
def generuj_zaznamy(input_path):
    with open(input_path, "r", encoding="cp1250") as file_object:
        reader = csv.reader(file_object, delimiter=";")
        for entry in enumerate(reader, start=1):
            print(entry)
I got 1.02E+17 instead of 102043292003060000.
Is there a way to format the cell as a number directly in csv.writer or csv.reader? Thanks a lot.
If you open the csv file in a text editor such as notepad.exe, you should see the long numbers in full. So the problem comes from Excel, not from csv.writer.
If you want Excel to show the long numbers from the csv file accurately, create a new xlsx file, use Data->Get External Data->From Text to select the csv file for import, and then set the data format of the column to Text.
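A small round trip illustrates that the csv module itself keeps every digit, assuming the file is not opened and re-saved by Excel in between. The file name and the "id" header are invented for this sketch; the number is the one from the question.

```python
import csv
import os
import tempfile

# Hypothetical file in a throwaway directory
path = os.path.join(tempfile.mkdtemp(), "long_numbers.csv")

with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(["id"])
    # csv.writer stores the integer as its plain decimal text
    writer.writerow([102043292003060000])

with open(path, "r", encoding="utf-8", newline="") as f:
    # csv.reader returns every field as a string, unchanged
    rows = list(csv.reader(f, delimiter=";"))

print(rows)
```

If the value comes back as 1.02E+17, the file was most likely saved again from a spreadsheet program after it was written.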
Edited:
I tried the code, and it seems the problem also happens with pandas.DataFrame.to_csv(), not only with csv.writer(), once the number has 20 or more digits, which is outside the range of np.int64.
I read the official documentation, and it seems that the float_format argument can't solve this problem.
The solution I can offer for now, provided you can read the original data as strings for numbers of that length, is this:
import numpy as np
import pandas as pd
import csv
df = pd.DataFrame(["3100000035155588379531799826432", "3100000035155588433002733375488", "3100000035155588355694446120960"])
df = "\t" + df
print(df)
df.to_csv("test.csv", index=False, header=False)
rng = np.random.default_rng(0)
big_nums = rng.random(10) * (10**19) # OverflowError while comes to 10**20
df = pd.DataFrame(big_nums, dtype=np.int64).astype(str)
# df = "\t" + df
print(df)
df.to_csv("test.csv", index=False, header=False)
and the output will look like this:
0
0 \t3100000035155588379531799826432
1 \t3100000035155588433002733375488
2 \t3100000035155588355694446120960
0
0 6369616873214542848
1 2697867137638703104
2 409735239361946880
3 165276355285290944
4 8132702392002723840
5 9127555772777217024
6 6066357757671798784
7 7294965609839983616
8 5436249914654228480
9 -9223372036854775808
import csv
f = csv.reader(open('lmt.csv','r')) # open input file for reading
Date, Open, Hihh, mLow, Close, Volume = zip(*f) # split it into separate columns
ofile = open("MYFILEnew1.csv", "wb") # output csv file
c = csv.writer(ofile)
item = Date
item2 = Volume
rows = zip(item, item)
i = 0
for row in item2:
    print row
    writer = csv.writer(ofile, delimiter='\t')
    writer.writerow([row])
ofile.close()
Above is what I have produced so far.
As you can see in the 3rd line, I have extracted 6 columns from a spreadsheet.
I want to create a .csv file under the name of MYFILEnew1.csv which only has two columns, Date and Volume.
What I have above creates a .csv that only writes Volume column into the first column of the new .csv file.
How would you go about placing Date into the second column?
For example
Date Open High Low Close Volume
17-Feb-16 210 212.97 209.1 212.74 1237731
is what I have, and I'd like to produce a new csv file that has
Date Volume
17-Feb-16 1237731
If I understand your question correctly, you can achieve that very easily using pandas' read_csv and to_csv (@downvoter: Could you explain your downvote, please!?); the final solution to your problem can be found below under EDIT2:
import pandas as pd
# this assumes that your file is comma separated
# if it is e.g. tab separated you should use pd.read_csv('data.csv', sep = '\t')
df = pd.read_csv('data.csv')
# select desired columns
df = df[['Date', 'Volume']]
#write to the file (tab separated)
df.to_csv('MYFILEnew1.csv', sep='\t', index=False)
So, if your data.csv file looks like this:
Date,Open,Hihh,mLow,Close,Volume
1,5,9,13,17,21
2,6,10,14,18,22
3,7,11,15,19,23
4,8,12,16,20,24
Then the MYFILEnew1.csv would look like this after running the script above:
Date Volume
1 21
2 22
3 23
4 24
EDIT
Using your data (tab separated, stored in the file data3.csv):
Date Open Hihh mLow Close Volume
17-Feb-16 210 212.97 209.1 212.74 1237731
Then
import pandas as pd
df = pd.read_csv('data3.csv', sep='\t')
# select desired columns
df = df[['Date', 'Volume']]
# write to the file (tab separated)
df.to_csv('MYFILEnew1.csv', sep='\t', index=False)
gives the desired output
Date Volume
17-Feb-16 1237731
EDIT2
Since your header in your input csv file seems to be messed up (as discussed in the comments), you have to rename the first column. The following now works fine for me using your entire dataset:
import pandas as pd
df = pd.read_csv('lmt.csv', sep=',')
# get rid of the wrongly formatted column name
df.rename(columns={df.columns[0]: 'Date' }, inplace=True)
# select desired columns
df = df[['Date', 'Volume']]
# write to the file (tab separated)
df.to_csv('MYFILEnew1.csv', sep='\t', index=False)
Here I would suggest using the csv module's csv.DictReader object to read and write from the files. To read the file, you would do something like
import csv
fieldnames=('Date', 'Open', 'High', 'mLow', 'Close', 'Volume')
with open('myfilename.csv') as f:
    reader = csv.DictReader(f, fieldnames=fieldnames)
Beyond this, you will just need to filter out the keys you don't want from each row and similarly use the csv.DictWriter class to write to your export file.
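The filtering and writing step described above might look like the following sketch. It uses in-memory text buffers as stand-ins for the real files, takes the column names from the question's example, and lets DictReader pick up the keys from the header row instead of passing fieldnames explicitly; all of that is an assumption for illustration.

```python
import csv
import io

# Stand-in for the input file; the column names follow the question's example
source = io.StringIO(
    "Date,Open,High,Low,Close,Volume\n"
    "17-Feb-16,210,212.97,209.1,212.74,1237731\n"
)
target = io.StringIO()

# Without an explicit fieldnames argument, DictReader takes the keys
# from the header row
reader = csv.DictReader(source)
# extrasaction="ignore" makes DictWriter skip keys not listed in fieldnames
writer = csv.DictWriter(target, fieldnames=["Date", "Volume"],
                        extrasaction="ignore", lineterminator="\n")
writer.writeheader()
for row in reader:
    writer.writerow(row)

print(target.getvalue())
```

With real files, open both with newline="" and pass the file objects in place of the buffers.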
You were so close:
import csv
f = csv.reader(open('lmt.csv','rb')) # csv is binary
Date, Open, Hihh, mLow, Close, Volume = zip(*f)
rows = zip(Date, Volume)
ofile = open("MYFILEnew1.csv", "wb")
writer = csv.writer(ofile)
for row in rows:
    writer.writerow(row) # row is already a tuple so no need to make it a list
ofile.close()
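The code above targets Python 2, where csv files are opened in binary mode. In Python 3 the csv module works on text files opened with newline="", so a port could look like this sketch. The small input file is generated here only so that the example runs on its own, and the variable names are lower-cased to avoid shadowing the built-in open.

```python
import csv
import os
import tempfile

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "lmt.csv")
out_path = os.path.join(workdir, "MYFILEnew1.csv")

# Create a small stand-in for lmt.csv
with open(in_path, "w", newline="") as f:
    f.write("Date,Open,High,Low,Close,Volume\n"
            "17-Feb-16,210,212.97,209.1,212.74,1237731\n")

# Transpose the rows into six column tuples
with open(in_path, "r", newline="") as infile:
    date, open_, high, low, close, volume = zip(*csv.reader(infile))

# Pair Date with Volume again and write the two-column file
with open(out_path, "w", newline="") as outfile:
    csv.writer(outfile).writerows(zip(date, volume))

with open(out_path, newline="") as f:
    result = list(csv.reader(f))
print(result)
```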
I have a function set up for Pandas that runs through a large number of rows in input.csv and inputs the results into a Series. It then writes the Series to output.csv.
However, if the process is interrupted (for example by an unexpected event) the program will terminate and all data that would have gone into the csv is lost.
Is there a way to write the data continuously to the csv, regardless of whether the function finishes for all rows?
Preferably, each time the program starts, a blank output.csv is created, which is then appended to while the function is running.
import pandas as pd
df = pd.read_csv("read.csv")
def crawl(a):
    #Create x, y
    return pd.Series([x, y])
df[["Column X", "Column Y"]] = df["Column A"].apply(crawl)
df.to_csv("write.csv", index=False)
This is a possible solution that will append the data to a new file as it reads the csv in chunks. If the process is interrupted the new file will contain all the information up until the interruption.
import pandas as pd
#csv file to be read in
in_csv = '/path/to/read/file.csv'
#csv to write data to
out_csv = 'path/to/write/file.csv'
#get the number of lines of the csv file to be read
number_lines = sum(1 for row in (open(in_csv)))
#size of chunks of data to write to the csv
chunksize = 10
#start looping through data writing it to a new file for each chunk
for i in range(1,number_lines,chunksize):
    df = pd.read_csv(in_csv,
                     header=None,
                     nrows = chunksize,#number of rows to read at each loop
                     skiprows = i)#skip rows that have been read
    df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a',#append data to csv file
              chunksize=chunksize)#size of data to append for each loop
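A variation on the same idea, offered as a sketch rather than a replacement: when read_csv is given a chunksize, it returns an iterator of DataFrames, so the file does not have to be re-read with skiprows on every loop. The 25-row input file and the temporary paths are made up so that the example runs on its own.

```python
import os
import tempfile

import pandas as pd

workdir = tempfile.mkdtemp()
in_csv = os.path.join(workdir, "read.csv")
out_csv = os.path.join(workdir, "write.csv")

# Stand-in input with 25 rows
pd.DataFrame({"Column A": range(25)}).to_csv(in_csv, index=False)

# With chunksize set, read_csv yields one DataFrame per chunk
for i, chunk in enumerate(pd.read_csv(in_csv, chunksize=10)):
    chunk.to_csv(out_csv,
                 index=False,
                 mode="w" if i == 0 else "a",  # start a fresh file, then append
                 header=(i == 0))              # write the header only once

result = pd.read_csv(out_csv)
print(len(result))
```

If the process stops partway through, out_csv still holds every chunk that was written before the interruption.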
In the end, this is what I came up with. Thanks for helping out!
import pandas as pd
df1 = pd.read_csv("read.csv")
run = 0
def crawl(a):
    global run
    run = run + 1
    #Create x, y
    df2 = pd.DataFrame([[x, y]], columns=["X", "Y"])
    if run == 1:
        df2.to_csv("output.csv")
    if run != 1:
        df2.to_csv("output.csv", header=None, mode="a")
df1["Column A"].apply(crawl)
I would suggest this:
with open("write.csv","a") as f:
    df.to_csv(f,header=False,index=False)
The argument "a" appends the new df to the existing file, and the file is closed when the with block finishes, so all of your intermediate results are kept.
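Building on this, one possible sketch keeps a single file handle open, writes each row as soon as it is computed, and flushes after every row. The crawl function below is only a placeholder for the real per-row work, and the paths and sample values are invented for the example.

```python
import os
import tempfile

import pandas as pd

out_path = os.path.join(tempfile.mkdtemp(), "write.csv")

def crawl(a):
    # Placeholder for the real, possibly slow, per-row work
    return a * 2, a + 1

df = pd.DataFrame({"Column A": [1, 2, 3]})

# Opening with "w" creates a blank file at every program start
with open(out_path, "w", newline="") as f:
    for i, a in enumerate(df["Column A"]):
        x, y = crawl(a)
        row = pd.DataFrame([[a, x, y]],
                           columns=["Column A", "Column X", "Column Y"])
        # Header only for the first row, then data rows only
        row.to_csv(f, header=(i == 0), index=False)
        f.flush()  # hand the finished row to the operating system right away

result = pd.read_csv(out_path)
print(result)
```

Flushing after every row trades some speed for the guarantee that finished rows have left the Python buffer before the next row is processed.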
I found a solution to a similar problem by looping over the dataframe with iterrows() and saving each row to the csv file, which in your case could be something like this:
for ix, row in df.iterrows():
    row['Column A'] = crawl(row['Column A'])
    # write the updated row itself as a one-row dataframe
    # if you wish to maintain the header (assumes the default 0-based index)
    if ix == 0:
        row.to_frame().T.to_csv('output.csv', mode='a', index=False, sep=',', encoding='utf-8')
    else:
        row.to_frame().T.to_csv('output.csv', mode='a', index=False, sep=',', encoding='utf-8', header=False)