Add rows to a CSV file without creating an intermediate copy - python

How can I add rows to a CSV file by editing it in place? I want to avoid the pattern of writing to a temp file and then replacing the original (pseudocode):
add_records_to_csv(newdata, infile, tmpfile)
delete(infile)
rename(tmpfile, infile)
Here's the actual function. The lines "# <--" are what I want to get rid of and/or condense into something more straightforward:
def add_records_to_csv(dic, csvfile):
    """ Append a dictionary to a CSV file.
    Adapted from http://pymotw.com/2/csv/
    """
    f_old = open(csvfile, 'rb')                        # <--
    csv_old = csv.DictReader(f_old)                    # <--
    fpath, fname = os.path.split(csvfile)              # <--
    csvfile_new = os.path.join(fpath, 'new_' + fname)  # <--
    print(csvfile_new)                                 # <--
    f = open(csvfile_new, 'wb')                        # <--
    try:
        fieldnames = sorted(set(dic.keys() + csv_old.fieldnames))
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        headers = dict((n, n) for n in fieldnames)
        writer.writerow(headers)
        for row in csv_old:
            writer.writerow(row)
        writer.writerow(dic)
    finally:
        f_old.close()
        f.close()
    return csvfile_new

This is not going to be possible in general. Here is the reason, from your code:
fieldnames = sorted(set(dic.keys() + csv_old.fieldnames))
To me, this says that at least in some cases your new row contains columns that were not in the previous rows. When you add a row like this, you will have to update the header of the file (the first line), in addition to appending new rows at the end. If you need to have the column names in alphabetized order, then you may have to rearrange the fields in all the other rows in order to retain the ordering of the columns.
Because you may need to edit the first line of the file, in addition to appending new lines at the end and possibly editing all the lines in-between, there isn't a reasonable way to make this work in-place.
My suggestion is to try and figure out, ahead of time, all the fields/columns that you may need to include so that you guarantee your program will never have to edit the header and can simply add new rows.
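For example, here is a minimal sketch of that idea, assuming you can enumerate the full set of columns up front (ALL_FIELDNAMES and the field names are hypothetical):
import csv

# Hypothetical superset of every column the program may ever write.
ALL_FIELDNAMES = ["id", "name", "email", "notes"]

def append_row(dic, csvfile):
    # Assumes csvfile already exists and its header row is ALL_FIELDNAMES.
    with open(csvfile, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=ALL_FIELDNAMES, restval="")
        writer.writerow(dic)  # keys missing from dic are filled with restval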

If your new row has the same structure as the existing records, the following will work:
import csv

def append_record_to_csv(dic, csvfile):
    with open(csvfile, 'rb') as f:
        # discover order of field names in header row
        fieldnames = next(csv.reader(f))
    with open(csvfile, 'ab') as f:
        # assumes that dic contains only fieldnames in csv file
        dwriter = csv.DictWriter(f, fieldnames=fieldnames)
        dwriter.writerow(dic)
On the other hand, if your new row has a different structure than the existing rows, a CSV file is probably the wrong file format. In order to add a new column to a CSV file, every row needs to be edited. The performance of this approach is very bad and will be quite noticeable with a large CSV file.
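For completeness, if a new column does become necessary, the full rewrite looks something like this (a sketch, not part of the original answer; add_column and its parameters are illustrative):
import csv

def add_column(csvfile, newfile, colname, default=""):
    # Every row must be rewritten, which is why this is slow on large files.
    with open(csvfile, newline="") as fin, \
         open(newfile, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames + [colname],
                                restval=default)
        writer.writeheader()
        writer.writerows(reader)  # rows lack colname, so restval fills it in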

Related

large file delete rows python

Need some help with a use case. I have two files: one is about 9GB (test_data) and the other 42MB (master_data). test_data contains data with several columns; one of the columns, i.e. #7, contains email addresses. master_data is my master data file, which has just one column, the email address only.
What I am trying to achieve is to compare the emails in the master_data file with the emails in test_data; if they match, the whole row is to be deleted. I need an efficient way to achieve this.
The piece of code below is written to achieve that, but I am stuck at deleting the lines from the test_data file, and I am not sure if this is an efficient way of achieving this requirement.
import csv
import time

# open the file in read mode
filename = open('master_data.csv', 'r')

# creating dictreader object
file = csv.DictReader(filename)

# creating empty lists
email = []

# iterating over each row and append
# values to empty list
for col in file:
    email.append(col['EMAIL'])

# printing lists
print('Email:', email)

datafile = open('test_data.csv', 'r+')
for line in datafile:
    #print(line)
    # str1,id=line.split(',')
    split_line = line.split(',')
    str1 = split_line[7]  # Whatever columns
    id1 = split_line[0]
    for w in email:
        print(w)
        print(str1)
        #time.sleep(2.4)
        if w in str1:
            print(id1)
            datafile.remove(id1)  # <-- this fails: file objects have no remove() method
Removing lines from a file is difficult. It's a lot easier to write a new file, filtering out rows as you go. Put your existing emails in a set for easy lookup, write to a temporary file, and rename when done. This also has the advantage that you don't lose data if something goes wrong along the way.
You'll need to "normalize" the emails. Most email systems are case-insensitive, and some (like Gmail) ignore periods in addresses. Addresses can also contain extra name information, as in John Doe <j.doe@Gmail.com>. Write a function that puts addresses into a single form and use it for both of your files.
import csv
import os
import email.utils

def email_normalize(val):
    # discard optional full name; lower case; remove '.' in local name
    _, addr = email.utils.parseaddr(val)
    local, domain = addr.lower().split('@', 1)
    local = local.replace('.', '')
    return f'{local}@{domain}'

# create set of user emails to remove
with open('master_data.csv', newline='') as file:
    emails = set(email_normalize(row[0]) for row in csv.reader(file))

with open('test_data.csv', newline='') as infile, \
     open('test_data.csv.tmp', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    writer.writerow(next(reader))  # write header
    writer.writerows(row for row in reader
                     if email_normalize(row[7]) not in emails)  # email is column #7
del reader, writer

os.rename('test_data.csv', 'test_data.csv.deleteme')
os.rename('test_data.csv.tmp', 'test_data.csv')
os.remove('test_data.csv.deleteme')
You could load the master file and store the emails in a dict, then as you iterate the rows of test, you can check if the email for a row is in that (master) dict.
Given these CSVs:
test.csv:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
1,,,,,,foo@mail.com
2,,,,,,bar@mail.com
3,,,,,,baz@mail.com
4,,,,,,dog@mail.com
5,,,,,,foo@mail.com
master.csv:
Col1
foo@mail.com
cat@mail.com
dog@mail.com
When I run:
import csv

emails: dict[str, None] = {}
with open("master.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    for row in reader:
        emails[row[0]] = None

out_line = "{:<20} {:>8}"
with open("test.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    print(out_line.format("Email", "Test ID"))
    for row in reader:
        if row[6] in emails:
            print(out_line.format(row[6], row[0]))
I get:
Email                 Test ID
foo@mail.com                1
dog@mail.com                4
foo@mail.com                5
That proves that you can read the emails in from master and compare with them while reading from test.
As others have pointed out, actually deleting anything from a file is difficult; it's far easier to create a new file and just exclude (filter out) the things you don't want:
f_in = open("test.csv", newline="")
reader = csv.reader(f_in)

f_out = open("output.csv", "w", newline="")
writer = csv.writer(f_out)

for row in reader:
    if row[6] in emails:
        continue
    writer.writerow(row)

f_in.close()
f_out.close()
Iterating your CSV reader and writing-out with your CSV writer is a very efficient means of transforming a CSV (test.csv → output.csv, in this case): you only need the memory to hold row in each step of the loop.
When I run that, after having populated the emails dict like before, my output.csv looks like:
Col1,Col2,Col3,Col4,Col5,Col6,Col7
2,,,,,,bar@mail.com
3,,,,,,baz@mail.com
For the real-world performance of your situation, I mocked up a 42MB CSV file for master: 1.35M rows of 32-character-long hex strings. Reading those 1.35M unique strings and saving them in the dict took less than 1 second of real time and used 176 MB of RAM (on my M1 MacBook Air with dual-channel SSD).
Also, I recommend using the csv module every time you need to read or write a CSV. No matter how simple the CSV looks, the csv readers/writers will be 100% correct, and there is almost zero overhead compared to manually splitting or joining on a comma.
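For example, a naive split breaks as soon as a field contains a quoted comma, which the csv module handles correctly (a quick illustration with made-up data):
import csv
import io

line = '1,"Doe, John",j.doe@mail.com'
print(line.split(","))
# ['1', '"Doe', ' John"', 'j.doe@mail.com']  <- wrong, one field split in two
print(next(csv.reader(io.StringIO(line))))
# ['1', 'Doe, John', 'j.doe@mail.com']       <- correct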

Removing the end of line character from a read csv file

I tried several times to use strip() but I can't get it to work. I removed that piece from this snippet, but every time I tried it I had an error or it did nothing. The sort is fine; I just want to strip the newline before writing to the new file.
import sys, csv, operator

data = csv.reader(open('tickets.csv'), delimiter=',')
sortedlist = sorted(data, key=operator.itemgetter(6))
# 0 specifies according to first column we want to sort

#now write the sort result into new CSV file
with open("newfiles.csv", "w") as f:
    #writablefile = csv.writer(f)
    fileWriter = csv.writer(f, delimiter=',')
    for row in sortedlist:
        #print(row)
        lst = (row)
        fileWriter.writerow(lst)
You need to add newline='' to your open() when writing a CSV file. This is explained in the documentation. Without it, your file can end up with an extra blank line after each row.
import sys, csv, operator

data = csv.reader(open('tickets.csv'), delimiter=',')
header = next(data)
sortedlist = sorted(data, key=operator.itemgetter(6))

#now write the sort result into a new CSV file
with open("newfiles.csv", "w", newline="") as f:
    fileWriter = csv.writer(f)
    fileWriter.writerow(header)  # keep the header at the top
    fileWriter.writerows(sortedlist)
Also, you need to read in the header row first, before loading everything for sorting. This avoids the header being sorted in with the data; it can then be written out separately when producing your sorted output CSV.
If your tickets.csv file contains blank lines, you would need to also remove these. For example:
for row in sortedlist:
    if row:
        fileWriter.writerow(row)

Reading a CSV and writing the exact file back with one column changed

I'm looking to read a CSV file and make some calculations to the data in there (that part I have done).
After that I need to write all the data into a new file exactly the way it is in the original, with the exception of one column, which will be changed to the result of the calculations.
I can't show the actual code (confidentiality issues) but here is an example of the code.
import csv

headings = ["losts", "of", "headings"]

with open("read_file.csv", mode="r") as read,\
     open("write.csv", mode='w') as write:
    reader = csv.DictReader(read, delimiter=',')
    writer = csv.DictWriter(write, fieldnames=headings)
    writer.writeheader()
    for row in reader:
        writer.writerows()
At this stage I am just trying to return the same CSV in "write" as I have in "read".
I haven't used CSV much, so I'm not sure if I'm going about it the wrong way; I also understand this example is super simple, but I can't seem to get my head around the logic of it.
You're really close!
import csv

headings = ["losts", "of", "headings"]

with open("read_file.csv", mode="r") as read, \
     open("write.csv", mode='w', newline='') as write:
    reader = csv.DictReader(read, delimiter=',')
    writer = csv.DictWriter(write, fieldnames=headings)
    writer.writeheader()
    for row in reader:
        # do processing here to change the values in that one column
        processed_row = some_function(row)
        writer.writerow(processed_row)
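Here some_function is just a placeholder; a hypothetical version that rewrites one column might look like:
def some_function(row):
    # Hypothetical calculation: overwrite the "of" column (one of the
    # example headings) with a computed value.
    row["of"] = row["of"].upper()
    return row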
Why don't you use pandas?
import pandas as pd

csv_file = pd.read_csv('read_file.csv')
# do_something_and_add_a_column_here()
csv_file.to_csv('write_file.csv', index=False)  # index=False avoids adding an extra index column

Code swap. How would I swap the value of one CSV file column to another?

I have two CSV files. The first file (state_abbreviations.csv) has only state abbreviations and their full state names side by side; the second file (test.csv) has the state abbreviations with additional info.
I want to replace each state abbreviation in test.csv with its associated full state name from the first file.
My approach was to read each file and build a dict from the first file (state_abbreviations.csv), then read the second file (test.csv) and, if an abbreviation matches the first file, replace it with the full name.
Any help is appreciated.
import csv

state_initials = ("state_abbr")
state_names = ("state_name")

state_file = open("state_abbreviations.csv", "r")
state_reader = csv.reader(state_file)

headers = None
final_state_initial = []
for row in state_reader:
    if not headers:
        headers = []
        for i, col in enumerate(row):
            if col in state_initials:
                headers.append(i)
    else:
        final_state_initial.append((row[0]))
print(final_state_initial)

headers = None
final_state_abbre = []
for row in state_reader:
    if not headers:
        headers = []
        for i, col in enumerate(row):
            if col in state_initials:
                headers.append(i)
    else:
        final_state_abbre.append((row[1]))
print(final_state_abbre)

final_state_initial
final_state_abbre
state_dictionary = dict(zip(final_state_initial, final_state_abbre))
print(state_dictionary)
You almost got it (the approach, that is): building a dict out of the abbreviations is the easiest way to do this:
import csv

with open("state_abbreviations.csv", "r") as f:
    # you can use csv.DictReader() instead, but let's strive for performance
    reader = csv.reader(f)
    next(reader)  # skip the header
    # assuming the first column holds the abbreviation, the second the full state name
    state_map = {state[0]: state[1] for state in reader}
Now you have state_map containing a map of all your state abbreviations; for example, state_map["FL"] contains "Florida".
To replace the values in your test.csv, though, you'll either have to load the whole file into memory, parse it, do the replacement and save it, or create a temporary file, stream-write the changes to it, and then overwrite the original file with the temporary file. Assuming that test.csv is not too big to fit into memory, the first approach is much simpler:
with open("test.csv", "r+U") as f: # open the file in read-write mode
# again, you can use csv.DictReader() for convenience, but this is significantly faster
reader = csv.reader(f)
header = next(reader) # get the header
rows = [] # hold our rows
if "state" in header: # proceed only if `state` column is found in the header
state_index = header.index("state") # find the state column index
for row in reader: # read the CSV row by row
current_state = row[state_index] # get the abbreviated state value
# replace the abbreviation if it exists in our state_map
row[state_index] = state_map.get(current_state, current_state)
rows.append(row) # append the processed row to our `rows` list
# now lets overwrite the file with updated data
f.seek(0) # seek to the file begining
f.truncate() # truncate the rest of the content
writer = csv.writer(f) # create a CSV writer
writer.writerow(header) # write back the header
writer.writerows(rows) # write our modified rows
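The second approach mentioned above, streaming to a temporary file and swapping it in, would be a sketch along these lines (assuming the same state_map and a "state" column):
import csv
import os

with open("test.csv", newline="") as fin, \
     open("test.csv.tmp", "w", newline="") as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    header = next(reader)
    writer.writerow(header)
    state_index = header.index("state")  # assumes the column exists
    for row in reader:
        row[state_index] = state_map.get(row[state_index], row[state_index])
        writer.writerow(row)
os.replace("test.csv.tmp", "test.csv")  # atomically swap in the new file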
It seems like you are trying to go through the file twice? This is absolutely not necessary: the first time you go through you are already reading all the lines, so you can then create your dictionary items directly.
In addition, comprehensions can be very useful when creating lists or dictionaries. In this case it might be a bit less readable, though. The alternative would be to create an empty dictionary, start a "real" for-loop and add all the key:value pairs manually (i.e., with state_dict[row[abbr]] = row[name]).
Finally, I used the with statement when opening the file to ensure it is safely closed when we're done with it. This is good practice when opening files.
import csv

with open("state_abbreviations.csv") as state_file:
    state_reader = csv.DictReader(state_file)
    state_dict = {row['state_abbr']: row['state_name'] for row in state_reader}
print(state_dict)
Edit: note that, like the code you showed, this only creates the dictionary that maps abbreviations to state names. Actually replacing them in the second file would be the next step.
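A DictReader/DictWriter version of that next step might look like this (a sketch, assuming test.csv has a 'state' column holding the abbreviation):
import csv

with open('test.csv', newline='') as fin, \
     open('test_output.csv', 'w', newline='') as fout:
    reader = csv.DictReader(fin)
    writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # swap the abbreviation for the full name, if we know it
        row['state'] = state_dict.get(row['state'], row['state'])
        writer.writerow(row)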
Step 1: have Python remember the abbreviation-to-full-name mapping; we use a dictionary for that:
import csv

with open('state_abbreviations.csv', 'r') as f:
    csvreader = csv.reader(f)
    next(csvreader)  # skip the header
    abs = {r[0]: r[1] for r in csvreader}  # note: the name 'abs' shadows the built-in abs()
Step 2: replace the abbreviations with full names and write to an output file; I used "test_output.csv":
with open('test.csv', 'r') as reading:
    csvreader = csv.reader(reading)
    next(csvreader)  # skip the header
    header = ['name', 'gender', 'birthdate', 'address', 'city', 'state']
    with open('test_output.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for a in csvreader:
            writer.writerow([a[0], a[1], a[2], a[3], a[4], abs[a[5]]])

Python- Import Multiple Files to a single .csv file

I have 125 data files containing two columns and 21 rows of data and I'd like to import them into a single .csv file (as 125 pairs of columns and only 21 rows).
I am fairly new to python but I have come up with the following code:
import glob

Results = glob.glob('./*.data')
fout = 'c:/Results/res.csv'
fout = open("res.csv", 'w')
for file in Results:
    g = open(file, "r")
    fout.write(g.read())
    g.close()
fout.close()
The problem with the above code is that all the data are copied into only two columns with 125*21 rows.
Any help is very much appreciated!
This should work:
import glob

files = [open(f) for f in glob.glob('./*.data')]  # make list of open files
fout = open("res.csv", 'w')

for row in range(21):
    for f in files:
        fout.write(f.readline().strip())  # strip removes trailing newline
        fout.write(',')
    fout.write('\n')

fout.close()
Note that this method will probably fail if you try a large number of files; the limit on simultaneously open files is an operating-system limit (often 256 on macOS, 1024 on Linux), not something Python controls.
You may want to try the Python csv module (http://docs.python.org/library/csv.html), which provides very useful methods for reading and writing CSV files. Since you stated that you want only 21 rows with 250 columns of data, I would suggest creating 21 Python lists as your rows and then appending data to each row as you loop through your files.
something like:
import csv

rows = []
for i in range(0, 21):
    row = []
    rows.append(row)

# Not sure of the structure of your input files or how they are delimited,
# but for each one, as you have it open and iterate through its rows, you
# would want to append the values in each row to the end of the
# corresponding list contained within the rows list.

# Then, write each row to the new csv:
writer = csv.writer(open('output.csv', 'wb'), delimiter=',')
for row in rows:
    writer.writerow(row)
(Sorry, I cannot add comments, yet.)
[Edited later: the following statement is wrong!] "The generating of the rows in davesnitty's loop can be replaced by rows = [[]] * 21." It is wrong because this creates a list whose 21 elements are all references to one and the same empty list, shared by every position of the outer list.
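The pitfall is easy to demonstrate:
rows = [[]] * 21                 # 21 references to the SAME list object
rows[0].append("x")
print(rows[1])                   # ['x'] -- every "row" changed

rows = [[] for _ in range(21)]   # 21 independent lists
rows[0].append("x")
print(rows[1])                   # []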
My +1 to using the standard csv module. But the files should always be closed, especially when you open that many of them. Also, there is a bug: the rows read from the files are never stored anywhere, so that part of the solution is actually missing. Basically, the row read from each file should be appended to the sublist for its line number; the line number can be obtained via enumerate(reader), where reader is csv.reader(fin, ...).
[Added later] Try the following code; fix the paths for your purpose:
import csv
import glob
import os

datapath = './data'
resultpath = './result'
if not os.path.isdir(resultpath):
    os.makedirs(resultpath)

# Initialize the empty rows. It does not check how many rows are
# in the file.
rows = []

# Read data from the files to the above matrix.
for fname in glob.glob(os.path.join(datapath, '*.data')):
    with open(fname, 'rb') as f:
        reader = csv.reader(f)
        for n, row in enumerate(reader):
            if len(rows) < n+1:
                rows.append([])      # add another row
            rows[n].extend(row)      # append the elements from the file

# Write the data from memory to the result file.
fname = os.path.join(resultpath, 'result.csv')
with open(fname, 'wb') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
