Fixing a CSV using Python

I am trying to clean up the formatting of a CSV file in order to import it into a database, and I'm using the following to edit it:
f1 = open('visit_summary.csv', 'r')
f2 = open('clinics.csv', 'w')
for line in f1:
    f2.write(line.replace('Calendar: ', ''))
f1.close()
f2.close()
This works fine if there is only one edit to make; however, I have to repeat this code 19 times to make all the required changes, opening and closing each file several times and using multiple placeholder files for the intermediate steps between the first and last edit. Is there a simpler way to do this? I tried adding more "f2.write(line.replace..." lines, but this creates a final file with duplicated lines, each of which has only one edit applied. I think I see my problem (I am writing each line multiple times, once per edit), but I cannot seem to find a solution. I am very new to Python and am teaching myself, so any help, or direction to better resources, would be appreciated.

There's no reason you can't do lots of things to the line before you write it:
with open('visit_summary.csv', 'r') as f1, open('clinics.csv', 'w') as f2:
    for line in f1:
        line = line.replace('Calendar: ', '')
        line = line.replace('Something else', '')
        f2.write(line)
(I also replaced the explicit open/close calls with a with statement.)
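Since there are 19 edits to make, it may also help to keep all of the replacement pairs in one list and loop over it, so adding a 20th edit is a one-line change. A minimal sketch; the (old, new) pairs here are placeholders, not the real edits:

replacements = [
    ('Calendar: ', ''),
    ('Something else', ''),
    # ... add the remaining edits here
]

with open('visit_summary.csv', 'r') as f1, open('clinics.csv', 'w') as f2:
    for line in f1:
        # apply every edit to the line before writing it once
        for old, new in replacements:
            line = line.replace(old, new)
        f2.write(line)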

f1 = open('visit_summary.csv', 'r')
f2 = open('clinics.csv', 'w')
for line in f1:
    f2.write(line.replace('Calendar: ', '').replace('String2', '').replace('String3', ''))
f1.close()
f2.close()
Will this work? I don't think it's very "Pythonic", though. In this case you have to be careful about the ordering!
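To illustrate why the ordering matters: when one replacement's output can be matched by a later pattern, the later pattern will rewrite it again. A tiny made-up example, not from the original data:

s = 'abc'
print(s.replace('a', 'b').replace('b', 'c'))  # 'ccc' - the newly written 'b' is replaced too
print(s.replace('b', 'c').replace('a', 'b'))  # 'bcc'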

import csv

# Read the input with DictReader, clean up the fields of interest, then
# write the cleaned rows back out with csv.writer. 'calendar' and 'string'
# are placeholder column names; use the real header names from your file.
output = []
with open('visit_summary.csv', 'r') as infile:
    reader = csv.DictReader(infile)
    for line in reader:
        row = {}
        row['calendar'] = line['calendar'].replace('Calendar: ', '')
        row['string'] = line['string'].replace('string', '')
        output.append(row)

with open('clinics.csv', 'w') as outfile:
    fileWriter = csv.writer(outfile, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    header = ['calendar', 'string']
    fileWriter.writerow(header)
    for x in output:
        fileWriter.writerow([x['calendar'], x['string']])

Related

Comparing contents of two .txt files for deleted lines or changes in Python

I'm trying to compare two .txt files for changes or deleted lines. If a line is deleted, I want to output what the deleted line was, and if it was changed, I want to output the new line. I originally tried comparing line to line, but when something was deleted it wouldn't work for my purpose:
for line1 in f1:
    for line2 in f2:
        if line1 == line2:
            print("SAME", file=x)
        else:
            print(f"Original:{line1} / New:{line2}", file=x)
Then I tried not comparing line to line, so I could figure out if something was deleted, but I'm not getting any output:
def check_diff(f1, f2):
    check = {}
    for file in [f1, f2]:
        with open(file, 'r') as f:
            check[file] = []
            for line in f:
                check[file].append(line)
    diff = set(check[f1]) - set(check[f2])
    for line in diff:
        print(line.rstrip(), file=x)
I got this far by combining answers from several similar questions, but I'm new to Python, so I need a little extra help. Thanks! Please let me know if I need to add any additional information.
The concept is simple. Let's say file1.txt is the original file, and file2.txt is the one we need to check for changes:
with open('file1.txt', 'r') as f:
    f1 = f.readlines()
with open('file2.txt', 'r') as f:
    f2 = f.readlines()

deleted = []
added = []
for l in f1:
    if l not in f2:
        deleted.append(l)
for l in f2:
    if l not in f1:
        added.append(l)

print('Deleted lines:')
print(''.join(deleted))
print('Added lines:')
print(''.join(added))
For every line in the original file, if that line isn't in the other file, that means the line has been deleted. If it's the other way around, that means the line has been added.
I am not sure how you would quantify a changed line (since you could count it as one deleted plus one added line), but perhaps something like the code below would be of some aid. Note that if your files are large, it might be faster to store the data in a set instead of a list, since the former typically has an average lookup time of O(1), while the latter has O(n):
with open('file1.txt', 'r') as f1, open('file2.txt', 'r') as f2:
    file1 = set(f1.read().splitlines())
    file2 = set(f2.read().splitlines())

changed_lines = [line for line in file1 if line not in file2]
deleted_lines = [line for line in file2 if line not in file1]

print('Changed lines:\n' + '\n'.join(changed_lines))
print('Deleted lines:\n' + '\n'.join(deleted_lines))
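If you also want the differences reported in their original order, the standard library's difflib can do the comparison in one pass; a small sketch using the same file names:

import difflib

with open('file1.txt') as f1, open('file2.txt') as f2:
    # unified_diff prefixes removed lines with '-' and added lines with '+'
    diff = difflib.unified_diff(f1.readlines(), f2.readlines(),
                                fromfile='file1.txt', tofile='file2.txt')
print(''.join(diff))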

Save each line as separate .txt file using Notepad++

I am using Notepad++ to restructure some data. Each .txt file has 99 lines. I am trying to run a python script to create 99 single-line files.
Here is the .py script I am currently running, which I found in a previous thread on the topic. I'm not sure why, but it isn't quite doing the job:
yourfile = open('filename.TXT', 'r')
counter = 0
magic = yourfile.readlines()
for i in magic:
    counter += 1
    newfile = open(('filename_' + str(counter) + '.TXT'), "w")
    newfile.write(i)
    newfile.close()
When I run this particular script, it simply creates a copy of the host file, and it still has 99 lines.
You may want to change the structure of your script a bit:
with open('filename.txt', 'r') as f:
    for i, line in enumerate(f):
        with open('filename_{}.txt'.format(i), 'w') as wf:
            wf.write(line)
In this form you get the benefit of context managers closing your file handles for you, and you don't have to read everything in separately, so there is a better logical flow.
You can use the following piece of code to achieve that. It's commented, but feel free to ask.
# reading info from infile with 99 lines
infile = 'filename.txt'

# using a context manager to open infile and read its lines
with open(infile, 'r') as f:
    lines = f.readlines()

# initializing counter
counter = 0

# for each line, create a new file and write the line to it
for line in lines:
    # define outfile name
    outfile = 'filename_' + str(counter) + '.txt'
    # create outfile and write line
    with open(outfile, 'w') as g:
        g.write(line)
    # add 1 to counter
    counter += 1
magic = yourfile.readlines(99)
Please try removing the '99', like this:
magic = yourfile.readlines()
The argument to readlines() is a size hint in bytes, not a number of lines, so readlines(99) only pulls in roughly the first 99 bytes' worth of lines. I tried it without the argument and got 99 files, each containing a single line.
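For what it's worth, a small variant of the same idea, assuming you want the numbering to start at 1 and zero-padded file names (filename_01.txt, filename_02.txt, ...); it uses pathlib, so Python 3 only:

from pathlib import Path

with open('filename.txt', 'r') as f:
    for i, line in enumerate(f, start=1):
        # each output file gets exactly one line from the input
        Path('filename_{:02d}.txt'.format(i)).write_text(line)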

How do I concatenate multiple CSV files row-wise using python?

I have a dataset of about 10 CSV files. I want to combine those files row-wise into a single CSV file.
What I tried:
import csv

fout = open("claaassA.csv", "a")
# first file:
writer = csv.writer(fout)
for line in open("a01.ihr.60.ann.csv"):
    print line
    writer.writerow(line)
# now the rest:
for num in range(2, 10):
    print num
    f = open("a0" + str(num) + ".ihr.60.ann.csv")
    # f.next()  # skip the header
    for line in f:
        print line
        writer.writerow(line)
    # f.close()  # not really needed
fout.close()
You definitely need to give more details in the question (ideally examples of the inputs and expected output).
Given the little information provided, I will assume that you know that all files are valid CSV and that they all have the same number of lines (rows). I'll also assume that memory is not a concern (i.e. they are "small" files that fit together in memory). Furthermore, I assume that line endings are newlines (\n).
If all these assumptions are valid, then you can do something like this:
input_files = ['file1.csv', 'file2.csv', 'file3.csv']
output_file = 'output.csv'

output = None
for infile in input_files:
    with open(infile, 'r') as fh:
        if output:
            for i, l in enumerate(fh.readlines()):
                output[i] = "{},{}".format(output[i].rstrip('\n'), l)
        else:
            output = fh.readlines()

with open(output_file, 'w') as fh:
    for line in output:
        fh.write(line)
There are probably more efficient ways, but this is a quick and dirty way to achieve what I think you are asking for.
The previous answer implicitly assumes we need to do this in python. If bash is an option then you could use the paste command. For example:
paste -d, file1.csv file2.csv file3.csv > output.csv
I don't fully understand why you use the csv library. Actually, it's enough to fill the output file with the lines from the given files (if they all have the same column names and order).
input_path_list = [
    "a01.ihr.60.ann.csv",
    "a02.ihr.60.ann.csv",
    "a03.ihr.60.ann.csv",
    "a04.ihr.60.ann.csv",
    "a05.ihr.60.ann.csv",
    "a06.ihr.60.ann.csv",
    "a07.ihr.60.ann.csv",
    "a08.ihr.60.ann.csv",
    "a09.ihr.60.ann.csv",
]
output_path = "claaassA.csv"

with open(output_path, "w") as fout:
    header_written = False
    for input_path in input_path_list:
        with open(input_path) as fin:
            header = fin.next()
            # write the header once at the beginning and skip the other headers
            if not header_written:
                fout.write(header)
                header_written = True
            # copy all remaining rows
            for line in fin:
                fout.write(line)
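If pandas is available, a hedged alternative that reads each file, stacks the rows, and writes a single CSV (assuming every input has the same header):

import pandas as pd

paths = ["a0{}.ihr.60.ann.csv".format(n) for n in range(1, 10)]
# read every file, concatenate the rows, and write one combined file
combined = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
combined.to_csv("claaassA.csv", index=False)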

Using multiple re.sub() calls in one file with Python

I have a file with a large amount of random strings contained within it. There are certain patterns that I want to remove, so I decided to use regex to check for them. So far this code does exactly what I want it to:
#!/usr/bin/python
import csv
import re
import sys
import pdb
f = open('output.csv', 'w')
with open('retweet.csv', 'rb') as inputfile:
    read = csv.reader(inputfile, delimiter=',')
    for row in read:
        f.write(re.sub(r'#\s\w+', ' ', row[0]))
        f.write("\n")
f.close()

f = open('output2.csv', 'w')
with open('output.csv', 'rb') as inputfile2:
    read2 = csv.reader(inputfile2, delimiter='\n')
    for row in read2:
        a = re.sub('[^a-zA-Z0-9]', ' ', row[0])
        b = str.split(a)
        c = "+".join(b)
        f.write("http://www.google.com/webhp#q=" + c + "&btnI\n")
f.close()
The problem is that I would like to avoid opening and closing an intermediate file for each pattern, as this can get messy if I need to check for more patterns. How can I perform multiple re.sub() calls on the same file and write it out to a new file with all substitutions applied?
Thanks for any help!
Apply all your substitutions in one go on the current line:
with open('retweet.csv', 'rb') as inputfile:
    read = csv.reader(inputfile, delimiter=',')
    for row in read:
        text = row[0]
        text = re.sub(r'#\s\w+', ' ', text)
        text = re.sub(another_expression, another_replacement, text)
        # etc.
        f.write(text + '\n')
Note that opening a file with csv.reader(..., delimiter='\n') sounds very much like you are treating that file as a plain sequence of lines; you could just loop over the file directly:
with open('output.csv', 'rb') as inputfile2:
    for line in inputfile2:
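To keep the whole thing to a single pass over the input even as the pattern list grows, one option is to hold (pattern, replacement) pairs in a list and apply them all to each row; a minimal sketch (the two patterns are taken from the question, and the file modes assume Python 3):

import csv
import re

substitutions = [
    (re.compile(r'#\s\w+'), ' '),
    (re.compile(r'[^a-zA-Z0-9]'), ' '),
]

with open('retweet.csv', 'r') as inputfile, open('output.csv', 'w') as outfile:
    for row in csv.reader(inputfile, delimiter=','):
        text = row[0]
        # apply every substitution to the same line before writing it out
        for pattern, replacement in substitutions:
            text = pattern.sub(replacement, text)
        outfile.write(text + '\n')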

Better approach for reading/writing files in python?

Suppose I have a file (say file1.txt) with around 3 MB of data or more. If I want to write this data to a second file (say file2.txt), which of the following approaches is better?
Language used: Python 2.7.3
Approach 1:
file1_handler = file("file1.txt", 'r')
for lines in file1_handler:
    line = lines.strip()
    # Perform some operation
    file2_handler = file("file2.txt", 'a')
    file2_handler.write(line)
    file2_handler.write('\r\n')
    file2_handler.close()
file1_handler.close()
Approach 2:
file1_handler = file("file1.txt", 'r')
file2_handler = file("file2.txt", 'a')
for lines in file1_handler:
    line = lines.strip()
    # Perform some operation
    file2_handler.write(line)
    file2_handler.write('\r\n')
file2_handler.close()
file1_handler.close()
I think approach two will be better because you just have to open and close file2.txt once. What do you say?
Use with, it will close the files automatically for you:
with open("file1.txt", 'r') as in_file, open("file2.txt", 'a') as out_file:
    for lines in in_file:
        line = lines.strip()
        # Perform some operation
        out_file.write(line)
        out_file.write('\r\n')
Use open instead of file, file is deprecated.
Of course it's unreasonable to open file2 on every line of file1.
I was recently doing something similar (if I understood you well). How about:
file = open('file1.txt', 'r')
file2 = open('file2.txt', 'wt')
for line in file:
    newLine = line.strip()
    # You can do your operation here on newLine
    file2.write(newLine)
    file2.write('\r\n')
file.close()
file2.close()
This approach works like a charm!
My solution (derived from Pavel Anossov + buffering):
dim = 1000
buffer = []
with open("file1.txt", 'r') as in_file, open("file2.txt", 'a') as out_file:
    for i, lines in enumerate(in_file):
        line = lines.strip()
        # Perform some operation
        buffer.append(line)
        if i % dim == dim - 1:
            for bline in buffer:
                out_file.write(bline)
                out_file.write('\r\n')
            buffer = []
    # write out whatever is left in the buffer once the loop ends
    for bline in buffer:
        out_file.write(bline)
        out_file.write('\r\n')
Pavel Anossov gave the right solution first; this is just a suggestion ;)
There is probably a more elegant way to implement this. If anyone knows one, please tell us.
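One candidate, using writelines so each flush is a single call instead of a loop of write calls (a sketch of the same buffering idea, not benchmarked):

dim = 1000
buffer = []
with open("file1.txt", 'r') as in_file, open("file2.txt", 'a') as out_file:
    for lines in in_file:
        # Perform some operation
        buffer.append(lines.strip() + '\r\n')
        if len(buffer) == dim:
            out_file.writelines(buffer)
            buffer = []
    # write whatever is still buffered after the loop
    if buffer:
        out_file.writelines(buffer)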
