import csv
impFileName = []
impFileName.append("file_1.csv")
impFileName.append("file_2.csv")
expFileName = "MasterFile.csv"
l = []
overWrite = False
comma = ","
for f in range(len(impFileName)):
    with open(impFileName[f], "r") as impFile:
        table = csv.reader(impFile, delimiter=comma)
        for row in table:
            data_1 = row[0]
            data_2 = row[1]
            data_3 = row[2]
            data_4 = row[3]
            data_5 = row[4]
            data_6 = row[5]
            dic = {"one": data_1, "two": data_2, "three": data_3,
                   "four": data_4, "five": data_5, "six": data_6}
            for i in range(len(l)):
                if l[i]["one"] == data_1:
                    print("Data, where one = " + data_1 + " has been updated using the data from " + impFileName[f])
                    l[i] = dic
                    overWrite = True
                    break
            if overWrite == False:
                l.append(dic)
            else:
                overWrite = False
    print(impFileName[f] + " has been added to the list 'l'")

with open(expFileName, "a") as expFile:
    print("Master file now being created...")
    for i in range(len(l)):
        expFile.write(l[i]["one"] + comma + l[i]["two"] + comma + l[i]["three"] + comma
                      + l[i]["four"] + comma + l[i]["five"] + comma + l[i]["six"] + "\r\n")
print("Process Complete")
This program takes two (or more) .csv files and compares the uniqueID (data_1) of each row against all the others. If there is a match, it assumes the current row is an updated version and overwrites the stored one. If there is no match, it's a new entry.
I store each row's data in a dictionary, which is then stored in the list "l".
Once all the files have been processed, I output the list "l" to the "MasterFile.csv" in the specified format.
---THE PROBLEM---
The last row of "File_1.csv" and the first row of "File_2.csv" end up on the same line in the output file. I would like it to continue on a new line.
--Visual
...
data_1,data_2,data_3,data_4,data_5,data_6
data_1,data_2,data_3,data_4,data_5,data_6DATA_1,DATA_2,DATA_3,DATA_4,DATA_5,DATA_6
DATA_1,DATA_2,DATA_3,DATA_4,DATA_5,DATA_6
...
NOTE: There are no header rows in any of the .csv files.
I've also tried this using only "\n" at the end of the expFile.write line; same result.
Just a little suggestion: comparing two files this way looks quite expensive. Try using pandas in the following way.
import pandas
data1 = pandas.read_csv("file_1.csv")
data2 = pandas.read_csv("file_2.csv")
# Merging Two Dataframes
combinedData = data1.append(data2,ignore_index=True)
# Dropping Duplicates
# give the name of the column on which you are comparing the uniqueness
uniqueData = combinedData.drop_duplicates(["columnName"])
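One thing worth noting: newer pandas releases have removed DataFrame.append, and the snippet above stops short of writing the result back out. A concat-based sketch of the same idea, using the file names from the question (and header=None because the files have no header row), might look like this:

import pandas

data1 = pandas.read_csv("file_1.csv", header=None)   # no header rows, per the question
data2 = pandas.read_csv("file_2.csv", header=None)

# Merge the two frames; drop duplicate IDs in column 0, keeping the row that
# came last so the later file "wins", as in the original script.
combinedData = pandas.concat([data1, data2], ignore_index=True)
uniqueData = combinedData.drop_duplicates(subset=[0], keep="last")

uniqueData.to_csv("MasterFile.csv", index=False, header=False)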
I tried running your program and it is OK. Your only problem is in the line
with open(expFileName, "a") as expFile:
where you use "a" (append mode), so if you run your program again and again, it will keep appending to this file.
Use "w" instead of "a".
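Concretely, only the open mode in your output block needs to change:

# "w" truncates MasterFile.csv on every run, so stale rows from earlier tests can't linger
with open(expFileName, "w") as expFile:
    print("Master file now being created...")
    for i in range(len(l)):
        expFile.write(l[i]["one"] + comma + l[i]["two"] + comma + l[i]["three"] + comma
                      + l[i]["four"] + comma + l[i]["five"] + comma + l[i]["six"] + "\r\n")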
A'ight guys. I think I made a booboo.
1) Because I was using "a" (append) rather than "w" (write), and I'd forgotten to clear the file before my last 2 or 3 tests, I was always looking at the same (top 50 or so) rows. That meant I'd fixed my bug ages ago but was still looking at old data...
2) Carriage returns were being read into the last value of the dictionary (data_6), so when the rows were appended to the master file I ended up with "\r\r\n" at the end (a minimal fix is sketched below).
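For anyone who hits the same symptom, a minimal fix (assuming the stray "\r" only ever shows up on the last field, as it did for me) is to strip it as the row is read:

# strip any trailing "\r" picked up from the source file before storing the row
data_6 = row[5].strip()
dic = {"one": data_1, "two": data_2, "three": data_3,
       "four": data_4, "five": data_5, "six": data_6}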
Thanks Vivek Srinivasan for expanding my python knowledge. I will look at pandas and have a play.
Thanks to MarianD for pointing out the "a"/"w" error.
Thanks to Moses Koledoye for pointing out the "\r" error.
Sorry for wasting your time.
Related
My code attempts to split several data tables into year-long chunks, then correlate each year against all the other years and collect the correlation values into a matrix. I am writing these outputs to a csv file, and while that works fine for the matrices themselves, when I try to write the name of the column and table, they get split into their individual characters.
def split_into_yearly_chunks(egauge, column, offset):
    split_into_chunks_stmnt = " SELECT " + column + \
                              " FROM " + egauge + \
                              " OFFSET " + offset + " LIMIT 525600 "
    year_long_chunk = pd.read_sql_query(split_into_chunks_stmnt, engine)
    return year_long_chunk

for x in prof_list:
    for y in col_list:
        list_of_year_long_chunks = []
        for z in off_list:
            year_long_chunk = split_into_yearly_chunks(x, y, z)
            if len(year_long_chunk) == 525600:
                list_of_year_long_chunks.append(year_long_chunk)
        corr_matrix = []
        for corr_year in list_of_year_long_chunks:
            corr_row = []
            for corr_partner in list_of_year_long_chunks:
                corr_value, p_coef = spearmanr(corr_year, corr_partner)
                corr_row.append(corr_value)
            corr_matrix.append(corr_row)
        print(x, '; ', y, '; ')
        with open('correlation_data_58_profiles.csv', 'a') as f:
            thewriter = csv.writer(f)
            thewriter.writerow(x)
            thewriter.writerow(y)
        for row in corr_matrix:
            print(row)
            with open('correlation_data_58_profiles.csv', 'a', newline='') as f:
                thewriter = csv.writer(f)
                thewriter.writerow(row)
(Really only the last 10 or so lines in my code are the problem, but I figured I'd give the whole thing for background.) My prof_list, col_list, and off_list are all lists of strings.
The way this gets stored in my csv file looks like this:
e,g,a,u,g,e,1,3,8,3,0
g,r,i,d
1.0,0.7811790818745755,0.7678768782119194,0.7217461539833535
0.7811790818745755,0.9999999999999998,0.7614854144434556,0.714875063672517
0.7678768782119194,0.7614854144434556,0.9999999999999998,0.7215907332829061
0.7217461539833535,0.7148750636725169,0.7215907332829061,0.9999999999999998
I'd like egauge13830 and grid not to be separated by commas, and other answers I've seen wouldn't work for the for loop that I have. How can I do this?
csv.writer(...).writerow expects a list of values, representing the values of a single row.
At some places in the code you are giving it single strings:
thewriter.writerow(x)  # x is a string out of prof_list
thewriter.writerow(y)  # y is a string out of col_list
Since it expects lists of strings, it treats each of these strings as a list of individual characters; that's why you get each character as its own "column", separated by commas.
If you want each of these single strings to appear in its own row as a single column value, then you'll need to make each of them into a one-element-list, indicating to the CSV writer that you want a row consisting of a single value:
thewriter.writerow([x])  # `[x]` means "a list comprised only of x"
thewriter.writerow([y])
Also bear in mind that a CSV containing two rows of a single value each, followed by N rows of K values each would be kind of hard to further process; so you should consider if that's really what you want.
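Putting it together with the loop from the question, a rough sketch that writes the two label rows and the whole matrix in a single pass (reusing x, y and corr_matrix from the question's code) could be:

import csv

with open('correlation_data_58_profiles.csv', 'a', newline='') as f:
    thewriter = csv.writer(f)
    thewriter.writerow([x])      # one row, one cell: the table name, e.g. "egauge13830"
    thewriter.writerow([y])      # one row, one cell: the column name, e.g. "grid"
    for row in corr_matrix:      # each matrix row is already a list of values
        thewriter.writerow(row)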
I have a list of sorted data arranged so that each item in the list is a csv line to be written to file.
The final step of the script checks the contents of each field, and if all but the last field match, it copies the current line's last field onto the end of the previous line's last field.
Once I've found and processed one of these matches, I'd like to skip the current line (the one the field was copied from), so that only one of the two lines ends up in the output.
Here's an example set of data
field1,field2,field3,field4,something
field1,field2,field3,field4,else
Desired output
field1,field2,field3,field4,something else
This is what I have so far
output_csv = ['field1,field2,field3,field4,something',
              'field1,field2,field3,field4,else']

# run through the output
# open and create a csv file to save output
with open('output_table.csv', 'w') as f:
    previous_line = None
    part_duplicate_line = None
    part_duplicate_flag = False
    for line in output_csv:
        part_duplicate_flag = False
        if previous_line is not None:
            previous = previous_line.split(',')
            current = line.split(',')
            if (previous[0] == current[0]
                    and previous[1] == current[1]
                    and previous[2] == current[2]
                    and previous[3] == current[3]):
                print(previous[0], current[0])
                previous[4] = previous[4].replace('\n', '') + ' ' + current[4]
                part_duplicate_line = ','.join(previous)
                part_duplicate_flag = True
                f.write(part_duplicate_line)
            if part_duplicate_flag is False:
                f.write(previous_line)
        previous_line = line
At the moment the script adds the joined line but doesn't skip the next line. I've tried various renditions of continue statements after part_duplicate_line is written to file, but to no avail.
Looks like you want one entry for each combination of the first 4 fields.
You can use a dict to aggregate the data:
# First we extract the keys and values
output_csv_keys = list(map(lambda x: ','.join(x.split(',')[:-1]), output_csv))
output_csv_values = list(map(lambda x: x.split(',')[-1], output_csv))

# Then we construct a dictionary with these keys and combine the values into a list
from collections import defaultdict
output_csv_dict = defaultdict(list)
for key, value in zip(output_csv_keys, output_csv_values):
    output_csv_dict[key].append(value)

# Then we extract the key/value combinations from this dictionary into a list
for_printing = [','.join([k, ' '.join(v)]) for k, v in output_csv_dict.items()]
print(for_printing)
#Output is ['field1,field2,field3,field4,something else']
#Each entry of this list can be output to the csv file
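Since the goal is a CSV file, each of those joined strings can then be written straight out, for example like this (output_table.csv is the file name used in the question):

with open('output_table.csv', 'w') as f:
    for entry in for_printing:
        f.write(entry + '\n')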
I propose encapsulating what you want to do in a function whose important part obeys this logic:
either join the new info onto the old record,
or output the old record and forget it;
of course, at the end of the loop there is in any case a dangling old record left to output.
def join(inp_fname, out_fname):
    '''Input file contains sorted records, when two (or more) records differ
    only in the last field, we join the last fields with a space
    and output only once, otherwise output the record as-is.'''
    ######################### Prepare for action ##########################
    from csv import reader, writer
    with open(inp_fname) as finp, open(out_fname, 'w') as fout:
        r, w = reader(finp), writer(fout)
        ######################### Important Part starts here ##############
        old = next(r)
        for new in r:
            if old[:-1] == new[:-1]:
                old[-1] += ' ' + new[-1]
            else:
                w.writerow(old)
                old = new
        w.writerow(old)
To check what I've proposed you can use these two snippets (note that these records are shorter than yours, but it's an example and it doesn't matter because we use only -1 to index our records).
The 1st one has a "regular" last record
open('a0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n3,3,0\n')
join('a0.csv', 'a1.csv')
while the 2nd has a last record that must be joined to the previous one.
open('b0.csv', 'w').write('1,1,2\n1,1,3\n1,2,0\n1,3,1\n1,3,2\n')
join('b0.csv', 'b1.csv')
If you run the snippets (as I did before posting) in the environment where you have defined join, you should get what you want.
So I have two csv files. Book1.csv has more data than similarities.csv, so I want to pull out the rows in Book1.csv that do not occur in similarities.csv. Here's what I have so far:
with open('Book1.csv', 'rb') as csvMasterForDiff:
    with open('similarities.csv', 'rb') as csvSlaveForDiff:
        masterReaderDiff = csv.reader(csvMasterForDiff)
        slaveReaderDiff = csv.reader(csvSlaveForDiff)
        testNotInCount = 0
        testInCount = 0
        for row in masterReaderDiff:
            if row not in slaveReaderDiff:
                testNotInCount = testNotInCount + 1
            else:
                testInCount = testInCount + 1
        print('Not in file: ' + str(testNotInCount))
        print('Exists in file: ' + str(testInCount))
However, the results are
Not in file: 2093
Exists in file: 0
I know this is incorrect because, while at least the first 16 entries in Book1.csv do not exist in similarities.csv, not all of them are missing. What am I doing wrong?
A csv.reader object is an iterator, which means you can only iterate through it once. You should be using lists/sets for containment checking, e.g.:
# csv.reader yields rows as lists, which aren't hashable, so convert each row
# to a tuple before putting it in a set for fast membership tests
slave_rows = set(map(tuple, slaveReaderDiff))
for row in masterReaderDiff:
    if tuple(row) not in slave_rows:
        testNotInCount += 1
    else:
        testInCount += 1
After converting the rows into sets, you can do a lot of helpful set operations without writing much code.
# Again, convert the list rows to tuples so they can live in a set
slave_rows = set(map(tuple, slaveReaderDiff))
master_rows = set(map(tuple, masterReaderDiff))

master_minus_slave_rows = master_rows - slave_rows
common_rows = master_rows & slave_rows

print('Not in file: ' + str(len(master_minus_slave_rows)))
print('Exists in file: ' + str(len(common_rows)))
Here are various set operations that you can do.
This little snippet of code is my attempt to pull multiple unique values out of rows in a CSV. The CSV header looks something like this:
descr1, fee part1, fee part2, descr2, fee part1, fee part2,
with the descr columns having many unique names in a single column. I want to take these unique fee names and make a new header out of them. To do this I decided to start by getting all the different descr column names, so that when I start pulling data from the actual rows I can check whether that row has a fee amount or one of the fee names I need. There are probably a lot of things wrong with this code, but I am a beginner. I really just want to know why my first if statement is never triggered when the l in fin equals a comma; I know it must at some point, as a comma gets written into my row string. Thanks!
row = ''
header = ''
columnames = ''
cc = ''
#fout = open(","w")
fin = open("raw data.csv", "rb")
for l in fin:
    if ',' == l:
        if 'start of cust data' not in row:
            if 'descr' in row:
                columnames = columnames + ' ' + row
                row = ''
            else:
                pass
        else:
            pass
    else:
        row = row + l
print(columnames)
When you iterate over a file, you get lines, not characters -- and they have the newline character, \n, at the end. Your if ',' == l: statement will never succeed because even if you had a line with only a single comma in it, the value of l would be ",\n".
I suggest using the csv module: you'll get much better results than trying to do this by hand like you're doing.
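For instance, a rough sketch of the header-scanning step with csv.reader might look like this (the file name and the 'descr' substring test are taken from your code; it assumes the descr names all sit on one line of the file, so adjust it to your real layout):

import csv

with open("raw data.csv", newline="") as fin:
    reader = csv.reader(fin)
    for fields in reader:                 # each row arrives already split on commas
        columnames = [name for name in fields if 'descr' in name]
        if columnames:                    # found the line holding the descr columns
            break

print(columnames)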
I'm stuck in a script I have to write and can't find a way out...
I have two files with partly overlapping information. Based on the information in one file I have to extract info from the other and save it into multiple new files.
The first is simply a table with IDs and group information (which is used for the splitting).
The other contains the same IDs, but each twice with slightly different information.
What I'm doing:
I create a list of lists with ID and group information, like this:
table = [[ID, group], [ID, group], [ID, group], ...]
Then, because the second file is huge and not sorted in the same way as the first, I want to create a dictionary as an index. In this index I would like to save each ID and where it can be found inside the file, so I can quickly jump there later. The problem, of course, is that every ID appears twice. My simple solution (though I have doubts about it) is adding an -a or -b to the ID:
index = {"ID-a": [FPos, length], "ID-b": [FPOS, length], "ID-a": [FPos, length], ...}
The code for this:
for line in file:
    read = (line.split("\t"))[0]
    if not (read + "-a") in indices:
        index = read + "-a"
        length = len(line)
        indices[index] = [FPos, length]
    else:
        index = read + "-b"
        length = len(line)
        indices[index] = [FPos, length]
    FPos += length
What I am wondering now is if the next step is actually valid (I don't get errors, but I have some doubts about the output files).
for name in table:
    head = name[0]
    ## first round
    (FPos, length) = indices[head + "-a"]
    file.seek(FPos)
    line = file.read(length)
    line = line.rstrip()
    items = line.split("\t")
    output = ["#" + head + " " + "1:N:0:" + "\n" + items[9] + "\n" + "+" + "\n" + items[10] + "\n"]
    name.append(output)
    ## second round
    (FPos, length) = indices[head + "-b"]
    file.seek(FPos)
    line = file.read(length)
    line = line.rstrip()
    items = line.split("\t")
    output = ["#" + head + " " + "2:N:0:" + "\n" + items[9] + "\n" + "+" + "\n" + items[10] + "\n"]
    name.append(output)
Is it ok to use a for loop like that?
Is there a better, cleaner way to do this?
Use a defaultdict(list) to save all your file offsets by ID:
from collections import defaultdict

index = defaultdict(list)
for line in file:
    # ...code that loops through file finding ID lines...
    index[id_value].append((fileposn, length))
The defaultdict will take care of initializing to an empty list on the first occurrence of a given id_value, and then the (fileposn,length) tuple will be appended to it.
This will accumulate all references to each id into the index, whether there are 1, 2, or 20 references. Then you can just search through the given fileposn's for the related data.
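A rough sketch of how the lookup loop from your question could then change (reusing table and the items[9]/items[10] layout from your code; the 1:N:0:/2:N:0: tag is derived from the position of each entry rather than the -a/-b suffix):

# Instead of looking up head + "-a" and head + "-b" separately, walk every
# (FPos, length) pair recorded for this ID, in the order they were found.
for name in table:
    head = name[0]
    for read_no, (FPos, length) in enumerate(index[head], start=1):
        file.seek(FPos)
        items = file.read(length).rstrip().split("\t")
        output = ["#" + head + " " + str(read_no) + ":N:0:" + "\n"
                  + items[9] + "\n" + "+" + "\n" + items[10] + "\n"]
        name.append(output)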