Compare 2 large CSVs using Python and output the differences

I am writing a program to compare all files and directories between two filepaths (basically, the files' metadata, content, and internal directories should match).
File content comparison is done row by row. The dimensions of the CSVs may or may not be the same, but the approaches below generally handle the case where the dimensions differ.
The problem is that the processing time is far too slow.
Some context:
The two files are first identified as different using filecmp.
The particular problematic CSV is ~11k columns by 800 rows.
My program will not know the data types inside the CSV beforehand, so defining the dtypes for pandas is not an option.
Difflib does an excellent job if the CSV file is small, but not for this particular use case.
I've looked at all the related questions on SO and tried the approaches below, but the processing time was terrible, and Approach 3 gives weird results.
Approach 1 (Pandas) - terribly slow, and I keep getting this error:
UserWarning: You are merging on int and float columns where the float values are not equal to their int representation.
import pandas as pd
import numpy as np
df1 = pd.read_csv(f1)
df2 = pd.read_csv(f2)
diff = df1.merge(df2, how='outer', indicator='exists').query("exists!='both'")
print(diff)
Approach 2 (Difflib) - terribly slow for this huge CSV
import difflib

def CompareUsingDiffLib(file1_lines, file2_lines):
    h = difflib.HtmlDiff()  # h was undefined in the original snippet
    html = h.make_file(file1_lines, file2_lines, context=True, numlines=0)
    htmlfilepath = filePath + "\\htmlFiles"  # filePath comes from elsewhere in the program
    with open(htmlfilepath, 'w') as fh:
        fh.write(html)

with open(file1) as f, open(file2) as z:
    file1_lines = f.readlines()
    file2_lines = z.readlines()
CompareUsingDiffLib(file1_lines, file2_lines)
Approach 3 (Pure Python) - incorrect results

with open(f1) as f, open(f2) as z:
    file1 = f.readlines()
    file2 = z.readlines()

# check row numbers of diffs in file 1
for line in file1:
    if line not in file2:
        print(file1.index(line))
# It reports every row from row 278 to the last row
# as not being in file 2, which is incorrect.
# I checked using difflib, and using Excel as well;
# no idea why the results are like that.
# Running the code below shows the same result as the block above.
for line in file2:
    if line not in file1:
        print(file2.index(line))
Approach 4 (csv-diff) - terribly slow

from csv_diff import load_csv, compare

diff = compare(
    load_csv(open("one.csv")),
    load_csv(open("two.csv"))
)
Can anybody please help with either:
an approach with less processing time, or
debugging Approach 3?

Comparing the files with readlines() and just testing for membership ("is this in that?") is not the same as diffing the lines.
with open(f1) as f, open(f2) as z:
    file1 = f.readlines()
    file2 = z.readlines()

for line in file1:
    if line not in file2:
        print(file1.index(line))
Consider these two CSVs:
file1.csv file2.csv
----------- -----------
a,b,c,d a,b,c,d
1,2,3,4 1,2,3,4
A,B,C,D i,ii,iii,iv
i,ii,iii,iv A,B,C,D
That script will produce nothing (and give the false impression there's no diff) because every line in file 1 is in file 2, even though the files differ line-for-line. (I cannot say why you think you were getting false positives, though, without seeing the files.)
I recommend using the csv module and iterating the files row by row, and then, where rows differ, column by column:
import csv

path1 = "file1.csv"
path2 = "file2.csv"

with open(path1) as f1, open(path2) as f2:
    reader1 = csv.reader(f1)
    reader2 = csv.reader(f2)

    for i, row1 in enumerate(reader1):
        try:
            row2 = next(reader2)
        except StopIteration:
            print(f"Row {i+1}, f1 has this extra row compared to f2")
            continue

        if row1 == row2:
            continue

        if len(row1) != len(row2):
            print(f"Row {i+1} of f1 has {len(row1)} cols, f2 has {len(row2)} cols")
            continue

        for j, cell1 in enumerate(row1):
            cell2 = row2[j]
            if cell1 != cell2:
                print(f'Row {i+1}, Col {j+1} of f1 is "{cell1}", f2 is "{cell2}"')

    for row2 in reader2:
        i += 1
        print(f"Row {i+1}, f2 has this extra row compared to f1")
This uses an iterator over file1 to drive an iterator over file2, and it accounts for any difference in row counts between the two files: if file1 has more rows than file2, the StopIteration handler just notes each extra row, and if file2 has rows left to read at the very bottom, the final loop prints each of them as a difference.
When I run that against these files:
file1 file2
----------- ----------
a,b,c,d a,b,c
1,2,3,4 1,2,3,4
A,B,C,D A,B,C,Z
i,ii,iii,iv i,ii,iii,iv
x,xo,xox,xoxo
I get:
Row 1 of f1 has 4 cols, f2 has 3 cols
Row 3, Col 4 of f1 is "D", f2 is "Z"
Row 5, f2 has this extra row compared to f1
If I swap path1 and path2, I get this:
Row 1 of f1 has 3 cols, f2 has 4 cols
Row 3, Col 4 of f1 is "Z", f2 is "D"
Row 5, f1 has this extra row compared to f2
And it does this fast. I mocked up two 800 x 11_000 CSVs with very, very small differences between rows (if any) and it processed all diffs in under a second of user time (not counting printing).
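For reference, a rough sketch of how two such test files might be mocked up (the sizes match the question; the ~1% mutation rate and "DIFF" marker are my own choices):

import csv
import random

rows, cols = 800, 11_000
with open("file1.csv", "w", newline="") as f1, open("file2.csv", "w", newline="") as f2:
    w1, w2 = csv.writer(f1), csv.writer(f2)
    for r in range(rows):
        row = [r * cols + c for c in range(cols)]
        w1.writerow(row)
        if random.random() < 0.01:  # mutate roughly 1% of the rows in the copy
            row[random.randrange(cols)] = "DIFF"
        w2.writerow(row)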

You can use filecmp to compare two files byte by byte. See the docs.
Implementation:
>>> import filecmp
>>> filecmp.cmp('somepath/file1.csv', 'otherpath/file1.csv')
True
>>> filecmp.cmp('somepath/file1.csv', 'otherpath/file2.csv')
True
Note: the file names don't matter; only the contents are compared.
speed comparison against hashing: https://stackoverflow.com/a/1072576/16239119
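One caveat worth adding: by default, filecmp.cmp() treats files with identical os.stat() signatures as equal without reading them. Pass shallow=False to always compare the actual contents:

>>> filecmp.cmp('somepath/file1.csv', 'otherpath/file2.csv', shallow=False)
True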

Related

Loop within loop when comparing csv files in Python

I have two csv files. I am trying to look up a value from the first column of one file (file 1) in the first column of the other file (file 2). If they match, then print the row from file 2.
Pseudo code:

read file1.csv
read file2.csv
loop through file1
    compare each row with each row of file2 in turn
    if file1[0] == file2[0]:
        print row of file 2
file1:
45,John
46,Fred
47,Bill
File2:
46,Roger
48,Pete
49,Bob
I want it to print :
46 Roger
EDIT - these are examples, the actual file is much bigger (5,000 rows, 7 columns)
I have the following:
import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1reader = csv.reader(csvfile1)
    csv2reader = csv.reader(csvfile2)
    for rowcsv1 in csv1reader:
        for rowcsv2 in csv2reader:
            if rowcsv1[0] == rowcsv2[0]:
                print(rowcsv1)
However I am getting no output.
I am aware there are other ways of doing it (with dict, pandas), but I am keen to know why my approach is not working.
EDIT: I now see that it is only iterating through the first row of file 1 and then finishing, but I am unclear how to prevent that (I also understand that this is not the best way to do it).
You open csv2reader = csv.reader(csvfile2), then iterate all the way through it against just the first row of csv1reader - at that point it has reached end of file and will not produce any more data.
So for the second through last rows of csv1reader you are comparing against an exhausted reader, i.e. no comparison takes place.
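A minimal way to make the nested loop itself work (a sketch, assuming the same file names as the question, and at the cost of holding file 2 in memory) is to materialize csvfile2 into a list once so it can be re-iterated:

import csv

with open('csvfile1.csv', newline='') as csvfile1, open('csvfile2.csv', newline='') as csvfile2:
    csv1reader = csv.reader(csvfile1)
    csv2rows = list(csv.reader(csvfile2))  # read once so the rows can be re-iterated
    for rowcsv1 in csv1reader:
        for rowcsv2 in csv2rows:
            if rowcsv1[0] == rowcsv2[0]:
                print(rowcsv2)  # the question asks for the matching row of file 2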
In any case, the nested loop is a very inefficient method; unless you are working with very large files, it is much better to do:

import csv

# load the second file as a lookup table
data = {}
with open("csv2file.csv") as inf2:
    for row in csv.reader(inf2):
        data[row[0]] = row

# now process the first file against it
with open("csv1file.csv") as inf1:
    for row in csv.reader(inf1):
        if row[0] in data:
            print(data[row[0]])
See Hugh Bothwell's answer for why your code isn't working. For a fast way of doing what you stated you want to do in your question, try this:
import csv

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1 = list(csv.reader(csvfile1))
    csv2 = list(csv.reader(csvfile2))

duplicates = {a[0] for a in csv1} & {a[0] for a in csv2}
for row in csv2:
    if row[0] in duplicates:
        print(row)
It gets the duplicate numbers from the two csv files, then loops through the second csv file, printing the row if the number at index 0 is also in the first csv file. This is a much faster algorithm than the nested loops you were attempting.
If order matters, as @hugh-bothwell mentioned in @will-da-silva's answer, you could do:

import csv
from collections import OrderedDict

with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1 = list(csv.reader(csvfile1))
    csv2 = list(csv.reader(csvfile2))

d = {row[0]: row for row in csv2}
keys = OrderedDict.fromkeys([a[0] for a in csv1]).keys()
duplicate_keys = [k for k in keys if k in d]
for k in duplicate_keys:
    print(d[k])
I'm pretty sure there's a better way to do this, but try out this solution; it should work.

import csv

counter = 0
with open('csvfile1.csv', 'rt') as csvfile1, open('csvfile2.csv', 'rt') as csvfile2:
    csv1reader = csv.reader(csvfile1)
    csv2reader = csv.reader(csvfile2)
    for rowcsv1 in csv1reader:
        for rowcsv2 in csv2reader:
            if rowcsv1[counter] == rowcsv2[counter]:
                print(rowcsv1)
        counter += 1  # increment it outside the if statement

How to add data to the new column in existing CSV using Python script

Team,
Two things I am trying to do, as per the code below:
1) Write the data in columns 2, 3, 4, and 8 of file1 to a new file.
2) The data in the 1st column (as copied into the new file) should be searched for in file2; if found, pick the data in the 3rd column of the same row of file2 and write it to a new column of the new file.
Point 1 is working fine; I am having trouble getting output as per point 2.
import csv

f1 = csv.reader(open("C:/Users/file1.csv","rb"))
f2 = csv.writer(open("C:/Users/newfile.csv","wb"))
f3 = csv.reader(open("C:/Users/file2.csv","rb"))

for row in f1:
    if not row[0].startswith("-"):
        f2.writerow((row[1],row[2],row[3],row[7]))
        var1 = row[1]
        for row in f3:
            if var1 in row:
                f2.append(row[2])
There are several issues with your code:
You're opening all the files in binary mode.
For each row in f1, you're (potentially) iterating over all rows in f3, which hurts performance.
Once you have iterated through all rows in f3, you're at the end of the file, and the next iteration will not return any rows.
Here's my suggestion (not tested):
import csv

# Create a lookup from f3
lookup = {}
with open('C:/Users/file2.csv', newline='') as f3:
    csv_f3 = csv.reader(f3)
    for row in csv_f3:
        lookup[row[1]] = row[2]

# Process the rows in f1
with open('C:/Users/file1.csv', newline='') as f1:
    with open('C:/Users/newfile.csv', 'w', newline='') as f2:
        csv_f1 = csv.reader(f1)
        csv_f2 = csv.writer(f2)
        for row in csv_f1:
            if not row[0].startswith("-"):
                try:
                    csv_f2.writerow((row[1], row[2], row[3], row[7], lookup[row[1]]))
                except KeyError:
                    csv_f2.writerow((row[1], row[2], row[3], row[7]))
I suspect that your re-use of the "row" variable name in the second for loop is clobbering the one held in "var1". I would always steer clear of that kind of recycling of variable names in nested loops.
for row_f1 in f1:
    if not row_f1[0].startswith("-"):
        f2.writerow((row_f1[1], row_f1[2], row_f1[3], row_f1[7]))
        var1 = row_f1[1]
        for row_f3 in f3:
            if var1 in row_f3:
                f2.append(row_f3[2])
However, I don't know that that append would do what you need: f2 is a csv.writer and has no append method, as far as I can see. From your use of .append, it looks like you wanted to collect the elements from f1 into a list:

for row_f1 in f1:
    if not row_f1[0].startswith("-"):
        temp_list = [row_f1[1], row_f1[2], row_f1[3], row_f1[7]]
        for row_f3 in f3:
            if temp_list[0] in row_f3:
                temp_list.append(row_f3[2])
        f2.writerow(temp_list)
Though your explanation of what you want to achieve isn't completely clear to me.
EDIT: I think Kristof's solution is far better, I was just trying to arrive at a solution that required minimal changes to your existing code. If you provided an example of what you expect your output to be with given inputs, that would definitely clarify things.
@Andrew, I have made the change in my code as per your input, but only the value of row_f1[1] from the first row makes it into the second for loop; the values of row_f1[1] in the other rows are never considered by the second for loop.

for row_f1 in f1:
    if not row_f1[0].startswith("-"):
        temp_list = [row_f1[1], row_f1[2], row_f1[3], row_f1[7]]
        for row_f3 in f3:
            if temp_list[0] in row_f3:
                temp_list.append(row_f3[2])
        f2.writerow(temp_list)

Combine data from csv files

I have 100 csv files with the same number of columns (but different numbers of rows) in the following pattern:
File 1:
A1,B1,C1
A2,B2,C2
A3,B3,C3
A4,B4,C4
File 2:
*A1*,*B1*,*C1*
*A2*,*B2*,*C2*
*A3*,*B3*,*C3*
File ...
Output:
A1+*A1*+...,B1+*B1*+...,C1+*C1*+...
A2+*A2*+...,B2+*B2*+...,C2+*C2*+...
A3+*A3*+...,B3+*B3*+...,C3+*C3*+...
A4+... ,B4+... ,C4+...
For example:
File 1:
1,0,0
1,0,1
1,0,0
0,1,0
File 2:
1,1,0
1,1,1
0,1,0
Output:
2,1,0
2,1,2
1,1,0
0,1,0
I am really breaking my head over how to solve this... Could anybody give me some advice?
Thanks a lot and best regards,
Julian
Edit:
I want to thank 'pepr' a lot for his very elaborate answer, but I would like to find a solution using pandas, as suggested by 'furas'.
I have found a way to create the variables for all my files like this:

dic = {}
for i in range(14253, 14352):
    try:
        dic['df_{0}'.format(i)] = pandas.read_csv('output_' + str(i) + '.csv')
    except:
        pass

but if I try the suggested

df1['column_A'] += df2['column_*A*']

then, because I have 100 files, in my case it would have to be something like

for residue in residues:
    for number in range(14254, 14255):
        df = dic['df_14253'][residue]
        df += dic['df_' + str(number)][residue]

and I have the problem that my files have different numbers of rows, so the columns are only summed up until the last row of df1. How could I solve this? I think pandas' groupby.sum could be an option, but I don't understand how to use it.
PS: residues is a list which contains all the column headers.
A solution with the standard modules can look like this:
#!python3
import csv
import itertools

fname1 = 'file1.csv'
fname2 = 'file2.csv'
fname_out = 'output.csv'

with open(fname1, newline='') as f1,\
     open(fname2, newline='') as f2,\
     open(fname_out, 'w', newline='') as fout:

    reader1 = csv.reader(f1)
    reader2 = csv.reader(f2)
    writer = csv.writer(fout)

    for row1, row2 in itertools.zip_longest(reader1, reader2, fillvalue=['0', '0', '0']):
        row_out = [int(a) + int(b) for a, b in zip(row1, row2)]
        writer.writerow(row_out)
The itertools module implements zip_longest(), which is similar to the built-in zip(); however, it can process sequences of different lengths. Here the third parameter, fillvalue, is a quick hack -- three columns are hardwired. It could actually be set to [0, 0, 0] (integers instead of strings) because int(0) is also zero.
Each iteration of zip_longest() extracts a tuple of two rows, which are assigned to row1 and row2. Inside the loop, the normal zip() can be used, as you will always have either a row from the file or the fillvalue of zeros. You always get a tuple with one element from the first row and one element from the second row; they have to be converted from string to int, and then they are added to form a single element of row_out.
A better version of the loop, which does not rely on a fixed number of columns, uses the default None as the fillvalue. If one of the rows is None, it is set to a list with the same number of zeros as the other row. This means you can even have rows of different lengths in the same file (as long as corresponding rows have the same length in both files; the opposite could also be handled easily by using zip_longest() in the body of the loop as well).
for row1, row2 in itertools.zip_longest(reader1, reader2):
    if row1 is None:
        row1 = [0] * len(row2)
    elif row2 is None:
        row2 = [0] * len(row1)
    row_out = [int(a) + int(b) for a, b in zip(row1, row2)]
    writer.writerow(row_out)
Use pandas.
It can read CSV files and it can add two columns.
import pandas as pd
df1 = pd.read_csv(filename_1)
df2 = pd.read_csv(filename_2)
df1['column_A'] += df2['column_*A*']
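To address the asker's edit about differing row counts, here is a hedged sketch (assuming all the output_<i>.csv files share the same column headers): DataFrame.add with fill_value=0 aligns rows by index and treats rows missing from one frame as zero, so frames of different lengths can be summed.

import pandas as pd
from functools import reduce

# collect whichever output_<i>.csv files exist (file names as in the question)
frames = []
for i in range(14253, 14352):
    try:
        frames.append(pd.read_csv('output_' + str(i) + '.csv'))
    except FileNotFoundError:
        pass

# add() aligns rows by index; fill_value=0 keeps rows present in only one frame
total = reduce(lambda a, b: a.add(b, fill_value=0), frames)
total.to_csv('combined.csv', index=False)  # 'combined.csv' is a hypothetical output name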

Merging two files by one common set of identifiers with python

I would like to merge two tab-delimited text files that share one common column. I have an 'identifier file' that looks like this (2 columns by 1050 rows):
module 1 gene 1
module 1 gene 2
..
module x gene y
I also have a tab-delimited 'target' text file that looks like this (36 columns by 12000 rows):
gene 1 sample 1 sample 2 etc
gene 2 sample 1 sample 2 etc
..
gene z sample 1 sample 2 etc
I would like to merge the two files based on the gene identifier and have both the matching expression values and module affiliations from the identifier and target files. Essentially, I want to take the genes from the identifier file, find them in the target file, and create a new file with module #, gene #, and expression values all together. Any suggestions would be welcome.
My desired output is gene ID tab module affiliation tab sample values separated by tabs.
Here is the script I came up with. It does not produce any error messages, but it gives me an empty file.
import csv

expression_values = {}
matches = []
with open("identifiers.txt") as ids, open("target.txt") as target:
    for line in target:
        expression_values = {line.split()[0]: line.split()}
    for line in ids:
        block_idents = line.split()
        for gene in expression_values.iterkeys():
            if gene == block_idents[1]:
                matches.append(block_idents[0] + block_idents[1] + expression_values)

csvfile = "modules.csv"
with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in matches:
        writer.writerow([val])
Thanks!
These lines of code are not doing what you are expecting them to do:
for line in target:
    expression_values = {line.split()[0]: line.split()}
for line in ids:
    block_idents = line.split()
    for gene in expression_values.iterkeys():
        if gene == block_idents[1]:
            matches.append(block_idents[0] + block_idents[1] + expression_values)
The expression_values and block_idents variables only ever hold values from the current line of the file being read. In other words, the dictionary and the list are not "growing" as more lines are read. Also, TSV files can be parsed with less effort using the csv module.
There are a few assumptions I am making in this rough solution:
The "genes" in the first file are the only "genes" that will appear in the second file.
There could be duplicate "genes" in the first file.
First, construct a map of the data in the first file:

import csv
from collections import defaultdict

gene_map = defaultdict(list)
with open(first_file, newline='') as file_one:
    csv_reader = csv.reader(file_one, delimiter='\t')
    for row in csv_reader:
        gene_map[row[1]].append(row[0])
Then read the second file and write to the output file simultaneously:

with open(sec_file, newline='') as file_two, open(op_file, 'w', newline='') as out_file:
    csv_reader = csv.reader(file_two, delimiter='\t')
    csv_writer = csv.writer(out_file, delimiter='\t')
    for row in csv_reader:
        values = gene_map.get(row[0], [])
        op_list = [row[0]]
        op_list.extend(values)
        op_list.extend(row[1:])  # append the sample values to the output row
        csv_writer.writerow(op_list)
There are a number of problems with the existing approach, not least of which is that you are throwing away all data from the files except for the last line of each: the assignment inside each "for line in" loop replaces the contents of the variable, so only the last assignment, for the last line, has any effect.
Assuming each gene appears in only one module, I suggest instead you read the "ids" into a dictionary, saving the module for each gene id:

geneMod = {}
for line in ids:
    fields = line.rstrip("\n").split("\t")
    geneMod[fields[1]] = fields[0]  # key by gene id, value is the module
Then you can just go through the target lines; for each line, split it, get the gene id (gene = targetsplit[0]) and save (or output) the same split fields with the module value inserted, e.g. print(gene, geneMod[gene], targetsplit[1:]).
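A minimal end-to-end sketch of that approach (assuming tab-delimited identifiers.txt and target.txt as in the question; merged.txt is a hypothetical output name):

geneMod = {}
with open("identifiers.txt") as ids:
    for line in ids:
        module, gene = line.rstrip("\n").split("\t")
        geneMod[gene] = module

with open("target.txt") as target, open("merged.txt", "w") as out:
    for line in target:
        fields = line.rstrip("\n").split("\t")
        gene = fields[0]
        if gene in geneMod:  # skip genes without a module affiliation
            out.write("\t".join([gene, geneMod[gene]] + fields[1:]) + "\n")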

Comparing two csv files and getting difference

I have two csv files that I need to compare and then spit out the differences:
CSV FORMAT:
Name Produce Number
Adam Apple 5
Tom Orange 4
Adam Orange 11
I need to compare the two csv files and report whether there is a difference between Adam's apples on sheet 1 and sheet 2, and do that for all names and produce numbers. Both CSV files will be formatted the same.
Any pointers will be greatly appreciated
I have used csvdiff
$pip install csvdiff
$csvdiff --style=compact col1 a.csv b.csv
Link to package on pypi
I found this link useful
If your CSV files aren't so large they'll bring your machine to its knees if you load them into memory, then you could try something like:
import csv

# Rows are read as tuples so they are hashable and can go into sets.
with open('file1.csv') as f1, open('file2.csv') as f2:
    set1 = set(map(tuple, csv.reader(f1)))
    set2 = set(map(tuple, csv.reader(f2)))

print(set1 - set2)  # in 1, not in 2
print(set2 - set1)  # in 2, not in 1
print(set1 & set2)  # in both
For large files, you could load them into a SQLite3 database and use SQL queries to do the same, or sort by relevant keys and then do a match-merge.
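For the SQLite route, a rough sketch (the table and column names are hypothetical, following the Name/Produce/Number sample data):

import csv
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (name TEXT, produce TEXT, number INTEGER)")
conn.execute("CREATE TABLE t2 (name TEXT, produce TEXT, number INTEGER)")

for table, fname in (("t1", "file1.csv"), ("t2", "file2.csv")):
    with open(fname, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        conn.executemany("INSERT INTO " + table + " VALUES (?, ?, ?)", reader)

# rows whose number differs between the two files for the same name/produce pair
query = """
    SELECT t1.name, t1.produce, t1.number, t2.number
    FROM t1 JOIN t2 ON t1.name = t2.name AND t1.produce = t2.produce
    WHERE t1.number <> t2.number
"""
for row in conn.execute(query):
    print(row)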
One of the best utilities for comparing two different files is diff.
See Python implementation here: Comparing two .txt files using difflib in Python
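A minimal difflib sketch along those lines (assuming the two files are named file1.csv and file2.csv):

import difflib

with open('file1.csv') as f1, open('file2.csv') as f2:
    diff = difflib.unified_diff(f1.readlines(), f2.readlines(),
                                fromfile='file1.csv', tofile='file2.csv')
    for line in diff:
        print(line, end='')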
import csv

def load_csv_to_dict(fname, get_key, get_data):
    with open(fname, newline='') as inf:
        incsv = csv.reader(inf)
        next(incsv)  # skip header
        return {get_key(row): get_data(row) for row in incsv}

def main():
    key = lambda r: tuple(r[0:2])
    data = lambda r: int(r[2])
    f1 = load_csv_to_dict('file1.csv', key, data)
    f2 = load_csv_to_dict('file2.csv', key, data)
    f1keys = set(f1.keys())
    f2keys = set(f2.keys())
    print("Keys in file1 but not file2:")
    print(", ".join(str(a)+":"+str(b) for a, b in (f1keys - f2keys)))
    print("Keys in file2 but not file1:")
    print(", ".join(str(a)+":"+str(b) for a, b in (f2keys - f1keys)))
    print("Differing values:")
    for k in (f1keys & f2keys):
        a, b = f1[k], f2[k]
        if a != b:
            print("{}:{} {} <> {}".format(k[0], k[1], a, b))

if __name__ == "__main__":
    main()
If you want to use Python's csv module along with a generator function, you can use nested looping to compare large .csv files. The example below compares each row using a cursory comparison:
import csv

def csv_lazy_get(csvfile):
    with open(csvfile) as f:
        r = csv.reader(f)
        for row in r:
            yield row

def csv_cmp_lazy(csvfile1, csvfile2):
    gen_2 = csv_lazy_get(csvfile2)
    for row_1 in csv_lazy_get(csvfile1):
        row_2 = next(gen_2)
        print("row_1: ", row_1)
        print("row_2: ", row_2)
        if row_2 == row_1:
            print("row_1 is equal to row_2.")
        else:
            print("row_1 is not equal to row_2.")
    gen_2.close()
Here's a start that does not use difflib. It is really just a point to build from, because maybe Adam and Apple appear twice on the sheet; can you ensure that is not the case? Should the apples be summed, or is that an error?
import csv

sheet1 = {}
with open('sheet.csv', newline='') as fsock:
    rdr = csv.reader(fsock)
    for row in rdr:
        name, produce, amount = row
        sheet1[(name, produce)] = int(amount)  # always an integer?

# repeat the above for the second sheet, then compare
You get the idea?
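To finish the thought, a hedged sketch of the comparison half (sheet1.csv and sheet2.csv are hypothetical names; it assumes each (name, produce) pair appears at most once per sheet):

import csv

def load_sheet(path):
    # map (name, produce) -> amount
    with open(path, newline='') as f:
        return {(name, produce): int(amount) for name, produce, amount in csv.reader(f)}

sheet1 = load_sheet('sheet1.csv')
sheet2 = load_sheet('sheet2.csv')

for key in sorted(sheet1.keys() | sheet2.keys()):
    a, b = sheet1.get(key), sheet2.get(key)
    if a != b:
        print(key, "sheet1:", a, "sheet2:", b)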
