Compare all the CSV files in a folder and print duplicate rows

I have multiple CSV files in a folder, which I want to compare and print the matching rows (where the number of columns could be different). I know how to get duplicates within a file but this case is a little different. Let's say there are two files in a folder and I want to compare them.
CSV1:
H1,H2,H4
C01,23,F
C2,45,M
CSV2:
H1,H2,H3,H4
C01,23,data,F
C01,23,some other data,M
C4,34,data,M
I need my output to check if all the available data (from the one with the least number of columns) matches exactly in another file in the same folder. My output could be like
CSV1,CSV2 (H1:C01,H2:23,H4:F(H3:data))

What about something like:
import csv

def duplines(csv_least_cols, csv_most_cols):
    # collect every data row of the narrower file as a tuple
    rowset = set()
    with open(csv_least_cols, newline='') as csv1:
        r = csv.reader(csv1)
        csv1_cols = next(r)
        for row in r:
            rowset.add(tuple(row))
    # project each row of the wider file onto those columns and look it up
    with open(csv_most_cols, newline='') as csv2:
        dr = csv.DictReader(csv2)
        for drow in dr:
            refcols = tuple(drow[c] for c in csv1_cols)
            if refcols in rowset:
                yield csv1_cols, refcols, drow
You can call this in a loop and perform whatever formatting you want -- this generator deals with the underlying logic, separating out the formatting task to its caller.
So for example to get your peculiar desired CSV1,CSV2 (H1:C01,H2:23,H4:F(H3:data)) style output you could have...:
def formatit(csv_least, csv_most):
    out_start = '{},{} ('.format(csv_least, csv_most)
    for c1cols, refvals, c2dict in duplines(csv_least, csv_most):
        out_middle = []
        for c, v in zip(c1cols, refvals):
            out_middle.append('{}:{}'.format(c, v))
        out_end = []
        for c in c2dict:
            if c in c1cols: continue
            out_end.append('{}:{}'.format(c, c2dict[c]))
        out = '{}{}({}))'.format(out_start, ','.join(out_middle), ','.join(out_end))
        print(out)
You'll notice that the formatting work is substantially more complex than the actual logic (and hence more likely to hide bugs:-) which is why I call your desired format "peculiar".
But I hope this can at least get you started (and you can try out each function separately, making sure the logic is as you desire it before worrying about the formatting:-).
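To cover every file in the folder rather than a single pair, the same idea can be wrapped in a loop over all file pairs. The following is only a minimal sketch and not part of the original answer: the names `read_csv` and `shared_rows_in_folder` and the `*.csv` pattern are assumptions, and in each pair the file with fewer columns is treated as the reference.

```python
import csv
import glob
import itertools
import os


def read_csv(path):
    """Return (header, rows), where rows is a list of dicts."""
    with open(path, newline='') as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return reader.fieldnames or [], rows


def shared_rows_in_folder(folder):
    """Yield (narrow_file, wide_file, matching_row) for every pair of CSV files.

    A row matches when all columns of the narrower file have identical
    values in some row of the wider file.
    """
    paths = sorted(glob.glob(os.path.join(folder, '*.csv')))
    tables = {path: read_csv(path) for path in paths}
    for a, b in itertools.combinations(paths, 2):
        # Use the file with fewer columns as the reference.
        narrow, wide = sorted((a, b), key=lambda p: len(tables[p][0]))
        cols, narrow_rows = tables[narrow]
        wide_cols, wide_rows = tables[wide]
        if not set(cols) <= set(wide_cols):
            continue  # the wider file lacks some reference column
        wanted = {tuple(row[c] for c in cols) for row in narrow_rows}
        for row in wide_rows:
            if tuple(row[c] for c in cols) in wanted:
                yield os.path.basename(narrow), os.path.basename(wide), row
```

Each yielded row is the full dictionary from the wider file, so the extra columns (H3 in the question's example) are available to whatever formatting code consumes the generator.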

Related

Python: Use the "i" counter in while loop as digit for expressions

This seems like it should be very simple but am not sure the proper syntax in Python. To streamline my code I want a while loop (or for loop if better) to cycle through 9 datasets and use the counter to call each file out using the counter as a way to call on correct file.
I would like to use the "i" variable within the while loop so that for each file with sequential names I can get the average of 2 arrays, the max-min of this delta, and the max-min of another array.
Example code of what I am trying to do but the avg(i) and calling out temp(i) in loop does not seem proper. Thank you very much for any help and I will continue to look for solutions but am unsure how to best phrase this to search for them.
temp1 = pd.read_excel("/content/113VW.xlsx")
temp2 = pd.read_excel("/content/113W6.xlsx")
..-> temp9
i=1
while i<=9
avg(i) =np.mean(np.array([temp(i)['CC_H='],temp(i)['CC_V=']]),axis=0)
Delta(i)=(np.max(avg(i)))-(np.min(avg(i)))
deltaT(i)=(np.max(temp(i)['temperature='])-np.min(temp(i)['temperature=']))
i+= 1
EG: The slow method would be repeating code this for each file
avg1 =np.mean(np.array([temp1['CC_H='],temp1['CC_V=']]),axis=0)
Delta1=(np.max(avg1))-(np.min(avg1))
deltaT1=(np.max(temp1['temperature='])-np.min(temp1['temperature=']))
avg2 =np.mean(np.array([temp2['CC_H='],temp2['CC_V=']]),axis=0)
Delta2=(np.max(avg2))-(np.min(avg2))
deltaT2=(np.max(temp2['temperature='])-np.min(temp2['temperature=']))
......

Think of things in terms of lists.
import numpy as np
import pandas as pd

temps = []
for name in ('113VW','113W6',...):  # ... stands for the remaining file names
    temps.append( pd.read_excel(f"/content/{name}.xlsx") )
avg = []
Delta = []
deltaT = []
for data in temps:
    avg.append(np.mean(np.array([data['CC_H='],data['CC_V=']]),axis=0))
    Delta.append(np.max(avg[-1]) - np.min(avg[-1]))
    deltaT.append(np.max(data['temperature=']) - np.min(data['temperature=']))
You could just do your computations inside the first loop, if you don't need the dataframes after that point.
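As a rough illustration of doing the work in a single pass, the sketch below keys the results by file name instead of relying on numbered variables. It rests on assumptions: `summarize`, `summarize_all` and `load` are hypothetical names, and plain dictionaries of lists stand in for the DataFrames so that the snippet runs without any Excel files.

```python
def summarize(data):
    """Return (avg, Delta, deltaT) for one dataset.

    `data` maps column names to equal-length sequences of numbers.
    """
    # Element-wise mean of the two CC columns.
    avg = [(h + v) / 2 for h, v in zip(data['CC_H='], data['CC_V='])]
    delta = max(avg) - min(avg)
    delta_t = max(data['temperature=']) - min(data['temperature='])
    return avg, delta, delta_t


def summarize_all(names, load):
    """Apply `summarize` to every name; `load` turns a name into a dataset."""
    return {name: summarize(load(name)) for name in names}


# Made-up stand-in data, used here only so the example is self-contained.
fake_files = {
    '113VW': {'CC_H=': [1.0, 3.0], 'CC_V=': [3.0, 5.0], 'temperature=': [20.0, 25.0]},
    '113W6': {'CC_H=': [2.0, 2.0], 'CC_V=': [4.0, 8.0], 'temperature=': [18.0, 30.0]},
}
results = summarize_all(fake_files, fake_files.__getitem__)
print(results['113VW'])
```

Looking results up by name (`results['113W6']`) replaces the `avg1`, `avg2`, ... variables from the question.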

The way that I would tackle this problem would be to create a list of filenames, and then iterate through them to do the necessary calculations as per the following:
import numpy as np
import pandas as pd
# Place the files to read into this list
files_to_read = ["/content/113VW.xlsx", "/content/113W6.xlsx"]
results = []
for i, filename in enumerate(files_to_read):
    temp = pd.read_excel(filename)
    # element-wise mean of the two CC columns
    avg_val = np.mean(np.array([temp['CC_H='],temp['CC_V=']]),axis=0)
    Delta = np.max(avg_val) - np.min(avg_val)
    deltaT = np.max(temp['temperature=']) - np.min(temp['temperature='])
    results.append({"avg":avg_val, "Delta":Delta, "deltaT":deltaT})
# Create a dataframe to show the results
df = pd.DataFrame(results)
print(df)
I have included enumerate to grab the index (i) in case you want to use it for anything or include it in the results. For example, you could change the results.append line to something like this:
results.append({"index":i, "Filename":filename, "avg":avg_val, "Delta":Delta, "deltaT":deltaT})

Not sure if I understood the question correctly. But if you want to read the files inside a loop using indexes (i variable), you can create a list to hold the contents of the excel files instead of using 9 different variables.
something like
files = []
files.append(pd.read_excel("/content/113VW.xlsx"))
files.append(pd.read_excel("/content/113W6.xlsx"))
...
then use the index variable to iterate over the list
avg = []
i = 0  # list indexes start at 0, not 1
while i < len(files):
    avg.append(np.mean(np.array([files[i]['CC_H='],files[i]['CC_V=']]),axis=0))
    ...
    i += 1
P.S.: I am not a Pandas/NumPy expert, so you may have to adapt the code to your needs

Pandas groupby and file writing problems

I have some pandas groupby functions that write data to file, but for some reason I'm getting redundant data written to file. Here's the code:
# This function gets applied to each item in the dataframe
def item_grouper(df):
    # Get the frequency of each tag applied to the item
    tag_counts = df['tag'].value_counts()
    # Get the most frequent tag (or tags, assuming a tie)
    max_tags = tag_counts[tag_counts==tag_counts.max()]
    # Get the total number of annotations for the item
    total_anno = len(df)
    # Now, process each user who tagged the item
    return df.groupby('uid').apply(user_grouper,total_anno,max_tags,tag_counts)

# This function gets applied to each user who tagged an item
def user_grouper(df,total_anno,max_tags,tag_counts):
    # subtract user's annotations from total annotations for the item
    total_anno = total_anno - len(df)
    # calculate weight
    weight = np.log10(total_anno)
    # check if user has used (one of) the top tag(s), and adjust max_tag_count
    if len(np.intersect1d(max_tags.index.values,df['iid']))>0:
        max_tag_count = float(max_tags[0]-1)
    else:
        max_tag_count = float(max_tags[0])
    # for each annotation...
    for i,row in df.iterrows():
        # calculate raw score
        raw_score = (tag_counts[row['tag']]-1) / max_tag_count
        # write to file
        out.write('\t'.join(map(str,[row['uid'],row['iid'],row['tag'],raw_score,weight]))+'\n')
    return df
So, one grouping function groups the data by iid (item id), does some processing, and then groups each sub-dataframe by uid (user_id), does some calculation, and writes to an output file. Now, the output file should have exactly one line per row in the original dataframe, but it doesn't! I keep getting the same data written to file multiple times. For instance, if I run:
out = open('data/test','w')
df.head(1000).groupby('iid').apply(item_grouper)
out.close()
The output should have 1000 lines (the code only writes one line per row in the dataframe), but the result output file has 1,997 lines. Looking at the file shows the exact same lines written multiple (2-4) times, seemingly at random (i.e. not all lines are double-written). Any idea what I'm doing wrong here?

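See the docs on apply. Pandas may call the function twice on the first group (to choose between a fast and a slow code path), so any side effects of the function (here, the file writes) happen twice for that group. Because your apply calls are nested, the extra call can occur at both levels, which would explain why some lines show up two, three or four times while others appear only once.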
Your best bet here is probably to iterate over the groups directly, like this:
for group_name, group_df in df.head(1000).groupby('iid'):
    item_grouper(group_df)
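The point of the explicit loop is that the body runs exactly once per group, so each write happens exactly once. The sketch below shows the same pattern without pandas, using `itertools.groupby` on plain tuples as a stand-in. That substitution is an assumption made only to keep the example self-contained, and the names `rows` and `write_groups` are hypothetical.

```python
import io
import itertools
from operator import itemgetter


def write_groups(rows, out):
    """Write one tab-separated line per (iid, uid, tag) row, grouped by iid."""
    # itertools.groupby only groups adjacent items, so sort by the key first.
    ordered = sorted(rows, key=itemgetter(0))
    for iid, group in itertools.groupby(ordered, key=itemgetter(0)):
        group = list(group)
        total_anno = len(group)  # per-item statistic, computed once per group
        for _, uid, tag in group:
            out.write('\t'.join(map(str, [uid, iid, tag, total_anno])) + '\n')


rows = [(1, 'u1', 'rock'), (2, 'u1', 'jazz'), (1, 'u2', 'rock'), (1, 'u3', 'pop')]
out = io.StringIO()
write_groups(rows, out)
lines = out.getvalue().splitlines()
```

Here `lines` contains as many entries as `rows`, which is the one-line-per-row property the question expected.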

I agree with chrisb's determination of the problem. As a cleaner way, consider having your user_grouper() function not save any values, but instead return these. With a structure as
def user_grouper(df, ...):
    (...)
    df['max_tag_count'] = some_calculation
    return df

results = df.groupby(...).apply(user_grouper, ...)
for i,row in results.iterrows():
    # calculate raw score
    raw_score = (tag_counts[row['tag']]-1) / row['max_tag_count']
    # write to file
    out.write('\t'.join(map(str,[row['uid'],row['iid'],row['tag'],raw_score,weight]))+'\n')

Re-writing a python program into VB, how to sort CSV?

About a year back, I wrote a little program in python that basically automates a part of my job (with quite a bit of assistance from you guys!) However, I ran into a problem. As I kept making the program better and better, I realized that Python did not want to play nice with excel, and (without boring you with the details suffice to say xlutils will not copy formulas) I NEED to have more access to excel for my intentions.
So I am starting back at square one with VB (2010 Express if it helps.) The only programming course I ever took in my life was on it, and it was pretty straight forward so I decided I'd go back to it for this. Unfortunately, I've forgotten much of what I had learned, and we never really got this far down the rabbit hole in the first place. So, long story short I am trying to:
1) Read data from a .csv structured as so:
41,332.568825,22.221759,-0.489714,eow
42,347.142926,-2.488763,-0.19358,eow
46,414.9969,19.932693,1.306851,r
47,450.626074,21.878299,1.841957,r
48,468.909171,21.362568,1.741944,r
49,506.227269,15.441723,1.40972,r
50,566.199838,17.656284,1.719818,r
51,359.069935,-11.773073,2.443772,l
52,396.321911,-8.711589,1.83507,l
53,423.766684,-4.238343,1.85591,l
2) Sort that data alphabetically by column 5
3) Then selecting only the ones with an "l" in column 5, sort THOSE numerically by column 2 (ascending order) AND copy them to a new file called coil.csv
4) Then selecting only the ones that have an "r" in column 5, sort those numerically by column 2 (descending order) and copy them to the SAME file coil.csv (appended after the others obviously)
After all of that hoopla I wish to get out:
51,359.069935,-11.773073,2.443772,l
52,396.321911,-8.711589,1.83507,l
53,423.766684,-4.238343,1.85591,l
50,566.199838,17.656284,1.719818,r
49,506.227269,15.441723,1.40972,r
48,468.909171,21.362568,1.741944,r
47,450.626074,21.878299,1.841957,r
46,414.9969,19.932693,1.306851,r
I realize that this may be a pretty involved question, and I certainly understand if no one wants to deal with all this bs, lol. Anyway, some full on code, snippets, ideas or even relevant links would be GREATLY appreciated. I've been, and still am googling, but it's harder than expected to find good reliable information pertaining to this.
P.S. Here is the piece of Python code that did what I am talking about (although it created two separate files for the lefts and rights, which I don't really need), in case it helps.
msgbox(msg="Please locate your survey file in the next window.")
mainfile = fileopenbox(title="Open survey file")
toponame = boolbox(msg="What is the name of the shots I should use for topography? Note: TOPO is used automatically",choices=("Left","Right"))
fieldnames = ["A","B","C","D","E"]
surveyfile = open(mainfile, "r")
left_file = open("left.csv",'wb')
right_file = open("right.csv",'wb')
coil_file = open("coil1.csv","wb")
reader = csv.DictReader(surveyfile, fieldnames=fieldnames, delimiter=",")
left_writer = csv.DictWriter(left_file, fieldnames + ["F"], delimiter=",")
sortedlefts = sorted(reader,key=lambda x:float(x["B"]))
surveyfile.seek(0,0)
right_writer = csv.DictWriter(right_file, fieldnames + ["F"], delimiter=",")
sortedrights = sorted(reader,key=lambda x:float(x["B"]), reverse=True)
coil_writer = csv.DictWriter(coil_file, fieldnames, delimiter=",",extrasaction='ignore')
for row in sortedlefts:
    if row["E"] == "l" or row["E"] == "cl+l":
        row['F'] = '%s,%s' % (row['B'], row['D'])
        left_writer.writerow(row)
        coil_writer.writerow(row)
for row in sortedrights:
    if row["E"] == "r":
        row['F'] = '%s,%s' % (row['B'], row['D'])
        right_writer.writerow(row)
        coil_writer.writerow(row)

One option you have is to start with a class to hold the fields. This allows you to override the ToString method to facilitate the output. Then, it's a fairly simple matter of reading each line and assigning the values to a list of the class. In your case you'll want the extra step of making 2 lists sorting one descending and combining them:
Class Fields
    Property A As Double = 0
    Property B As Double = 0
    Property C As Double = 0
    Property D As Double = 0
    Property E As String = ""
    Public Overrides Function ToString() As String
        Return Join({A.ToString, B.ToString, C.ToString, D.ToString, E}, ",")
    End Function
End Class
Function SortedFields(filename As String) As List(Of Fields)
    SortedFields = New List(Of Fields)
    ' Holds the "r" rows, which are sorted in descending order later
    Dim test As New List(Of Fields)
    Using sr As New IO.StreamReader(filename)
        Do Until sr.EndOfStream
            Dim fieldarray() As String = sr.ReadLine.Split(","c)
            ' Skip malformed lines and the "eow" rows
            If fieldarray.Length = 5 AndAlso Not fieldarray(4)(0) = "e"c Then
                If fieldarray(4) = "r" Then
                    test.Add(New Fields With {.A = Double.Parse(fieldarray(0)), .B = Double.Parse(fieldarray(1)), .C = Double.Parse(fieldarray(2)), .D = Double.Parse(fieldarray(3)), .E = fieldarray(4)})
                Else
                    SortedFields.Add(New Fields With {.A = Double.Parse(fieldarray(0)), .B = Double.Parse(fieldarray(1)), .C = Double.Parse(fieldarray(2)), .D = Double.Parse(fieldarray(3)), .E = fieldarray(4)})
                End If
            End If
        Loop
    End Using
    ' "l" rows ascending by B, followed by "r" rows descending by B
    SortedFields = SortedFields.OrderBy(Function(x) x.B).Concat(test.OrderByDescending(Function(x) x.B)).ToList
End Function
One simple way of writing the data to a csv file is to use the IO.File.WriteAllLines method and the ConvertAll method of the List:
IO.File.WriteAllLines("coil.csv", SortedFields("textfile1.txt").ConvertAll(New Converter(Of Fields, String)(Function(x As Fields) x.ToString)))
You'll notice how the ToString method facilitates this quite easily.
If the class will only be used for this you do have the option to make all the fields string.
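For readers comparing against the original Python script, the same two-list idea (ascending "l" rows followed by descending "r" rows) can be sketched in Python as below. This is only an illustration of the logic, not a translation of the VB code; the function name `sort_coil_rows` is hypothetical, and the sample lines are taken from the question.

```python
import csv
import io


def sort_coil_rows(lines):
    """Return 'l' rows sorted ascending by column 2, then 'r' rows descending."""
    rows = [row for row in csv.reader(lines) if len(row) == 5]
    lefts = sorted((r for r in rows if r[4] == 'l'), key=lambda r: float(r[1]))
    rights = sorted((r for r in rows if r[4] == 'r'),
                    key=lambda r: float(r[1]), reverse=True)
    return lefts + rights


sample = io.StringIO(
    "42,347.142926,-2.488763,-0.19358,eow\n"
    "46,414.9969,19.932693,1.306851,r\n"
    "50,566.199838,17.656284,1.719818,r\n"
    "52,396.321911,-8.711589,1.83507,l\n"
    "51,359.069935,-11.773073,2.443772,l\n"
)
ordered = sort_coil_rows(sample)
# Writing the result is then a single call, for example:
# with open('coil.csv', 'w', newline='') as f:
#     csv.writer(f).writerows(ordered)
```

Rows whose fifth column is neither "l" nor "r" (such as "eow") are dropped, matching the output requested in the question.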

Optimize python file comparison script

I have written a script which works, but I'm guessing isn't the most efficient. What I need to do is the following:
Compare two csv files that contain user information. It's essentially a member list where one file is a more updated version of the other.
The files contain data such as ID, name, status, etc, etc
Write to a third csv file ONLY the records in the new file that either don't exist in the older file, or contain updated information. For each record, there is a unique ID that allows me to determine if a record is new or previously existed.
Here is the code I have written so far:
import csv
fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)
old = []
new = []
for row in fOld:
    old.append(row)
for row in fNew:
    new.append(row)
output = []
x = len(new)
i = 0
num = 0
while i < x:
    if new[num] not in old:
        fNewUpdate.writerow(new[num])
    num += 1
    i += 1
fileAin.close()
fileBin.close()
fileCout.close()
In terms of functionality, this script works. However I'm trying to run this on files that contain hundreds of thousands of records and it's taking hours to complete. I am guessing the problem lies with reading both files to lists and treating the entire row of data as a single string for comparison.
My question is, for what I am trying to do is this there a faster, more efficient, way to process the two files to create the third file containing only new and updated records? I don't really have a target time, just mostly wanting to understand if there are better ways in Python to process these files.
Thanks in advance for any help.
UPDATE to include sample row of data:
123456789,34,DOE,JOHN,1764756,1234 MAIN ST.,CITY,STATE,305,1,A

How about something like this? One of the biggest inefficiencies of your code is checking whether new[num] is in old on every iteration: because old is a list, each check has to scan the entire list. Using a dictionary keyed on the ID makes each lookup much, much faster.
import csv
fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)
old = {row[0]:row[1:] for row in fOld}
new = {row[0]:row[1:] for row in fNew}
fileAin.close()
fileBin.close()
output = {}
for row_id in new:
    if row_id not in old or not old[row_id] == new[row_id]:
        output[row_id] = new[row_id]
for row_id in output:
    fNewUpdate.writerow([row_id] + output[row_id])
fileCout.close()
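A variant of the same dictionary idea is sketched below for Python 3, with the new file streamed row by row instead of being loaded into a second dictionary. It is a sketch under assumptions: the function name `write_new_and_updated` is hypothetical, and the first column is taken to be the unique ID, as in the sample row.

```python
import csv


def write_new_and_updated(old_path, new_path, out_path):
    """Copy rows from new_path that are absent from, or differ in, old_path.

    Assumes the first column is the unique ID. Returns the number of rows written.
    """
    with open(old_path, newline='') as old_file:
        old = {row[0]: row[1:] for row in csv.reader(old_file) if row}
    written = 0
    with open(new_path, newline='') as new_file, \
            open(out_path, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        for row in csv.reader(new_file):
            # dict.get returns None for unknown IDs, which never equals a list.
            if row and old.get(row[0]) != row[1:]:
                writer.writerow(row)
                written += 1
    return written
```

Only the old file is held in memory, and rows are written in the order they appear in the new file.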

difflib is quite efficient: http://docs.python.org/library/difflib.html
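A brief sketch of what the difflib route might look like, assuming both files fit in memory and list their records in a comparable order. The sample lines are made up for illustration.

```python
import difflib

old_lines = ["1,DOE,JOHN,A\n", "2,ROE,JANE,A\n", "3,POE,EDGAR,I\n"]
new_lines = ["1,DOE,JOHN,A\n", "2,ROE,JANE,I\n", "3,POE,EDGAR,I\n", "4,LEE,ANN,A\n"]

# n=0 suppresses context lines, so only changed lines are reported.
diff = list(difflib.unified_diff(old_lines, new_lines,
                                 fromfile='old.csv', tofile='new.csv', n=0))

# Skip the two file-header lines; remaining '+' lines are new or updated records.
added_or_changed = [line[1:] for line in diff[2:] if line.startswith('+')]
```

Note that difflib compares by position rather than by ID, so it fits best when the two files are already in the same order.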

Sort the data by your unique field(s), and then use a comparison process analogous to the merge step of merge sort:
http://en.wikipedia.org/wiki/Merge_sort
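The merge-style comparison could look roughly like the sketch below. It assumes both inputs are already sorted by the ID in the first column, using the same ordering that Python uses to compare the ID values (plain string comparison here). The function name `new_and_updated` and the sample rows are hypothetical.

```python
def new_and_updated(old_rows, new_rows):
    """Yield rows of new_rows that are missing from or differ in old_rows.

    Both inputs must be iterables of lists sorted by their first column (the ID).
    """
    old_iter = iter(old_rows)
    old = next(old_iter, None)
    for new in new_rows:
        # Skip old records whose ID sorts before the current new ID.
        while old is not None and old[0] < new[0]:
            old = next(old_iter, None)
        # Either there is no old record with this ID, or its fields differ.
        if old != new:
            yield new


old_rows = [['101', 'DOE', 'A'], ['102', 'ROE', 'A'], ['104', 'POE', 'I']]
new_rows = [['101', 'DOE', 'A'], ['102', 'ROE', 'I'], ['103', 'LEE', 'A']]
changed = list(new_and_updated(old_rows, new_rows))
```

Because each file is read once from start to finish, this needs only one record from each file in memory at a time.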

How to Compare 2 very large matrices using Python

I have an interesting problem.
I have a very large (larger than 300MB, more than 10,000,000 lines/rows in the file) CSV file with time series data points inside. Every month I get a new CSV file that is almost the same as the previous file, except for a few new lines have been added and/or removed and perhaps a couple of lines have been modified.
I want to use Python to compare the 2 files and identify which lines have been added, removed and modified.
The issue is that the file is very large, so I need a solution that can handle the large file size and execute efficiently within a reasonable time, the faster the better.
Example of what a file and its new file might look like:
Old file
A,2008-01-01,23
A,2008-02-01,45
B,2008-01-01,56
B,2008-02-01,60
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,9
etc...
New file
A,2008-01-01,23
A,2008-02-01,45
A,2008-03-01,67 (added)
B,2008-01-01,56
B,2008-03-01,33 (removed and added)
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,22 (modified)
etc...
Basically the 2 files can be seen as matrices that need to be compared, and I have begun thinking of using PyTable. Any ideas on how to solve this problem would be greatly appreciated.

Like this.
Step 1. Sort.
Step 2. Read each file, doing line-by-line comparison. Write differences to another file.
You can easily write this yourself. Or you can use difflib. http://docs.python.org/library/difflib.html
Note that the general solution is quite slow as it searches for matching lines near a difference. Writing your own solution can run faster because you know things about how the files are supposed to match. You can optimize that "resynch-after-a-diff" algorithm.
And 10,000,000 lines hardly matters. It's not that big. Two 300 MB files easily fit into memory.

This is a little bit of a naive implementation but will deal with unsorted data:
import csv
file1_dict = {}
file2_dict = {}
with open('file1.csv', newline='') as handle:
    for row in csv.reader(handle):
        # store the values as tuples so they can be concatenated with the key
        file1_dict[tuple(row[:2])] = tuple(row[2:])
with open('file2.csv', newline='') as handle:
    for row in csv.reader(handle):
        file2_dict[tuple(row[:2])] = tuple(row[2:])
with open('outfile.csv', 'w', newline='') as handle:
    writer = csv.writer(handle)
    for key, val in file1_dict.items():
        if key in file2_dict:
            # deal with keys that are in both
            if file2_dict[key] == val:
                writer.writerow(key+val+('Same',))
            else:
                writer.writerow(key+file2_dict[key]+('Modified',))
            # whatever is left in file2_dict afterwards was added
            file2_dict.pop(key)
        else:
            writer.writerow(key+val+('Removed',))
    # deal with added keys!
    for key, val in file2_dict.items():
        writer.writerow(key+val+('Added',))
You probably won't be able to "drop in" this solution, but it should get you ~95% of the way there. @S.Lott is right: two 300 MB files will easily fit in memory. If your files get into the 1-2 GB range, this may have to be modified to rely on the data being sorted.
Something like this is close, although you may have to change the comparisons around for the added and modified cases to make sense:
# assuming both files are sorted by columns 1 and 2
import csv
import datetime

def str2date(text):
    return datetime.date(*map(int, text.split('-')))

def convert_tups(row):
    key = (row[0], str2date(row[1]))
    val = tuple(row[2:])
    return key, val

with open('file1.csv', newline='') as handle1, \
        open('file2.csv', newline='') as handle2, \
        open('outfile.csv', 'w', newline='') as outhandle:
    writer = csv.writer(outhandle)
    gen1 = map(convert_tups, csv.reader(handle1))
    gen2 = map(convert_tups, csv.reader(handle2))
    # next(..., None) returns None once a file is exhausted
    item1 = next(gen1, None)
    item2 = next(gen2, None)
    while item1 is not None or item2 is not None:
        if item2 is None or (item1 is not None and item1[0] < item2[0]):
            # key only in the old file
            writer.writerow(item1[0]+item1[1]+('Removed',))
            item1 = next(gen1, None)
        elif item1 is None or item1[0] > item2[0]:
            # key only in the new file
            writer.writerow(item2[0]+item2[1]+('Added',))
            item2 = next(gen2, None)
        else:
            # same key in both files: compare the values
            status = 'Same' if item1[1] == item2[1] else 'Modified'
            writer.writerow(item2[0]+item2[1]+(status,))
            item1 = next(gen1, None)
            item2 = next(gen2, None)
