Comparing content in two csv files - python

So I have two CSV files. Book1.csv has more data than similarities.csv, so I want to pull out the rows in Book1.csv that do not occur in similarities.csv. Here's what I have so far:
with open('Book1.csv', 'rb') as csvMasterForDiff:
    with open('similarities.csv', 'rb') as csvSlaveForDiff:
        masterReaderDiff = csv.reader(csvMasterForDiff)
        slaveReaderDiff = csv.reader(csvSlaveForDiff)
        testNotInCount = 0
        testInCount = 0
        for row in masterReaderDiff:
            if row not in slaveReaderDiff:
                testNotInCount = testNotInCount + 1
            else:
                testInCount = testInCount + 1
        print('Not in file: ' + str(testNotInCount))
        print('Exists in file: ' + str(testInCount))
However, the results are
Not in file: 2093
Exists in file: 0
I know this is incorrect because, while at least the first 16 entries in Book1.csv do not exist in similarities.csv, certainly not all of them are missing. What am I doing wrong?

A csv.reader object is an iterator, which means you can only iterate through it once: the very first containment check consumes the whole slave reader, so every later row reports as "not in file". You should be using lists/sets for containment checking. Note that csv.reader yields lists, which are unhashable, so convert each row to a tuple before putting it in a set, e.g.:
slave_rows = set(map(tuple, slaveReaderDiff))  # read the slave file once; tuples are hashable
for row in masterReaderDiff:
    if tuple(row) not in slave_rows:
        testNotInCount += 1
    else:
        testInCount += 1
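To see the one-pass behaviour in isolation, here is a minimal sketch (the file name is just a placeholder):
import csv

with open('similarities.csv') as f:
    reader = csv.reader(f)
    first_pass = list(reader)   # consumes every row from the iterator
    second_pass = list(reader)  # the reader is now exhausted
    print(len(first_pass))      # number of rows in the file
    print(len(second_pass))     # 0 -- nothing left to read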

After converting the rows into sets, you can do a lot of helpful set operations without writing much code (again converting each row to a tuple so it is hashable):
slave_rows = set(map(tuple, slaveReaderDiff))
master_rows = set(map(tuple, masterReaderDiff))
master_minus_slave_rows = master_rows - slave_rows
common_rows = master_rows & slave_rows
print('Not in file: ' + str(len(master_minus_slave_rows)))
print('Exists in file: ' + str(len(common_rows)))
Here are various set operations that you can do.
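For example, a quick sketch of the most common ones on two small sets of row tuples (the data here is made up purely for illustration):
rows_a = {('x', '1'), ('y', '2'), ('z', '3')}
rows_b = {('y', '2'), ('z', '3'), ('w', '4')}
print(rows_a - rows_b)  # difference: {('x', '1')}
print(rows_a & rows_b)  # intersection: {('y', '2'), ('z', '3')}
print(rows_a | rows_b)  # union: all four distinct rows
print(rows_a ^ rows_b)  # symmetric difference: rows in exactly one set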

Related

How to count the changes done in new csv file compared to the previous

We have two csv files - new.csv and old.csv.
old.csv contains four rows:
abc done
xyz done
pqr done
rst pending
The new.csv contains four new rows:
abc pending
xyz not_done
pqr pending
rst done
I need to count two things, without using pandas:
count1 = number of entries changed from done to pending = 2 (abc, pqr)
count2 = number of entries changed from done to not_done = 1 (xyz)
CASE 1: CSV Files are in the same order
Firstly import the two files into python lists:
oldcsv = []
with open("old.csv") as f:
    for line in f:
        oldcsv.append(line.strip().split(","))
newcsv = []
with open("new.csv") as f:
    for line in f:
        newcsv.append(line.strip().split(","))
Now you would simply iterate through both lists simultaneously, using zip(). I am assuming that both CSV files list the entries in the same order.
count1 = 0
count2 = 0
for oldentry, newentry in zip(oldcsv, newcsv):
    assert(oldentry[0] == newentry[0])  # Throw error if entry names do not match
    if oldentry[1] == "done":
        if newentry[1] == "pending":
            count1 += 1
        elif newentry[1] == "not_done":
            count2 += 1
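With the sample data above, this gives the expected counts:
print(count1)  # 2 (abc and pqr changed from done to pending)
print(count2)  # 1 (xyz changed from done to not_done)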
CASE 2: CSV Files are in arbitrary order
Here, since you are going to need to look up entries by their names, I would use a dictionary rather than a list to store the old.csv data, mapping the entry names to their values:
# Load old.csv data into a dictionary mapping entry_name: entry_value
old_values = {}
with open("old.csv") as f:
    for line in f:
        old_entry = line.strip().split(",")
        entry_name, old_entry_value = old_entry[0], old_entry[1]
        old_values[entry_name] = old_entry_value
count1 = 0
count2 = 0
with open("new.csv") as f:
    for line in f:
        # For each entry in new.csv, look up the corresponding old entry in old_values, and compare their values.
        new_entry = line.strip().split(",")
        entry_name, new_entry_value = new_entry[0], new_entry[1]
        old_entry_value = old_values.get(entry_name)  # Get the old value for this entry (None if there is no old entry)
        # Essentially same code as before:
        print(f"{entry_name}: old entry status is {old_entry_value} and new entry status is {new_entry_value}")
        if old_entry_value == "done":
            if new_entry_value == "pending":
                print("Incrementing count1")
                count1 += 1
            elif new_entry_value == "not_done":
                print("Incrementing count2")
                count2 += 1
print(count1)
print(count2)
This should work, as long as the input data is properly formatted. I am assuming each .csv file has one entry per line, and each line begins with the entry name (e.g. "abc"), then a comma, then the entry value (e.g. "done","not_done").
Here is a straightforward, pure-Python implementation:
import csv
with open("old.csv") as old_fl:
    with open("new.csv") as new_fl:
        old = csv.reader(old_fl)
        new = csv.reader(new_fl)
        old_rows = [row for row in old]
        new_rows = [row for row in new]
# see if this is really needed
assert len(old_rows) == len(new_rows)
n = len(old_rows)
# assume that left key is identical,
# and in the same order in both files
assert all(old_rows[i][0] == new_rows[i][0] for i in range(n))
# once the data is guaranteed to align,
# just count what you want
done_to_pending = [
    f"row[{i}]( {old_rows[i][0]} )"
    for i in range(n)
    if old_rows[i][1] == "done" and new_rows[i][1] == "pending"
]
done_to_notdone = [
    f"row[{i}]( {old_rows[i][0]} )"
    for i in range(n)
    if old_rows[i][1] == "done" and new_rows[i][1] == "not_done"
]
It uses Python's native csv reader, so you don't need to parse the CSV yourself. Note that there are various assumptions (assert statements) throughout the code; you may need to adjust it to handle more cases.
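The two lists hold one label per changed row, so the requested counts are simply their lengths:
count1 = len(done_to_pending)  # entries changed from done to pending
count2 = len(done_to_notdone)  # entries changed from done to not_done
print(count1, count2)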

Python pandas verification block text with rule

I am trying to do text verification. I have to verify whether a candidate text block belongs to content or non-content.
The input for this program is a CSV file.
The candidate column shows the sequence number of each candidate text block.
So lines 82-87 form one text block, lines 111-116 another, lines 1552-1553 another, and so on. I want to check each candidate text block, and if a candidate fulfils one of the rules it will be used as the output.
The rules for verifying a candidate text block are:
The candidate must contain h1, the number in the TC column must be > 0, and the LTC column must be < 0.
The number of TC in the text block must be more than the TC threshold.
The number of TC in a text block means the sum of TC over the block; for example, in candidate 0 the number of TC is 0+1+5+7+4+0 = 17.
The TC threshold is 30.
If a candidate fulfils one of those rules, it will be used as the output.
Then I just want to present the Words column from the text block as the output and save it in a txt file.
So based on the rules, the output will be candidates number 0 and 5.
My expected output looks like:
UPDATE MY PROGRAM
import pandas as pd
from listTV import get_filepaths_tv
filenames = get_filepaths_tv(r"C:\Users\firlyarmanda\PycharmProjects\EkstraksiBerita\TC_0.1.5")
index = 0
for f in filenames:
    file_html = open(str(f), "r")
    dataf = pd.read_csv(file_html)
    df = dataf.dropna()  # drop rows containing NaN
    candidate_groups = df.groupby('candidate')
    #f = open('textfile.txt', 'w')
    for _, group_df in candidate_groups:
        if group_df['TC'].sum() > 40 or (group_df['TAG'] == "['h1']").any() and (group_df['LTC'] == 0).all():
            a = '\n'.join(group_df['Words'].astype(str)) + '\n'
            #f.write('\n'.join(group_df['Words'].astype(str)) + '\n')
            #f.close()
            index += 1
            stored_file = "textverification/" + '{0:03}'.format(index) + ".txt"
            filewrite = open(stored_file, "w")
            filewrite.write(a)
            filewrite.close()
But I got the output in separate files. I want to join all the output and save it to one text file.
It's not clear what your rules are precisely, but you can use groupby's filter method. First, define a function that checks if a group satisfies the conditions:
def rules(group):
    return (group['HTML'].str.contains('<h1>').any() and
            group['TC'].sum() > 0 and
            group['LTC'].sum() <= 0)
Then filter the dataframe:
result = df.groupby('candidate').filter(rules)
Lastly it's not clear how you want to print the text of selected candidates, but you can get the text of each candidate like this:
result.groupby('candidate')['Words'].apply(lambda w: '\n'.join(w))
This will join all the words in the 'Words' column by the newline character '\n'.
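From there, getting everything into a single txt file is one more step; a minimal sketch (the output file name is arbitrary):
joined = result.groupby('candidate')['Words'].apply(lambda w: '\n'.join(w.astype(str)))
with open('textfile.txt', 'w') as f:
    f.write('\n'.join(joined) + '\n')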
Edit: After discussion, here is what worked for the asker (which includes some code provided in the other answer by user3712352).
candidate_groups = df.groupby('candidate')
f = open('textfile.txt', 'w')
for _, group_df in candidate_groups:
    if group_df['TC'].sum() > 30 and (group_df['TAG'] == "['h1']").any() and (group_df['LTC'] == 0).all():
        f.write('\n'.join(group_df['Words'].astype(str)) + '\n')
f.close()
After loading the csv:
import pandas as pd
df = pd.read_csv(INPUT_FILE)
A good start for this task would be grouping candidates' rows:
candidate_groups = df.groupby('candidate')
Then you can iterate over candidates and test the requirements:
def print_x(x):
    print(x)

for _, group_df in candidate_groups:
    if group_df['TC'].sum() > 30:  # 30 is the threshold
        if group_df[(group_df['TAG'] == "['h1']") & (group_df['LTC'] < 0) & (group_df['TC'] > 0)].shape[0] > 0:
            group_df['Words'].apply(print_x)  # print each word

"\r\n" is ignored at csv file end

import csv
impFileName = []
impFileName.append("file_1.csv")
impFileName.append("file_2.csv")
expFileName = "MasterFile.csv"
l = []
overWrite = False
comma = ","
for f in range(len(impFileName)):
    with open(impFileName[f], "r") as impFile:
        table = csv.reader(impFile, delimiter=comma)
        for row in table:
            data_1 = row[0]
            data_2 = row[1]
            data_3 = row[2]
            data_4 = row[3]
            data_5 = row[4]
            data_6 = row[5]
            dic = {"one": data_1, "two": data_2, "three": data_3, "four": data_4, "five": data_5, "six": data_6}
            for i in range(len(l)):
                if l[i]["one"] == data_1:
                    print("Data, where one = " + data_1 + " has been updated using the data from " + impFileName[f])
                    l[i] = dic
                    overWrite = True
                    break
            if overWrite == False:
                l.append(dic)
            else:
                overWrite = False
    print(impFileName[f] + " has been added to the list 'l'")
with open(expFileName, "a") as expFile:
    print("Master file now being created...")
    for i in range(len(l)):
        expFile.write(l[i]["one"] + comma + l[i]["two"] + comma + l[i]["three"] + comma + l[i]["four"] + comma + l[i]["five"] + comma + l[i]["six"] + "\r\n")
print("Process Complete")
This program takes 2 (or more) .csv files and compares the uniqueID (data_1) of each row to all others. If they match, it then assumes that the current row is an updated version so overwrites it. If there is no match then it's a new entry.
I store each row's data in a dictionary, which is then stored in the list "l".
Once all the files have been processed, I output the list "l" to the "MasterFile.csv" in the specified format.
---THE PROBLEM---
The last row of "File_1.csv" and the first row of "File_2.csv" end up on the same line in the output file. I would like it to continue on a new line.
--Visual
...
data_1,data_2,data_3,data_4,data_5,data_6
data_1,data_2,data_3,data_4,data_5,data_6DATA_1,DATA_2,DATA_3,DATA_4,DATA_5,DATA_6
DATA_1,DATA_2,DATA_3,DATA_4,DATA_5,DATA_6
...
NOTE: There are no header rows in any of the .csv files.
I've also tried this using only "\n" at the end of the "expFile.write" - Same result
Just a little suggestion: comparing two files your way looks too expensive. Try using pandas in the following way.
import pandas
data1 = pandas.read_csv("file_1.csv")
data2 = pandas.read_csv("file_2.csv")
# Merging Two Dataframes
combinedData = data1.append(data2,ignore_index=True)
# Dropping Duplicates
# give the name of the column on which you are comparing the uniqueness
uniqueData = combinedData.drop_duplicates(["columnName"])
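Two caveats worth noting for this particular question: the input files have no header rows, so header=None is needed (columns are then numbered, making the uniqueID column 0), and drop_duplicates keeps the first occurrence by default, so keep="last" makes the later file's row win, matching the "updated version overwrites" behaviour. A minimal sketch with those adjustments:
import pandas

# header=None: the input files have no header rows, so columns are numbered 0..5
data1 = pandas.read_csv("file_1.csv", header=None)
data2 = pandas.read_csv("file_2.csv", header=None)
# pandas.concat replaces DataFrame.append, which newer pandas versions removed
combinedData = pandas.concat([data1, data2], ignore_index=True)
# column 0 holds the uniqueID; keep="last" so the newer file's row wins
uniqueData = combinedData.drop_duplicates([0], keep="last")
uniqueData.to_csv("MasterFile.csv", index=False, header=False)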
I tried running your program and it is OK. Your only problem is in the line
with open(expFileName, "a") as expFile:
where you use "a" (as append), so if you run your program again and again, it will append to this file.
Use "w" instead of "a".
A'ight guys. I think I made a booboo.
1) Because I was using "a" (append) instead of "w" (write) at the end, and for my last 2 or 3 tests I'd forgotten to clear the file, I was always looking at the same (top 50 or so) rows. Which meant I'd fixed my bug ages ago but was still looking at the old data....
2) Carriage returns were being read into the last value of the dictionary (data_6), so when they were appended to the Master file I ended up with "\r\r\n" at the end (see the sketch below).
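A minimal sketch of that carriage-return fix, using the row variables from the question:
# inside the "for row in table:" loop
data_6 = row[5].strip()  # removes the trailing "\r" that was read from the file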
Thanks Vivek Srinivasan for expanding my python knowledge. I will look at pandas and have a play.
Thanks to MarianD for pointing out the "a"/"w" error.
Thanks to Moses Koledoye for pointing out the "\r" error.
Sorry for wasting your time.

Extracting certain columns from multiple files simultaneously by Python

My purpose is to extract one particular column from multiple data files.
So I tried to use the glob module to read the files, and tried to extract one column from each file with for statements like below:
import glob
import numpy as np

filin = diri + '*_7.txt'
FileList = sorted(glob.glob(filin))
for INPUT in FileList:
    a = []
    b = []
    c = []
    T = []
    f = open(INPUT, 'r')
    f.seek(0, 0)
    for columns in (raw.strip().split() for raw in f):
        b.append(columns[11])
    t = np.array(b, float)
    print t
    t = list(t)
    T = T + [t]
    f.close()
print T
The number of data files I used is 32, so I expected the second 'for' statement to run only 32 times, generating only 32 arrays of t. However, the result doesn't look like what I expected.
I assume it may be due to the influence of the first 'for' statement, but I am not sure.
Any idea or help would be really appreciated.
Thank you,
Isaac
You reset T = [] for every file, so by the end T only holds the last file's data. Move the T = [] line above the first loop.
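A minimal corrected sketch of the loop, keeping the original names and Python 2 style:
T = []  # initialise once, so results accumulate across all 32 files
for INPUT in FileList:
    b = []
    f = open(INPUT, 'r')
    for columns in (raw.strip().split() for raw in f):
        b.append(columns[11])
    f.close()
    T = T + [list(np.array(b, float))]
print T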

How do I cycle through a csv in python, writing lines to a new file that meet new criteria

I've been at this a while now, and I think it's in my best interest to ask the experts for advice. I know I'm not writing this the best way possible, and I've gone down a rabbit hole and confused myself.
I have a csv. A bunch, actually. That part is not the problem.
The lines at the top of the CSV are not really CSV data, but they do contain an important piece of info: the date for which the data is valid. For certain kinds of report it is on one line, and on others another.
My data starts on some line down from the top, usually 10 or 11, but I can't always be certain. I do know that the first column always has the same info (the header of the table of data).
I want to pull the report date from the preceding lines, and for file type A do stuffA, and for file type B do stuffB, then write out that row to a new file. I'm having a problem incrementing the row and I have no idea what I'm doing wrong.
Sample data:
"Attribute ""OPSURVEYLEVEL2_O"" [Category = ""Retail v1""]"
Date exported: 2/16/13
Exported by user: William
Project:
Classification: Online Retail v1
Report type: Attributes
Date range: from 12/14/12 to 12/14/12
"Filter OpSurvey Level 2(mine): [ LEVEL:SENTENCE TYPE:KEYWORD {OPSURVEYLEVEL2_O:""gift certificate redemption"", OPSURVEYLEVEL2_O:""combine accounts"", OPSURVEYLEVEL2_O:""cancel account"", OPSURVEYLEVEL2_O:""saved project moved to purchased project"", OPSURVEYLEVEL2_O:""unlock account"", OPSURVEYLEVEL2_O:""affiliate promotions"", OPSURVEYLEVEL2_O:""print to store coupons"", OPSURVEYLEVEL2_O:""disclaimer not clear"", OPSURVEYLEVEL2_O:""prepaid issue"", OPSURVEYLEVEL2_O:""customer wants to use coupons for print to store"", OPSURVEYLEVEL2_O:""customer received someone else's order"", OPSURVEYLEVEL2_O:""hi-res images unavailable"", OPSURVEYLEVEL2_O:""how to re-order"", OPSURVEYLEVEL2_O:""missing items"", OPSURVEYLEVEL2_O:""missing envelopes: print to store"", OPSURVEYLEVEL2_O:""missing envelopes: mail order"", OPSURVEYLEVEL2_O:""group rooms"", OPSURVEYLEVEL2_O:""print to store"", OPSURVEYLEVEL2_O:""print to store coupons"", OPSURVEYLEVEL2_O:""publisher: card not available for print to store"", OPSURVEYLEVEL2_O:publisher}]"
Total: 905
OPSURVEYLEVEL2_O,Distinct Document,% of Document,Sentiment Score
PRINT TO STORE,297,32.82,-0.1
...
Sample Code
#!/usr/bin/python
import csv, os, glob, sys, errno
path = '/path/to/Downloads'
for infile in glob.glob(os.path.join(path, 'report_ATTRIBUTE_OP*.csv')):
    if 'OPSURVEYLEVEL2' in infile:
        prime_column = 'ops2'
    elif 'OPSURVEYLEVEL3' in infile:
        prime_column = 'ops3'
    else:
        sys.exit(errno.ENOENT)
    with open(infile, "r") as csvfile:
        reader = csv.reader(csvfile)
        report_date = 'DATE NOT FOUND'
        # import pdb; pdb.set_trace()
        for row in reader:
            foo = 0
            while foo < 1:
                if row[0][0:].find('OPSURVEYLEVEL') == 0:
                    foo = 1
                if "Date range" in row:
                    report_date = row[0][-8:]
                break
            if foo >= 1:
                if row[0][0:].find('OPSURVEYLEVEL') == 0:
                    break
                if 'ops2' in prime_column:
                    dup_col = row[0]
                    row.insert(0, dup_col)
                    row.append(report_date)
                elif 'ops3' in prime_column:
                    row.append(report_date)
                with open('report_merge.csv', 'a') as outfile:
                    outfile.write(row)
                reader.next()
There are two problems that I can see in this code.
The first is that the code won't find the date range in row. The line:
if "Date range" in row:
... should be:
if "Date range" in row[0]:
The second is that the code:
if row[0][0:].find('OPSURVEYLEVEL') == 0:
break
... is breaking out of the for loop after the header line of the data table, because that is the closest enclosing loop. I suspect that there was another while in there somewhere in a previous version of this code.
The code is simpler (and bug-free) with an if statement instead of the while and if, as follows:
foo = 0  # initialise the flag once, before the loop
for row in reader:
    if foo < 1:
        if row[0][0:].find('OPSURVEYLEVEL') == 0:
            foo = 1
        if "Date range" in row[0]:  # Changed this line
            print("found report date")
            report_date = row[0][-8:]
    else:
        print(row)
        if row[0][0:].find('OPSURVEYLEVEL') == 0:
            break
        if 'ops2' in prime_column:
            dup_col = row[0]
            row.insert(0, dup_col)
            row.append(report_date)
        elif 'ops3' in prime_column:
            row.append(report_date)
        with open('report_merge.csv', 'a') as outfile:
            outfile.write(','.join(row) + '\n')
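As a further, optional tidy-up (not part of the fix above): opening the output file once and letting csv.writer handle quoting avoids reopening report_merge.csv for every row. A sketch with a hypothetical list of already-processed rows:
import csv

prepared_rows = [['OPSURVEYLEVEL2_O', '297', '12/14/12']]  # hypothetical rows built by the loop
with open('report_merge.csv', 'a') as outfile:
    writer = csv.writer(outfile)
    for row in prepared_rows:
        writer.writerow(row)  # quotes fields containing commas automatically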
