Find matches in two csv files - python

I have two CSV files: the first has one column called ID and 5 rows, and the second has 12 columns (column 5 is called ID) and 100 rows. I am trying to find the matching IDs and write each matching row in full to a new CSV file.
Thank you for your help!
Here is my code:
import csv
input_file1 = "/Desktop/New1/file1.csv"
input_file2 = "/Desktop/New1/file2.csv"
output_file = "/Desktop/New1/results.csv"
with open(input_file1) as t1, open(input_file2) as t2:
    fileone = csv.reader(t1)
    filetwo = csv.reader(t2)
    with open(output_file, 'w') as output_res:
        for line in filetwo:
            if line in fileone:
                output_res.write(line)

You can read the IDs in file1 into a set for more efficient lookup. You should also use csv.writer to output the rows as CSV:
with open(input_file1) as t1, open(input_file2) as t2:
    ids = set(id for id, in csv.reader(t1))
    filetwo = csv.reader(t2)
    with open(output_file, 'w', newline='') as output_res:
        writer = csv.writer(output_res)
        for row in filetwo:
            if row[4] in ids:
                writer.writerow(row)
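If the position of the ID column might change, the same idea can be written against column names instead of the index 4. This is only a minimal sketch, not part of the answer above: the function name filter_rows_by_id and the in-memory sample data are made up for illustration, and it assumes both files have a header row with a column named ID.

```python
import csv
import io


def filter_rows_by_id(ids_file, data_file, out_file, key="ID"):
    """Copy the rows of data_file whose `key` value appears in ids_file."""
    # Collect the wanted IDs into a set for fast membership tests.
    wanted = {row[key] for row in csv.DictReader(ids_file)}

    reader = csv.DictReader(data_file)
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row[key] in wanted:
            writer.writerow(row)


# Hypothetical sample data standing in for file1.csv and file2.csv.
ids_csv = io.StringIO("ID\n7\n9\n")
data_csv = io.StringIO("name,ID\nalice,7\nbob,8\ncarol,9\n")
result = io.StringIO()
filter_rows_by_id(ids_csv, data_csv, result)
print(result.getvalue())
```

With real files, the three arguments would be file objects opened with newline='' as the csv documentation recommends.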

Related

CSV diff with headers

I am taking two csvs and writing the difference to a third csv file. Here's my function.
def diff_csv(NewFile, OldFile):
    with open(OldFile, 'r') as t1, open(NewFile, 'r') as t2:
        fileone = t1.readlines()
        filetwo = t2.readlines()
    with open('result.csv', 'w') as outFile:
        for line in filetwo:
            if line not in fileone:
                outFile.write(line)
However, since the two input CSVs have the same set of headers, the headers don't show up in the output CSV. How do I preserve the headers here?
You could use csv.DictReader and csv.DictWriter. Then unconditionally write the header row, but for the other rows first test whether they exist in old_file. For this test, in the example below, I am making a set of rows from old_file to test inclusion, and representing the rows using a tuple of cell values because this is a hashable object that can be stored in a set.
import csv
def diff_csv(new_file, old_file, out_file='result.csv'):
    with open(old_file) as t1, open(new_file) as t2:
        c2 = csv.DictReader(t2)
        c1 = csv.DictReader(t1)
        assert c2.fieldnames == c1.fieldnames
        # tuples are hashable, so they can be stored in a set
        row_tuples = set()
        for row in c1:
            row_tuples.add(tuple(row.values()))
        with open(out_file, "w", newline="") as fout:
            cout = csv.DictWriter(fout, c2.fieldnames)
            cout.writeheader()  # always write the header row
            for row in c2:
                row_tuple = tuple(row.values())
                if row_tuple not in row_tuples:
                    cout.writerow(row)
if __name__ == '__main__':
    diff_csv("b.csv", "a.csv")
with these input files:
a.csv
blah,stuff
a,4
b,6
and b.csv
blah,stuff
x,4
b,6
d,3
gives result.csv
blah,stuff
x,4
d,3
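The same header-preserving diff can also be done with plain csv.reader and csv.writer, which avoids building a dict per row. The following is a rough sketch under the assumption that both inputs start with an identical header row; the name diff_rows is made up, and the in-memory strings simply mirror a.csv and b.csv from the example above.

```python
import csv
import io


def diff_rows(new_file, old_file, out_file):
    """Write the header plus every row of new_file that is absent from old_file."""
    old_reader = csv.reader(old_file)
    new_reader = csv.reader(new_file)

    old_header = next(old_reader, None)
    new_header = next(new_reader, None)
    if old_header != new_header:
        raise ValueError("the two files do not share the same header")

    # Lists are not hashable, so each old row is stored as a tuple.
    old_rows = {tuple(row) for row in old_reader}

    writer = csv.writer(out_file)
    if new_header is not None:
        writer.writerow(new_header)  # the header is written unconditionally
    for row in new_reader:
        if tuple(row) not in old_rows:
            writer.writerow(row)


old = io.StringIO("blah,stuff\na,4\nb,6\n")
new = io.StringIO("blah,stuff\nx,4\nb,6\nd,3\n")
out = io.StringIO()
diff_rows(new, old, out)
print(out.getvalue())
```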

How to count number of columns in each row?

Each row has a different number of columns, but column A is always the file name and the remaining columns are fields of that file.
Is there any way to count the number of columns in each row?
import csv
file=('C:/)
with open('C:/Count.csv','w',encoding='cp949',newline='') as testfile:
    csv_writer=csv.writer(testfile)
    for line in file:
        lst=[len(line)]
        csv_writer.writerow(lst)
You can either choose to split on commas or open the file with csv.
I'd recommend the latter. Here's how you can do that:
import csv
file1 = ... # file to read
file2 = ... # file to write
with open(file1, 'r') as f1, open(file2, 'w', encoding='cp949', newline='') as f2:
    csv_reader = csv.reader(f1)
    csv_writer = csv.writer(f2)
    for row in csv_reader:
        csv_writer.writerow([len([x for x in row if x])]) # count non-empty cells only
Open both files at once, iterate over the rows of the input file, count the non-empty cells in each row (use len(row) instead if empty cells should count too), and write the count out.
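The difference between counting every cell and counting only non-empty cells is easy to miss, so here is a small sketch that reports both per row. The helper name column_counts and the sample rows are invented for this illustration.

```python
import csv
import io


def column_counts(csv_file):
    """Yield (total_cells, non_empty_cells) for each row of csv_file."""
    for row in csv.reader(csv_file):
        total = len(row)
        non_empty = sum(1 for cell in row if cell)
        yield total, non_empty


# Hypothetical input: a file name followed by a varying number of fields.
sample = io.StringIO("a.txt,x,y\nb.txt,x,,\nc.txt\n")
counts = list(column_counts(sample))
print(counts)
```

The second sample row ends in two empty cells, so its two counts differ.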

How to search for a 'text' or 'number' in a csv file with Python AND if exists print only first and second column values to a new csv file

I want to do the following using Python.
Step-1: Read a specific third column on a csv file using Python.
Step-2: Create a list with values got from step-1
Step-3: Take the value at index 0 and search for it in the csv file; if it is present, write only the values of columns 1 and 2 to a new csv file (there are 6 columns). If it is not present, ignore it and go on to the next value.
file1.csv:
Country,Location,number,letter,name,pup-name,null
a,ab,1,qw,abcd,test1,3
b,cd,1,df,efgh,test2,4
c,ef,2,er,fgh,test3,5
d,gh,3,sd,sds,test4,
e,ij,5,we,sdrt,test5,
f,kl,6,sc,asdf,test6,
g,mn,7,df,xcxc,test7,
h,op,8,gb,eretet,test8,
i,qr,8,df,hjjh,test9,
Python script written for this:
import csv
import time
from collections import defaultdict
columns = defaultdict(list)
with open('file1.csv') as f:
    reader = csv.reader(f)
    reader.next()
    for row in reader:
        for (i,v) in enumerate(row):
            columns[i].append(v)
#print(columns[2])
b=(columns[2])
for x in b[:]:
    time.sleep(1)
    print x
Output of above script:
MacBook-Pro:test_usr$ python csv_file.py
1
1
2
3
5
6
7
8
8
MacBook-Pro:test_usr$
I am able to do the steps 1 and 2.
Please guide me on doing Step-3. That is how to search for text/string in csv file and if present how to extract only specific column values to a new csv file?
Output file should look like:
a,ab
b,cd
c,ef
d,gh
e,ij
f,kl
g,mn
h,op
i,qr
Note : Search string will be from another csv file. Please don't suggest the direct answer for printing values of column 1 and 2 directly.
The final code looks like this:
import csv
import time
from collections import defaultdict
columns = defaultdict(list)
with open('file1.csv') as f:
    reader = csv.reader(f)
    reader.next()
    for row in reader:
        for (i,v) in enumerate(row):
            columns[i].append(v)
b=(columns[2])
for x in b[:]:
    with open('file2.csv') as f, open('file3.csv', 'a') as g:
        reader = csv.reader(f)
        #next(reader, None) # discard the header
        writer = csv.writer(g)
        for row in reader:
            if row[2] == x:
                writer.writerow(row[:2])
file1.csv:
Country,Location,number,letter,name,pup-name,null
a,ab,1,qw,abcd,test1,3
b,cd,1,df,efgh,test2,4
c,ef,2,er,fgh,test3,5
d,gh,3,sd,sds,test4,
e,ij,5,we,sdrt,test5,
f,kl,6,sc,asdf,test6,
g,mn,7,df,xcxc,test7,
h,op,8,gb,eretet,test8,
i,qr,8,df,hjjh,test9,
file2.csv:
count,name,number,Type,status,Config Version,,IP1,port
1,bob,1,TRAFFIC,end,1.2,,1.1.1.1,1
2,john,1,TRAFFIC,end,2.1,,1.1.1.2,2
4,foo,2,TRAFFIC,end,1.1,,1.1.1.3,3
5.333333333,test,3,TRAFFIC,end,3.1,,1.1.1.4,4
6.833333333,raa,5,TRAFFIC,end,5.1,,1.1.1.5,5
8.333333333,kaa,6,TRAFFIC,end,7.1,,1.1.1.6,6
9.833333333,thaa,7,TRAFFIC,end,9.1,,1.1.1.7,7
11.33333333,paa,8,TRAFFIC,end,11.1,,1.1.1.8,8
12.83333333,maa,8,TRAFFIC,end,13.1,,1.1.1.9,9
If I run the above script, output of file3.csv:
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
1,bob
2,john
.
.
.
It goes on like this in a loop.
But output should be like this:
count,name
1,bob,
2,john,
4,foo,
5.333333333,test,
6.833333333,raa,
8.333333333,kaa,
9.833333333,thaa,
11.33333333,paa,
12.83333333,maa,
I think you should reconsider your approach. You can achieve your goal simply by iterating over the CSV file, without creating intermediate dicts and lists. And since you want to work with specific columns, you'll make your life easier and your code more readable by using DictReader and DictWriter:
import csv
search_string = "whatever"
with open('file1.csv', newline='') as f, open('file2.csv', 'w', newline='') as g:
    reader = csv.DictReader(f)
    c1, c2, c3, *_ = reader.fieldnames
    writer = csv.DictWriter(g, fieldnames=(c1, c2))
    for row in reader:
        if row[c3] == search_string:
            writer.writerow({c1:row[c1], c2:row[c2]})
Keep in mind that the csv module always returns strings. You have to handle data-type conversions yourself if you need them (I've left that out of the example above).
If you don't want to use DictReader/DictWriter (I suppose it is a little more verbose) and don't want a header in your output file:
with open('file1.csv') as f, open('file2.csv', 'w') as g:
    reader = csv.reader(f)
    next(reader, None) # discard the header
    writer = csv.writer(g)
    for row in reader:
        if row[2] == search_string:
            writer.writerow(row[:2])
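Since the question says the search values come from another CSV file, one way to avoid re-reading the data file once per search value (and the repeated output rows that causes) is to load the search values into a set first and then make a single pass. This is a hedged sketch rather than a drop-in solution: extract_matching and its parameters are hypothetical names, and the sample strings are trimmed-down stand-ins for file1.csv and file2.csv.

```python
import csv
import io


def extract_matching(search_file, data_file, out_file, search_col=2, data_col=2):
    """Write the first two columns of each data_file row whose data_col value
    appears in search_col of search_file."""
    search_reader = csv.reader(search_file)
    next(search_reader, None)  # skip the header of the search file
    wanted = {row[search_col] for row in search_reader if len(row) > search_col}

    data_reader = csv.reader(data_file)
    writer = csv.writer(out_file)
    header = next(data_reader, None)
    if header is not None:
        writer.writerow(header[:2])
    # Single pass over the data, so every matching row is written exactly once.
    for row in data_reader:
        if len(row) > data_col and row[data_col] in wanted:
            writer.writerow(row[:2])


search = io.StringIO("Country,Location,number\na,ab,1\nb,cd,1\nc,ef,2\n")
data = io.StringIO("count,name,number\n1,bob,1\n2,john,1\n4,foo,2\n5,test,9\n")
found = io.StringIO()
extract_matching(search, data, found)
print(found.getvalue())
```

Duplicate search values (the two 1s here) collapse in the set, which is why bob and john appear only once.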
That is how to search for text/string in csv file and if present how
to extract only specific column values to a new csv file?
This is two questions.
First question: to search for text in a file, the simplest answer would be to read the file text into memory and look for the text. If you want to look for the text in a specific column of the csv you're reading in, you can use a DictReader to make life easy:
for row in reader:
    if search_target in row[header]:
        pass  # found it!
Second question:
One way to write specific columns to a new csv would be as follows:
keys = ["Country", "Location"]
new_rows = [{key: row[key] for key in keys} for row in reader]
writer = csv.DictWriter(somefile, keys)
writer.writerows(new_rows)
This may help you understand it better. It reads two CSV files and checks whether the key values in their rows match; when they do, the row is written to another CSV file.
import csv
import os
output_dir = r"D:\Laneending\data-ars540"
file1 = "3rd_test_rec_road_width_changing_scenarios_250_inference.csv"
file2 = "df_5_signals_1597515776730734.csv"
ars540 = os.path.join(output_dir, file1)
veh_dyn = os.path.join(output_dir, file2)
file3 = "df_5_signals_1597515776730734_processed.csv"
output_file = os.path.join(output_dir, file3)
with open(ars540, 'r') as f1, open(veh_dyn, 'r') as f2, \
        open(output_file, 'w+', newline='') as f3:
    f1_reader = csv.reader(f1)
    f2_reader = csv.reader(f2)
    header_f1 = next(f1_reader) # read (and skip) the header row of f1
    header_f2 = next(f2_reader) # read (and skip) the header row of f2
    count = 0
    writer = csv.writer(f3) # prepare f3 for writing
    writer.writerow(["Timestamp", "no of detections", "velocity", "yawrate" , "afdr"])
    for row_f1 in f1_reader: # look at each row of f1
        # f2_reader is a one-pass iterator: it resumes after the last match and is never rewound
        for row_f2 in f2_reader: # look at the remaining rows of f2
            if row_f1[1] == row_f2[0]: # the keys match
                # print(row_f2)
                print(count)
                writer.writerow(row_f2)
                count +=1
                break
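The nested loops above share a single pass over the second reader, so they only find every match when both files list their keys in the same order. A dictionary lookup removes that restriction and needs just one pass over each file. The sketch below is an alternative technique, not the code from the answer: match_rows, its column defaults, and the sample data are assumptions made for the illustration.

```python
import csv
import io


def match_rows(left_file, right_file, out_file, left_col=1, right_col=0):
    """Write each right_file row whose key equals the key of a left_file row.

    Returns the number of rows written.
    """
    right_reader = csv.reader(right_file)
    next(right_reader, None)  # skip the header
    # Index the right-hand rows by key; the first row wins on duplicate keys.
    index = {}
    for row in right_reader:
        index.setdefault(row[right_col], row)

    left_reader = csv.reader(left_file)
    next(left_reader, None)  # skip the header
    writer = csv.writer(out_file)
    matched = 0
    for row in left_reader:
        hit = index.get(row[left_col])
        if hit is not None:
            writer.writerow(hit)
            matched += 1
    return matched


# Hypothetical data; note that the right-hand file is not sorted by key.
left = io.StringIO("frame,timestamp\n0,100\n1,200\n2,300\n")
right = io.StringIO("timestamp,velocity\n300,9.5\n100,7.0\n400,1.0\n")
joined = io.StringIO()
n_matched = match_rows(left, right, joined)
print(n_matched, joined.getvalue())
```

The trade-off is memory: the whole second file is held in the dictionary, which is usually fine unless it is very large.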

Replace column in csv with modified column

I got a csv file with a couple of columns and a header containing 4 rows. The first column contains the timestamp. Unfortunately it also gives milliseconds, but whenever those are at 00, they are not given in the file. It looks like that:
"TOA5","CR1000","CR1000","E9048"
"TIMESTAMP","RECORD","BattV_Avg","PTemp_C_Avg"
"TS","RN","Volts","Deg C"
"","","Avg","Avg"
"2015-08-28 12:40:23.51",1,12.91,32.13
"2015-08-28 12:50:43.23",2,12.9,32.34
"2015-08-28 13:12:22",3,12.91,32.54
As I don't need the milliseconds, I want to get rid of those, as this makes further calculations containing time a bit complicated. My approach so far:
Extract first 20 digits in each row to get a format such as 2015-08-28 12:40:23
timestamp = []
with open(filepath) as f:
    for _ in xrange(4): #skip 4 header rows
        next(f)
    for line in f:
        time = line[1:20] #Get values for the current line
        timestamp.append(time) #Add values to list
From here on I'm struggling with how to proceed. I want to replace the first column in the csv file with the newly created timestamp list.
I tried creating a dictionary, but I don't know how to use the header caption in row 2 as the key:
d = {}
with open(filepath, 'rb') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for col in csv_reader:
        #use header info from row 2 as key here
This would import the whole csv file into a dict and I'd then change the TIMESTAMP entry in the dict with the timestamp list above. Is this even possible?
Or is there an easier approach on how to just change the first column in the csv with my new list so that my csv file in the end contains the timestamp just without the millisecond information?
So the first column in my csv should look like this:
"TOA5"
"TIMESTAMP"
"TS"
""
2015-08-28 12:40:23
2015-08-28 12:50:43
2015-08-28 13:12:22
This should do it and preserve the quoting:
with open(filepath1, 'rb') as fin, open(filepath2, 'wb') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
    for _ in xrange(4): # copy first 4 header rows
        writer.writerow(next(reader))
    for row in reader: # process data lines
        row[0] = row[0][:19] # strip fractional seconds from first column
        writer.writerow([row[0], int(row[1])] + map(float, row[2:]))
Since a csv.reader returns the columns of each row as a list of strings, it's necessary to convert any which contain numeric values into their actual int or float numeric value before they're written out to prevent them from being quoted.
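The code above targets Python 2 (binary file modes, xrange, and a map that returns a list). A rough Python 3 equivalent might look like the following. Treat it as a sketch: the function name strip_fractional_seconds and the header_rows parameter are made up, and it assumes, as the answer does, that column 2 holds integers and every later column holds floats.

```python
import csv
import io


def strip_fractional_seconds(src, dst, header_rows=4):
    """Copy src to dst, cutting the first column of data rows to 19 characters."""
    reader = csv.reader(src)
    writer = csv.writer(dst, quoting=csv.QUOTE_NONNUMERIC)
    for _ in range(header_rows):  # copy the header rows unchanged
        writer.writerow(next(reader))
    for row in reader:
        timestamp = row[0][:19]  # "YYYY-MM-DD HH:MM:SS" is 19 characters
        # Convert numeric cells so QUOTE_NONNUMERIC leaves them unquoted.
        writer.writerow([timestamp, int(row[1])] + [float(v) for v in row[2:]])


source = io.StringIO(
    '"TOA5","CR1000","CR1000","E9048"\n'
    '"TIMESTAMP","RECORD","BattV_Avg","PTemp_C_Avg"\n'
    '"TS","RN","Volts","Deg C"\n'
    '"","","Avg","Avg"\n'
    '"2015-08-28 12:40:23.51",1,12.91,32.13\n'
    '"2015-08-28 13:12:22",3,12.91,32.54\n'
)
cleaned = io.StringIO()
strip_fractional_seconds(source, cleaned)
print(cleaned.getvalue())
```

With files on disk, both would be opened in text mode with newline='' rather than 'rb' and 'wb'.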
I believe you can easily create a new csv from iterating over the original csv and replacing the timestamp as you want.
Example -
with open(filepath, 'rb') as csv_file, open('<new file>','wb') as outfile:
    csv_reader = csv.reader(csv_file, delimiter=',')
    csv_writer = csv.writer(outfile, delimiter=',')
    for i, row in enumerate(csv_reader): #Enumerating as we only need to change rows after 3rd index.
        if i <= 3:
            csv_writer.writerow(row)
        else:
            # csv.reader has already removed the quotes, so keep the first 19 characters
            csv_writer.writerow([row[0][:19]] + row[1:])
I'm not entirely sure about how to parse your csv but I would do something of the sort:
time = time.split(".")[0]
so if it does have a millisecond it would get removed and if it doesn't nothing will happen.
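To make the split idea concrete, here is a tiny sketch applied to the two timestamp shapes from the question. The helper name drop_fraction is made up; the strptime call is only there to show that the cleaned value parses with a single format string.

```python
from datetime import datetime


def drop_fraction(timestamp):
    """Return the timestamp without its fractional-seconds part, if any."""
    return timestamp.split(".")[0]


for raw in ("2015-08-28 12:40:23.51", "2015-08-28 13:12:22"):
    cleaned_ts = drop_fraction(raw)
    parsed = datetime.strptime(cleaned_ts, "%Y-%m-%d %H:%M:%S")
    print(cleaned_ts, parsed)
```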

reading and parsing a TSV file, then manipulating it for saving as CSV (*efficiently*)

My source data is in a TSV file, 6 columns and greater than 2 million rows.
Here's what I'm trying to accomplish:
1. I need to read the data in 3 of the columns (3, 4, 5) in this source file.
2. The fifth column is an integer. I need to use this integer value to duplicate a row entry, using the data in the third and fourth columns, that many times.
3. I want to write the output of #2 to an output file in CSV format.
Below is what I came up with.
My question: is this an efficient way to do it? It seems like it might be intensive when attempted on 2 million rows.
First, I made a sample tab-separated file to work with, and called it 'sample.txt'. It's basic and only has four rows:
Row1_Column1 Row1-Column2 Row1-Column3 Row1-Column4 2 Row1-Column6
Row2_Column1 Row2-Column2 Row2-Column3 Row2-Column4 3 Row2-Column6
Row3_Column1 Row3-Column2 Row3-Column3 Row3-Column4 1 Row3-Column6
Row4_Column1 Row4-Column2 Row4-Column3 Row4-Column4 2 Row4-Column6
then I have this code:
import csv
with open('sample.txt','r') as tsv:
    AoA = [line.strip().split('\t') for line in tsv]
for a in AoA:
    count = int(a[4])
    while count > 0:
        with open('sample_new.csv', 'a', newline='') as csvfile:
            csvwriter = csv.writer(csvfile, delimiter=',')
            csvwriter.writerow([a[2], a[3]])
        count = count - 1
You should use the csv module to read the tab-separated value file. Do not read it into memory in one go. Each row you read has all the information you need to write rows to the output CSV file, after all. Keep the output file open throughout.
import csv
with open('sample.txt', newline='') as tsvin, open('new.csv', 'w', newline='') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)
    for row in tsvin:
        count = int(row[4])
        if count > 0:
            csvout.writerows([row[2:4] for _ in range(count)])
or, using the itertools module to do the repeating with itertools.repeat():
from itertools import repeat
import csv
with open('sample.txt', newline='') as tsvin, open('new.csv', 'w', newline='') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)
    for row in tsvin:
        count = int(row[4])
        if count > 0:
            csvout.writerows(repeat(row[2:4], count))
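If the expansion step needs to be reused or tested separately from the file handling, it can be pulled into a generator so that rows are still streamed one at a time. This is just one possible arrangement; expand_rows is a made-up name and the sample data below is invented.

```python
import csv
import io
from itertools import repeat


def expand_rows(tsv_file):
    """Yield [col3, col4] once for each unit of the integer in column 5."""
    for row in csv.reader(tsv_file, delimiter="\t"):
        # repeat() yields nothing when the count is zero.
        yield from repeat(row[2:4], int(row[4]))


sample_tsv = io.StringIO(
    "r1c1\tr1c2\tr1c3\tr1c4\t2\tr1c6\n"
    "r2c1\tr2c2\tr2c3\tr2c4\t0\tr2c6\n"
    "r3c1\tr3c2\tr3c3\tr3c4\t1\tr3c6\n"
)
expanded = io.StringIO()
# writerows consumes the generator lazily, so memory use stays flat.
csv.writer(expanded).writerows(expand_rows(sample_tsv))
print(expanded.getvalue())
```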
