Compare two CSV files and look for matches Python - python

I have two CSV files that are like
CSV1
H1,H2,H3
arm,biopsy,forearm
heart,leg biopsy,biopsy
organs.csv
arm
leg
forearm
heart
skin
I need to compare both the files and get an output list like this [arm,forearm,heart,leg] but the script that I'm currently working on doesn't give me any output (I want leg also in the output, though it is mixed with biopsy in the same cell). Here's the code so far. How can I get all the matched words?
import csv
import io
alist, blist = [], []
with open("csv1.csv", "rb") as fileA:
reader = csv.reader(fileA, delimiter=',')
for row in reader:
alist.append(row)
with open("organs.csv", "rb") as fileB:
reader = csv.reader(fileB, delimiter=',')
for row in reader:
blist.append(row)
first_set = set(map(tuple, alist))
secnd_set = set(map(tuple, blist))
matches = set(first_set).intersection(secnd_set)
print matches

Try this:
import csv
alist, blist = [], []
with open("csv1.csv", "rb") as fileA:
reader = csv.reader(fileA, delimiter=',')
for row in reader:
for row_str in row:
alist += row_str.strip().split()
with open("organs.csv", "rb") as fileB:
reader = csv.reader(fileB, delimiter=',')
for row in reader:
blist += row
first_set = set(alist)
second_set = set(blist)
print first_set.intersection(second_set)
Basically, iterating through the csv file via csv reader returns a row which is a list of the items (strings) like this ['arm', 'biopsy', 'forearm'], so you have to sum lists to insert all of the items.
On the other hand, to remove duplications only one set conversion via the set() function is required, and the intersection method returns another set with the elements.

Change the part reading from csv1.csv to:
with open("csv1.csv", "rb") as fileA:
reader = csv.reader(fileA, delimiter=',')
for row in reader:
# append all words in cell
for word in row:
alist.append(word)

I would treat the CSV files as text files, get a lists of all the words in the first and the seconds, then iterate over the first list to see if any exactly match any in the second list.

Related

Python modify column in a CSV to muliple columns

I'm trying to modify one column in a CSV, to change it to multiple columns.
Hence this CSV:
title,body,field_tag,field_titel
--------------------------------
"bladibla", "bla.....bla", "[""tag1"",""tag2"",""tag3"",""tag4""]", "bladiblabla"
"bladibla", "bla.....bla", "[""tag3"",""tag4"",""tag5"",""tag7"",""tag8"",""tag11""]", "bladiblabla"
What I want is this:
title,body,field_titel,field_tag,field_tag,field_tag,field_tag,field_tag,field_tag
--------------------------------
"bladibla","bla.....bla","bladiblabla","tag1,"tag2","tag3","tag4"
"bladibla","bla.....bla","bladiblabla","tag3,"tag4","tag5","tag7","tag8","tag11"
How to achieve this in Python?
What i've tried so far is this, but not given the result i want.
import csv
import numpy
with open('tester.csv','r') as csvinput:
with open('testeroutput.csv', 'w') as csvoutput:
writer = csv.writer(csvoutput, lineterminator='\n')
reader = csv.reader(csvinput)
all = []
rij = next(reader)
for row in reader:
# print row['field_tag']
strlist = row[3]
#remove [ and ]
strlist = (strlist.replace('[', ''))
strlist = (strlist.replace(']', ''))
text = strlist.split(',')
#make string of list
for tag in text:
str1 = ''.join(tag)
print str1
print(type(str1))
row.append('field_tag')
all.append(row)
row.append(str1)
all.append(row)
writer.writerows(all)
Hope that you can point me in a better direction.
Utilize this snippet:
import ast
for row in reader:
row.extend(ast.literal_eval(row.pop(2)))
writer.writerow(row)
row.pop(2) removes the third item from the row and returns it. ast.literal_eval() safely evaluates that third item as long as it contains "Python literal structures: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None."

Check for unique elements of csv

I would like to check for duplicates in a .csv (structure bellow). Every value in this .csv has to be unique! You can find "a" thrice, but it should be there only once.
###start
a
a;b;
d;e
f;g
h
i;
i
d;b
a
c;i
### end
The progress so far:
import os,glob
import csv
folder_path = "csv_entities/"
found_rows = set()
for filepath in glob.glob(os.path.join(folder_path, "*.csv")):
with open(filepath) as fin, open("newfile.csv", "w") as fout:
reader = csv.reader(fin, delimiter=";")
writer = csv.writer(fout, delimiter=";")
for row in reader:
# delete empty list elements
if "" in row:
row = row[:-1]
#delete empt row
if not row:
continue
row = tuple(row) # make row hashable
# don't write if row is there already!
if row in found_rows:
continue
print(row)
writer.writerow(row)
found_rows.add(row)
Which results in this csv:
###start
a
a;b
d;e
f;g
h
i
d;b
c;i
###end
The most important question is right now: How can I get rid of the double values?
e.g in the second row there should be only "b" instead of "a;b", because "a" is already in the row before.
your mistake is to consider the rows themselves as unique elements. You have to consider cells as elements.
So use your marker set to mark elements, not rows.
Example with only one input file (using several input files with only one output file makes no sense)
found_values = set()
with open("input.csv") as fin, open("newfile.csv", "w",newline="") as fout:
reader = csv.reader(fin, delimiter=";")
writer = csv.writer(fout, delimiter=";")
for row in reader:
# delete empty list elements & filter out already seen elements
new_row = [x for x in row if x and x not in found_values]
# update marker set with row contents
found_values.update(row)
if new_row:
# new row isn't empty: write it
writer.writerow(new_row)
the resulting csv file is:
a
b
d;e
f;g
h
i
c

Skip Row In CSV If Contains Strings From List

I have a list of approximately 500 strings that I want to check against a CSV file containing 25,000 rows. What I currently have seems to be getting stuck looping. I basically want to skip the row if it contains any of the strings in my string list and then extract other data.
stringList = [] #strings look like "AAA", "AAB", "AAC", etc.
with open('BadStrings.csv', 'r')as csvfile:
filereader = csv.reader(csvfile, delimiter=',')
for row in filereader:
stringToExclude = row[0]
stringList.append(stringToExclude)
with open('OtherData.csv', 'r')as csvfile:
filereader = csv.reader(csvfile, delimiter=',')
next(filereader, None) #Skip header row
for row in filereader:
for s in stringList:
if s not in row:
data1 = row[1]
Edit: Not an infinite loop, but looping is taking too long.
according to Niels I would change the 2 loop and iterate over the row itself and check if the current row entry is inside the "bad" list:
for row in filereader:
for s in row:
if s not in stringlist:
data1 = row[0]
And I also dont know what you want to do with data1 but you always change the object reference when an item is not in stringList.
You could use a list to add the items to a list with data1.append(item)
You could try something like this.
stringList = [] #strings look like "AAA", "AAB", "AAC", etc.
with open('BadStrings.csv', 'r')as csvfile:
filereader = csv.reader(csvfile, delimiter=',')
for row in filereader:
stringToExclude = row[0]
stringList.append(stringToExclude)
data1 = [] # Right now you are overwriting your data1 every time. I don't know what you want to do with it, but you could for exmaple add all row[1] to a list data1
with open('OtherData.csv', 'r')as csvfile:
filereader = csv.reader(csvfile, delimiter=',')
next(filereader, None) #Skip header row
for row in filereader:
found_s = False
for s in stringList:
if s in row:
found_s = True
break
if not found_s:
data1.append(row[1]) # Add row[1] to the list is no element of stringList is found in row.
Still probably not a huge performance improvement, but at least the for loop for s in stringList: will now stop after s is found.

Need help in finding the row of CSV which contains the values in array

I have an array LiveTick = ['ted3m index','US0003m index','USGG3m index'] and I am reading a CSV file book1.csv. I have to find the row which contains the values in csv.
For example, 15th row will contain ted3m index 500 | 600 and 20th row will contain US0003m index 800 | 900 and likewise.
I then have to get the values contained in the row and parse it for each value contained in array LiveTick. How do I proceed? Below is my sample code:
with open('C:\\blp\\book1.csv', 'r') as f:
reader = csv.reader(f, delimiter=',')
writer = csv.writer(outf)
for row in reader:
for list in LiveTick:
if list in row:
print ('Found: {}'.format(row))
You can use pandas, it's pretty fast and will do all reading, writing and filtering job for you out of the box:
import pandas as pd
df = pd.read_csv('C:\\blp\\book1.csv')
filtered_df = df[df['your_column_name'].isin(LiveTick)]
# now you can save it
filtered_df.to_csv('C:\\blp\\book_filtered.csv')
You have the right idea, but there are a few improvements you can make:
Instead of a nested for loop which doesn't short-circuit, use any to compare the first column to multiple values.
Write to your csv as you go along instead of just print. This is memory-efficient, as you hold in memory only one line at any one time.
Define outf as an open object in your with statement.
Do not shadow built-in list. Use another identifier, e.g. i, for elements in LiveTick.
Here's a demo:
with open('in.csv', 'r') as f, open('out.csv', 'wb', newline='') as outf:
reader = csv.reader(f, delimiter=',')
writer = csv.writer(outf, delimiter=',')
for row in reader:
if any(i in row[0] for i in LiveTick):
writer.writerow(row)

CSV split rows into lists

So i would like to split string from list into multiple lists
like rows[1] should be splited into another list contained in list m
i saw this here and it hsould be accesable m[0][0] to get first item form first list .
import csv
reader = csv.reader(open("alerts.csv"), delimiter=',')
)
rows=[]
for row in reader:
rows.append(row)
num_lists=int(len(rows))
lists=[]
m=[]
for x in rows:
m.append(x.split(';')[0])
printing rows:
[['priority;status;time;object_class;host;app;inc;tool;msg'], ['P2;CLOSED;24-09-2016 20:06:41;nm;prod;;390949;HPNNM;call'], ['P2;CLOSED;24-09-2016 20:06:41;nm;prod;;390949;HPNNM;msg'], ['P2;CLOSED;24-09-2016 20:06:41;nm;prod;;390949;HPNNM;msg']]
and output should look like
m[0][0] should return pririty
you can do this pretty easily with pandas
import pandas as pd
A = pd.read_csv('yourfile.csv')
for x in A.values:
for y in x:
print y
so the 'print y' statement access each element in the row. but I mean, after the "for x in A.values" you can do just about anything
Exact solution to your question; you almost got it right (note the delimiter value):
reader = csv.reader(open("alerts.csv"), delimiter=';')
table = [row for row in reader]
print(table[0][0])
>>> priority
For easy data handling, it is often nice to explicitly extract the header like so:
reader = csv.reader(open("alerts.csv"), delimiter=';')
header = reader.next()
table = [row for row in reader]
print(header[0])
print(table[0][0])
>>> priority
>>> P2
Here's how to do it:
import csv
with open('alerts.csv') as f:
reader = csv.reader(f, delimiter=';')
next(reader) # skip over the first header row
rows = [row for row in reader]
>>> print(rows[0][0])
P2
This uses a list comprehension to read all rows from the CSV file into a list. The delimiter should be a semi-colon, not a comma; so use delimiter=';'. Also the first row is a header and is therefore skipped.

Categories

Resources