I read two .csv files like this.
ori = "all.csv"
det = "find.csv"
names = []
namesa = []
with open(det, "r") as cursor:
    for row in cursor:
        cells = row.split(",")
        if len(cells) > 2:
            b = cells[1]
            c = b.split("-")
            names.append(c[0])
with open(ori, "r") as rcursor1: #read the document
    for trow in rcursor1: #read each row
        row1 = trow.split(",") #split it by your separator
        namesa.append(row1)
Works just fine.
namesa is a nested list where every row from my .csv is a list (see example), while names contains the values which I want to find in namesa.
If a value from names is in namesa, I want the whole nested list that contains it, i.e.:
#example
namesa = [[a,b,c,], [a1, b1, c1], [xy, cd, e2], [u1, i1, il], ...]
names = [a, u1,]
return = [[a1, b1, c1], [u1, i1, il], ...]
#or
namesa = [[john,bill,catherina,], [marti, alex, christoph], [ben, sherlock, london], [Bern, paris, Zürich], ...]
names = [sherlock, marti]
results = [[marti, alex, christoph], [ben, sherlock, london]]
Well, that does not work.
That's what I tried so far:
#did not return any match
d = list([b for b in namesa if b in [a for a in names]])
print d
#did not return any match either
for a in namesa:
    for b in names:
        if b in a:
            print "match"
#well, that did not work either
for a in namesa:
    for b in names:
        if a[5] == b:
            print "match"
There are no matches coming back. I opened my two csv files in excel and searched "by hand" for matches which returned me results...
What am I doing wrong here? Working with python.
If you use a .csv file, I'd suggest using the csv module.
I'd do it this way (I'm assuming that the things you're looking for are in the column 'surname'. If they are in different columns, you can iterate over those columns, or test name in row['surname'] or name in row['name'], depending on how complicated it gets):
import csv
result = []
listFromCSV = []
names = ['alex','sherlock']
# read every row as a dict keyed by the header names
with open('yourFile.csv') as csvFile:
    reader = csv.DictReader(csvFile)
    fieldnames = reader.fieldnames
    for row in reader:
        listFromCSV.append(row)
for name in names:
    for row in listFromCSV:
        if name.strip() in row['surname']:
            result.append(row)
And if you want to get rid of duplicates, add a break at the end of the last for loop, right after result.append(row), so that each name stops at its first matching row.
namesa = [['john', 'bill', 'catherina'], ['cat', 'dog', 'foo'], ['noodle', 'bob']]
names = ['john','foo']
Try this
for n in names:
    for arr in namesa:
        if n.strip() in ''.join(arr):
            print arr
.strip() is there because the values in your names list seem to have trailing spaces.
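A minimal sketch of why an exact match can fail and how stripping helps. It assumes (this is a guess about the data, not something the question confirms) that the cells carry stray whitespace or the trailing newline left over from row.split(","). The sample values and the name find_rows are made up for illustration.

```python
def find_rows(names, rows):
    """Return every row that contains at least one of the wanted names."""
    # Normalise the search values once; a set gives fast membership tests.
    wanted = {name.strip() for name in names}
    matches = []
    for row in rows:
        # Strip each cell before comparing, so "london\n" equals "london".
        cells = {cell.strip() for cell in row}
        if wanted & cells:
            matches.append(row)
    return matches


# Hypothetical data: the last cell of each row still carries the newline.
namesa = [
    ["john", "bill", "catherina\n"],
    ["marti", "alex", "christoph\n"],
    ["ben", "sherlock", "london\n"],
]
names = ["sherlock ", " marti", "london"]

results = find_rows(names, namesa)
print(results)
```

Unlike the ''.join(arr) test, this compares whole cells, so a name such as 'bob' would not match a cell containing 'bobby'. Each row is returned once even if several names hit it.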
namesa = [['john','bill','catherina',], ['marti', 'alex', 'christoph'], ['ben', 'sherlock', 'london']]
names = ['sherlock', 'marti']
for i in namesa:
    for j in names:
        if j in i:
            print i
OUTPUT
['marti', 'alex', 'christoph']
['ben', 'sherlock', 'london']
I am trying to join two CSV files based on one common column.
I am reading the CSV file storing a list of tuples. My code:
def read_csv(path):
    file = open(path, "r")
    content_list = []
    for line in file.readlines():
        record = line.split(",")
        for i in range(len(record)):
            record[i] = record[i].replace("\n","")
        content_list.append(tuple(record))
    return content_list
a_list = read_csv("a.csv")
b_list = read_csv("b.csv")
This gives me lists with the CSV header as the first tuple:
a_list
[('user_id', 'activeFl'),
('80c611f1-532a-4f7d-aa80-f28b472c0dbe', 'True'),
('4d04ab57-1b50-4474-bd12-b2b16ed2cca3', 'True'),
('0f37a42a-a984-4402-97bd-0eac95fa95d1', 'True'),
('dbe15b19-0128-4e3a-a82b-c8154d272c18', 'True'), ......]
b_list
[('id','date','user_id','blockedFl','amount','type'),
('b7819826-6468-4416-9953-e739d8046b81','2021-04-23','18a382ef-bd38-4884-8bf','True','9.04','6'), ....]
I would like to merge these two lists based on the user_id, but I am stuck at this point. What can I try next?
The O(N^2) solution is:
result = list()
for left in a_list[1:]:
    for right in b_list[1:]:
        if left[0] == right[2]: # user_id is column 0 in a_list and column 2 in b_list
            result.append(right + left[1:])
            break
O(N) using a dictionary:
result = list()
b_dict = {x[2]: x for x in b_list[1:]} # keyed by user_id (column 2 of b_list)
for left in a_list[1:]:
    if left[0] in b_dict:
        result.append(b_dict.get(left[0]) + left[1:])
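A self-contained sketch of the dictionary join, assuming the column layout shown in the question (user_id is the first column of a.csv and the third column of b.csv). The helper name join_on and the short ids are invented for illustration; the key column is looked up from the header rather than hard-coded.

```python
def join_on(left_rows, right_rows, key):
    """Inner-join two header-first lists of tuples on the column named key."""
    left_index = left_rows[0].index(key)
    right_index = right_rows[0].index(key)
    # One pass over the left table builds the lookup: key value -> row.
    lookup = {row[left_index]: row for row in left_rows[1:]}
    joined = []
    for row in right_rows[1:]:
        match = lookup.get(row[right_index])
        if match is not None:
            # Append the left row's other columns, skipping the repeated key.
            extra = tuple(v for i, v in enumerate(match) if i != left_index)
            joined.append(row + extra)
    return joined


# Hypothetical rows in the same shape as a_list and b_list.
a_rows = [
    ("user_id", "activeFl"),
    ("u1", "True"),
    ("u2", "False"),
]
b_rows = [
    ("id", "date", "user_id", "blockedFl", "amount", "type"),
    ("b1", "2021-04-23", "u1", "True", "9.04", "6"),
    ("b2", "2021-04-24", "u3", "False", "1.00", "2"),
]

merged = join_on(a_rows, b_rows, "user_id")
print(merged)
```

Because the lookup is built from the left table, this sketch assumes each user_id appears at most once there; several rows in the right table may still share one user_id.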
This is one approach using the csv module and a dict.
Ex:
import csv
def read_csv(path, key_index):
    with open(path) as infile:
        reader = csv.reader(infile)
        header = next(reader) # skip the header row
        content = {i[key_index]: i for i in reader} # user_id as key
    return content
a_list = read_csv("a.csv", 0) # user_id is the first column of a.csv
b_list = read_csv("b.csv", 2) # user_id is the third column of b.csv
merge_data = {k: v + [a_list.get(k)] for k, v in b_list.items()}
print(merge_data) # OR print(list(merge_data.values()))
In Python, I want to compare each item in a row of one column against the items in the corresponding row of another column.
If an item is not present in the row of the second column, it should be appended to a new list that will become another column (appending only if i not in c should also eliminate duplicates).
The goal is to compare the items in each row of one column against the items in the corresponding row of another column, and to save the unique values from the first column in a new column of the same df.
(image: df columns)
This is just an example; I have many more items in each row.
I tried this code, but nothing happened, and the conversion of the list into a column is not correct from what I have tested:
a= df['final_key_concat'].tolist()
b = df['attributes_tokenize'].tolist()
c = []
for i in df.values:
    for i in a:
        if i in a:
            if i not in b:
                if i not in c:
                    c.append(i)
print(c)
df['new'] = pd.Series(c)
Any help is more than needed, thanks in advance
So seeing as you have these two variables one way would be:
a= df['final_key_concat'].tolist()
b = df['attributes_tokenize'].tolist()
Try something like this:
new = {}
for index, items in enumerate(a):
    for thing in items:
        if thing not in b[index]:
            if index in new:
                new[index].append(thing)
            else:
                new[index] = [thing]
Then map the dictionary to the df.
df['new'] = df.index.map(new)
There are better ways to do it but this should work.
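One possible alternative, sketched in plain Python so it runs without pandas: pair the two lists row by row with zip and keep, per row, the items missing from the second list. The helper name unique_missing is made up, and the sample rows are the ones recreated from the question's image in another answer.

```python
def unique_missing(items, restricts):
    """Items that do not appear in restricts, first occurrence only, in order."""
    blocked = set(restricts)
    kept = []
    for item in items:
        if item not in blocked and item not in kept:
            kept.append(item)
    return kept


# Stand-ins for df['final_key_concat'].tolist() and df['attributes_tokenize'].tolist()
a = [
    ["Camiseta", "Tecnica", "hombre", "barate"],
    ["deportivas", "calcetin", "hombres", "deportivas", "shoes"],
]
b = [
    ["The", "North", "Face", "manga"],
    ["deportivas", "calcetin", "shoes", "North"],
]

# One result list per row, so the output has as many entries as the frame has rows.
new_column = [unique_missing(row_a, row_b) for row_a, row_b in zip(a, b)]
print(new_column)
```

Since new_column has one entry per row, it could then be assigned directly with df['new'] = new_column; rows with nothing left get an empty list rather than NaN.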
This should be what you want:
import pandas as pd
data = {'final_key_concat':[['Camiseta', 'Tecnica', 'hombre', 'barate'],
['deportivas', 'calcetin', 'hombres', 'deportivas', 'shoes']],
'attributes_tokenize':[['The', 'North', 'Face', 'manga'], ['deportivas',
'calcetin', 'shoes', 'North']]} #recreated from your image
df = pd.DataFrame(data)
a= df['final_key_concat'].tolist() #this generates a list of lists
b = df['attributes_tokenize'].tolist()#this also generates a list of lists
#Both list a and b need to be flattened so as to access their elements the way you want it
c = [itm for sblst in a for itm in sblst] #flatten list a using list comprehension
d = [itm for sblst in b for itm in sblst] #flatten list b using list comprehension
final_list = [itm for itm in c if itm not in d]#Keep the elements of c that do not appear in d
print (final_list)
Result
['Camiseta', 'Tecnica', 'hombre', 'barate', 'hombres']
def parse_str_into_list(s):
    if s.startswith('[') and s.endswith(']'):
        return ' '.join(s.strip('[]').strip("'").split("', '"))
    return s
def filter_restrict_words(row):
    targets = parse_str_into_list(row[0]).split(' ', -1)
    restricts = parse_str_into_list(row[1]).split(' ', -1)
    print(restricts)
    # start for loop each words
    # use set type to save words or list if we need to keep words in order
    words_to_keep = []
    for word in targets:
        # condition to keep eligible words
        if word not in restricts and 3 < len(word) < 45 and word not in words_to_keep:
            words_to_keep.append(word)
    print(words_to_keep)
    return ' '.join(words_to_keep)
df['FINAL_KEYWORDS'] = df[[col_target, col_restrict]].apply(lambda x: filter_restrict_words(x), axis=1)
I am trying to compare two csv files to look for common values in column 1.
import csv
f_d1 = open('test1.csv')
f_d2 = open('test2.csv')
f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)
for x,y in zip(f1_csv,f2_csv):
    print(x,y)
I am trying to compare x[0] with y[0]. I am fairly new to python and trying to find the most pythonic way to achieve the results. Here are the csv files.
test1.csv
Hadrosaurus,1.2
Struthiomimus,0.92
Velociraptor,1.0
Triceratops,0.87
Euoplocephalus,1.6
Stegosaurus,1.4
Tyrannosaurus Rex,2.5
test2.csv
Euoplocephalus,1.87
Stegosaurus,1.9
Tyrannosaurus Rex,5.76
Hadrosaurus,1.4
Deinonychus,1.21
Struthiomimus,1.34
Velociraptor,2.72
I believe you're looking for the set intersection:
import csv
f_d1 = open('test1.csv')
f_d2 = open('test2.csv')
f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)
x = set([item[0] for item in f1_csv])
y = set([item[0] for item in f2_csv])
print(x & y)
Assuming that the files are not prohibitively large, you can read both of them with a CSV reader, convert the first columns to sets, and calculate the set intersection:
import csv
with open('test1.csv') as f:
    set1 = set(x[0] for x in csv.reader(f))
with open('test2.csv') as f:
    set2 = set(x[0] for x in csv.reader(f))
print(set1 & set2)
#{'Hadrosaurus', 'Euoplocephalus', 'Tyrannosaurus Rex', 'Struthiomimus',
# 'Velociraptor', 'Stegosaurus'}
I added a line to test whether the numerical values in each row are the same. You can modify this to test whether, for instance, the values are within some distance of each other:
import csv
f_d1 = open('test1.csv')
f_d2 = open('test2.csv')
f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)
for x,y in zip(f1_csv,f2_csv):
    if x[1] == y[1]:
        print('they match!')
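A sketch of the "within some distance" variant. Note that zip pairs rows by position, and in the sample files the same animal sits on different lines, so this version looks the values up by name instead. The tolerance value and the helper names load and close_values are assumptions for illustration; the data is a few rows from the question, inlined so the example runs on its own.

```python
import csv
import io

# Inline stand-ins for test1.csv and test2.csv (a few rows from the question).
TEST1 = "Hadrosaurus,1.2\nStegosaurus,1.4\nVelociraptor,1.0\nTriceratops,0.87\n"
TEST2 = "Stegosaurus,1.9\nHadrosaurus,1.4\nDeinonychus,1.21\nVelociraptor,2.72\n"


def load(text):
    """Map each name (first column) to its numeric value (second column)."""
    return {name: float(value) for name, value in csv.reader(io.StringIO(text))}


def close_values(left, right, tolerance):
    """Names present in both mappings whose values differ by at most tolerance."""
    common = sorted(left.keys() & right.keys())
    return [name for name in common if abs(left[name] - right[name]) <= tolerance]


first = load(TEST1)
second = load(TEST2)
print(close_values(first, second, tolerance=0.25))
```

With real files, the io.StringIO(...) wrappers would be replaced by the opened file objects.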
Take advantage of defaultdict in Python: you can iterate over both files and collect the values for each name in a dictionary like this
from collections import defaultdict
d = defaultdict(list)
for row in f1_csv:
    d[row[0]].append(row[1])
for row in f2_csv:
    d[row[0]].append(row[1])
d = {k: d[k] for k in d if len(d[k]) > 1} # keep only names with more than one value
print(d)
Output:
{'Hadrosaurus': ['1.2', '1.4'], 'Struthiomimus': ['0.92', '1.34'], 'Velociraptor': ['1.0', '2.72'],
'Euoplocephalus': ['1.6', '1.87'], 'Stegosaurus': ['1.4', '1.9'], 'Tyrannosaurus Rex': ['2.5', '5.76']}
I have two sets that are like below
Set A:
(['African American and Japanese', 'Indian', 'Chinese'])
Set B:
(['African', 'American', 'African American', 'Chinese', 'Russian'])
I want the output to be (['African American', 'Chinese']) but my script gives me either just Chinese or African, American, Chinese (splits African and American, I know that's how my script is, but am not sure how to edit).
I tried this so far.
import csv
alist, blist = [], []
with open("sample.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        for row_str in row:
            alist.append(row_str)
#alist = alist.strip().split() #If I use this, it also prints African, but doesn't print African American.
with open("ethnicity.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter='\n')
    for row in reader:
        blist += row
blist = [x.lower() for x in blist]
first_set = set(alist)
second_set = set(blist)
print [s for s in first_set if second_set in s]
EDIT:
Elements in SetA are not always separated by "and", it could be anything else or just a space.
You can rearrange the list, i.e. split each list item that contains "and" as a substring.
Then use the intersection method of set to get the items common to both lists.
code:
def convert(items):
    output = []
    for i in items:
        # split on " and " (with spaces) so words such as "Finland" stay intact
        for j in i.split(" and "):
            output.append(j.strip())
    return output
a = ['African American and Japanese', 'Indian', 'Chinese']
b = ['African American', 'Chinese']
a = convert(a)
print a
b = convert(b)
print set(a).intersection(set(b))
Output:
set(['African American', 'Chinese'])
Is this helpful ?
If it could be any string (spaces included) separating the words, you can do something like this:
import re
sep = ' ; '
_a = sep.join(re.split(' [a-z]* ', sep.join(a)))
_b = sep.join(re.split(' [a-z]* ', sep.join(b)))
set(_b.split(sep)).intersection(_a.split(sep))
It won't work when ; is separating two words in your lists... but I think it does handle all cases where the separator is a non-capitalized word.
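Another way to look at it, sketched under assumptions: instead of guessing the separator, search each entry of set A for every phrase of set B as a whole-word match, then drop any hit that is contained in a longer hit (so 'African' and 'American' give way to 'African American'). The function name phrases_found is invented, and the longest-match rule is my reading of the desired output, not something the question spells out.

```python
import re


def phrases_found(targets, phrases):
    """Phrases that occur as whole words in any target, longest matches only."""
    found = set()
    for phrase in phrases:
        # \b keeps 'African' from matching inside a longer word.
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        if any(pattern.search(target) for target in targets):
            found.add(phrase)
    # Drop a phrase when a different, longer found phrase contains it.
    return sorted(
        p for p in found
        if not any(p != q and p.lower() in q.lower() for q in found)
    )


set_a = ['African American and Japanese', 'Indian', 'Chinese']
set_b = ['African', 'American', 'African American', 'Chinese', 'Russian']

print(phrases_found(set_a, set_b))
```

One limitation of this sketch: if 'African' also appeared on its own in another entry of set A, it would still be dropped because 'African American' was found somewhere.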
I have an "asin.txt" document:
in,Huawei1,DE
out,Huawei2,UK
out,Huawei3,none
in,Huawei4,FR
in,Huawei5,none
in,Huawei6,none
out,Huawei7,IT
I open this file and build an OrderedDict:
import csv
from collections import OrderedDict
reader = csv.reader(open('asin.txt','r'),delimiter=',')
reader1 = csv.reader(open('asin.txt','r'),delimiter=',')
d = OrderedDict((row[0], row[1].strip()) for row in reader)
d1 = OrderedDict((row[1], row[2].strip()) for row in reader1)
Then I want to create variables (a,b,c,d) so if we take the first line of the asin.txt it should be like: a = in; b = Huawei1; c = Huawei1; d = DE. To do this I'm using a "for" loop:
from itertools import izip
for (a, b), (c, d) in izip(d.items(), d1.items()): # here
    try:
        .......
It worked before, but now, for some reason, it prints an error:
d = OrderedDict((row[0], row[1].strip()) for row in reader)
IndexError: list index out of range
How do I fix that?
Probably you have a row in your textfile which does not have at least two fields delimited by ",". E.g.:
in,Huawei1
Try to find the solution along these lines:
d = OrderedDict((row[0], row[1].strip()) for row in reader if len(row) >= 2)
or
l = []
for row in reader:
    if len(row) >= 2:
        # append takes a single argument, so pass the pair as one tuple
        l.append((row[0], row[1].strip()))
d = OrderedDict(l)
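To locate the offending line rather than only skipping it, something along these lines may help. It is a sketch with made-up file content: csv.reader yields an empty list for a blank line, which is a common cause of this IndexError. The names SAMPLE and short_rows are hypothetical.

```python
import csv
import io
from collections import OrderedDict

# Hypothetical file content: line 2 is blank and line 4 lacks its third field.
SAMPLE = "in,Huawei1,DE\n\nout,Huawei2,UK\nin,Huawei4\n"

rows = list(csv.reader(io.StringIO(SAMPLE)))

# Report rows that are too short instead of failing with IndexError.
short_rows = [(number, row) for number, row in enumerate(rows, start=1) if len(row) < 3]

# Build the mapping from complete rows only.
d1 = OrderedDict((row[1], row[2].strip()) for row in rows if len(row) >= 3)

print(short_rows)
print(list(d1.items()))
```

With the real file, io.StringIO(SAMPLE) would be replaced by the opened 'asin.txt'; the reported row numbers then point at the lines to fix.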