Compare two csv files - python

I am trying to compare two csv files to look for common values in column 1.
import csv
f_d1 = open('test1.csv')
f_d2 = open('test2.csv')
f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)
for x, y in zip(f1_csv, f2_csv):
    print(x, y)
I am trying to compare x[0] with y[0]. I am fairly new to Python and am trying to find the most pythonic way to achieve this. Here are the CSV files.
test1.csv
Hadrosaurus,1.2
Struthiomimus,0.92
Velociraptor,1.0
Triceratops,0.87
Euoplocephalus,1.6
Stegosaurus,1.4
Tyrannosaurus Rex,2.5
test2.csv
Euoplocephalus,1.87
Stegosaurus,1.9
Tyrannosaurus Rex,5.76
Hadrosaurus,1.4
Deinonychus,1.21
Struthiomimus,1.34
Velociraptor,2.72

I believe you're looking for the set intersection:
import csv
f_d1 = open('test1.csv')
f_d2 = open('test2.csv')
f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)
x = {item[0] for item in f1_csv}
y = {item[0] for item in f2_csv}
print(x & y)

Assuming that the files are not prohibitively large, you can read both of them with a CSV reader, convert the first columns to sets, and calculate the set intersection:
import csv

with open('test1.csv') as f:
    set1 = set(x[0] for x in csv.reader(f))
with open('test2.csv') as f:
    set2 = set(x[0] for x in csv.reader(f))
print(set1 & set2)
#{'Hadrosaurus', 'Euoplocephalus', 'Tyrannosaurus Rex', 'Struthiomimus',
# 'Velociraptor', 'Stegosaurus'}

I added a line to test whether the numerical values in each row are the same. You can modify this to test whether, for instance, the values are within some distance of each other:
import csv
f_d1 = open('test1.csv')
f_d2 = open('test2.csv')
f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)
for x, y in zip(f1_csv, f2_csv):
    if x[1] == y[1]:
        print('they match!')
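For example, a tolerance check could look like this (a minimal sketch; it assumes the rows of the two files line up pairwise, as zip does above, and that column 2 always parses as a float; the TOLERANCE value is an arbitrary choice):
import csv

TOLERANCE = 0.5  # hypothetical threshold; adjust to taste

with open('test1.csv') as f1, open('test2.csv') as f2:
    for x, y in zip(csv.reader(f1), csv.reader(f2)):
        # Compare the numeric second columns within a tolerance
        # rather than for exact string equality.
        if abs(float(x[1]) - float(y[1])) <= TOLERANCE:
            print(x[0], y[0], 'are within', TOLERANCE)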

Take advantage of defaultdict: you can iterate over both files and collect the values for each name in a dictionary like this
import csv
from collections import defaultdict

f1_csv = csv.reader(open('test1.csv'))
f2_csv = csv.reader(open('test2.csv'))

d = defaultdict(list)
for row in f1_csv:
    d[row[0]].append(row[1])
for row in f2_csv:
    d[row[0]].append(row[1])

# Keep only the names that collected more than one value
d = {k: d[k] for k in d if len(d[k]) > 1}
print(d)
Output:
{'Hadrosaurus': ['1.2', '1.4'], 'Struthiomimus': ['0.92', '1.34'], 'Velociraptor': ['1.0', '2.72'],
'Euoplocephalus': ['1.6', '1.87'], 'Stegosaurus': ['1.4', '1.9'], 'Tyrannosaurus Rex': ['2.5', '5.76']}
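Note that the final dict comprehension keeps only the names that picked up a value from both files (assuming each name occurs at most once per file), which is exactly the set of common entries the question asks for.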

Related

Joining two CSV files (inner join) based on a common column in Python without Pandas

I am trying to join two CSV files based on one common column. I am reading each CSV file and storing it as a list of tuples. My code:
def read_csv(path):
    file = open(path, "r")
    content_list = []
    for line in file.readlines():
        record = line.split(",")
        for i in range(len(record)):
            record[i] = record[i].replace("\n", "")
        content_list.append(tuple(record))
    return content_list
a_list = read_csv("a.csv")
b_list = read_csv("b.csv")
This gives me a list with the CSV header as the first tuple in the list:
a_list
[('user_id', 'activeFl'),
('80c611f1-532a-4f7d-aa80-f28b472c0dbe', 'True'),
('4d04ab57-1b50-4474-bd12-b2b16ed2cca3', 'True'),
('0f37a42a-a984-4402-97bd-0eac95fa95d1', 'True'),
('dbe15b19-0128-4e3a-a82b-c8154d272c18', 'True'), ......]
b_list
[('id','date','user_id','blockedFl','amount','type'),
('b7819826-6468-4416-9953-e739d8046b81','2021-04-23','18a382ef-bd38-4884-8bf','True','9.04','6'), ....]
I would like to merge these two lists based on the user_id, but I am stuck at this point. What can I try next?
The O(N^2) solution is:
result = list()
for left in a_list[1:]:
    for right in b_list[1:]:
        # user_id is column 0 in a.csv but column 2 in b.csv
        if left[0] == right[2]:
            result.append(right + left[1:])
            break
O(N) using dictionary:
result = list()
b_dict = {x[2]: x for x in b_list[1:]}  # key on user_id, the third column of b.csv
for left in a_list[1:]:
    if left[0] in b_dict:
        result.append(b_dict[left[0]] + left[1:])
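Building b_dict is a single pass over b_list, and each membership test against it is O(1) on average, which is where the overall O(N) comes from.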
This is one approach using the csv module and a dict.
Ex:
import csv

def read_csv(path, key_index=0):
    with open(path) as infile:
        reader = csv.reader(infile)
        header = next(reader)  # skip the header row
        content = {row[key_index]: row for row in reader}
    return content

a_list = read_csv("a.csv")               # user_id is the first column
b_list = read_csv("b.csv", key_index=2)  # user_id is the third column
merge_data = {k: v + a_list[k][1:] for k, v in b_list.items() if k in a_list}
print(merge_data)  # OR print(list(merge_data.values()))

Using x.isdigit() for floats?

I want my code to compute the sum of the values in the numeric column X per value of column Y
reader = csv.reader(f)
csv_l = list(reader)
rows = len(csv_l) - 1
columns = len(csv_l[0])
without_header = csv_l[1:]
number_list = [[int(x) if x.isdigit() else x for x in lst] for lst in without_header]
my_dict = {}
for d in number_list:
    if d[0] in my_dict.keys():
        my_dict[d[0]] += d[3]
    else:
        my_dict[d[0]] = d[3]
If the values in the input CSV column are integers, it works perfectly, but I have found that if a value is a float, isdigit() returns False, so the value stays a string and the "sum" comes out as the floats pieced together as strings instead of an addition.
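The failure is easy to reproduce in isolation (a minimal illustration):
>>> "12".isdigit()
True
>>> "1.5".isdigit()   # the dot is not a digit
False
>>> "1.5" + "2.5"     # string "addition" is concatenation
'1.52.5'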
I used pandas for this and there it works, but I want it in "pure Python".
dataframe = pd.read_csv(filePath)
new_dataframe = dataframe.groupby('Column Y')['Column X'].sum().reset_index(name='Sum of Values')
return new_dataframe
Is this close to what you are trying to achieve?
Setting up the data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': np.random.choice(list('ABCD'), size=16),
                   'Y': np.random.random(size=16)})
table = df.values
Grouping by X column without pandas, iteratively filling a dict:
res = {}
for n, v in table:
    if n in res:
        res[n] += v
    else:
        res[n] = v
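The same grouping can be written more compactly with dict.get (a sketch under the same assumptions about table):
res = {}
for n, v in table:
    # get() supplies 0 the first time a key is seen
    res[n] = res.get(n, 0) + v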
Perhaps something along these lines is what you are looking for?
reader = csv.reader(f)
csv_l = list(reader)
rows = len(csv_l) - 1
columns = len(csv_l[0])
without_header = csv_l[1:]

def x_to_num(x):
    # Try int first, then float, so "1.5" becomes 1.5
    # instead of staying a string.
    for converter in (int, float):
        try:
            return converter(x)
        except ValueError:
            pass
    return x

number_list = [[x_to_num(x) for x in lst] for lst in without_header]
my_dict = {}
for d in number_list:
    if d[0] in my_dict:
        my_dict[d[0]] += d[3]
    else:
        my_dict[d[0]] = d[3]
If x is numeric, this converts it to an int (or to a float when it has a decimal point); otherwise it leaves x as-is, which covers exactly the case where isdigit() fell short.

Python : Find Duplicate Items

I have data in the columns of a CSV file, and I build an array from two of those columns as a list of lists. I have a string list like this:
[['A', 'Bcdef'], ['Z', 'Wexy']]
I want to identify duplicate entries, i.e. ['A', 'Bcdef'] occurring more than once.
import csv
from collections import defaultdict

columns = defaultdict(list)
with open('person.csv', 'rU') as f:
    reader = csv.DictReader(f)  # read rows into a dictionary format
    listoflists = []
    for row in reader:  # read a row as {column1: value1, column2: value2, ...}
        a_list = []
        for (c, n) in row.items():
            if c == "firstName":
                try:
                    a_list.append(n[0])
                except IndexError:
                    pass
        for (c, n) in row.items():
            if c == "lastName":
                try:
                    a_list.append(n)
                except IndexError:
                    pass
        listoflists.append(a_list)

print len(listoflists)
I have tried a couple of the solutions proposed here.
Using set(listoflists) always raises: unhashable type: 'list'
Function-based approaches return: 'list' object has no attribute 'values'
For example:
results = list(filter(lambda x: len(x) > 1, dict1.values()))
if len(results) > 0:
    print('Duplicates Found:')
    print('The following files are identical. the content is identical')
    print('___________________')
    for result in results:
        for subresult in result:
            print('\t\t%s' % subresult)
    print('___________________')
else:
    print('No duplicate files found.')
Any suggestions are welcome.
Rather than lists, you can use tuples, which are hashable.
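For example, a minimal sketch using the sample data from the question:
from collections import Counter

listoflists = [['A', 'Bcdef'], ['Z', 'Wexy'], ['A', 'Bcdef']]

# Tuples are hashable, so they can be counted or placed in a set
counts = Counter(tuple(row) for row in listoflists)
dups = [list(t) for t, n in counts.items() if n > 1]
print(dups)  # [['A', 'Bcdef']]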
You could also build a set of the string representations of your lists; strings are hashable too.
l = [['A', 'BCE'], ['B', 'CEF'], ['A', 'BCE']]
res = []
dups = []
s = sorted(l, key=lambda x: x[0] + x[1])
previous = None
while s:
    i = s.pop()
    if i == previous:
        dups.append(i)
    else:
        res.append(i)
    previous = i
print res
print dups
Assuming you just want to get rid of duplicates and don't care about the order, you could turn your lists into strings, throw them into a set, and then turn them back into a list of lists.
# Note: this assumes the first element is always a single character,
# so each string can be split back apart unambiguously.
foostrings = [x[0] + x[1] for x in listoflists]
listoflists = [[x[0], x[1:]] for x in set(foostrings)]
Another option, if you're going to be dealing with a bunch of tabular data, is to use pandas.
import pandas as pd
df = pd.DataFrame(listoflists)
deduped_df = df.drop_duplicates()

Write multiple lists to CSV

I have two lists:
x = [['a','b','c'], ['d','e','f'], ['g','h','i']]
y = [['j','k','l'], ['m','n','o'], ['p','q','r']]
I'd like to write lists x and y to a CSV file such that it reads in columns:
Col 1:
a
b
c
Col 2:
j
k
l
Col 3:
d
e
f
Col 4:
m
n
o
etc. I'm not really sure how to do this.
You can use zip to do the transpose and csv to create your output file, e.g.:
from itertools import chain
import csv

x = [['a','b','c'], ['d','e','f'], ['g','h','i']]
y = [['j','k','l'], ['m','n','o'], ['p','q','r']]

# Interleave the sublists of x and y, then transpose
res = zip(*chain.from_iterable(zip(x, y)))

with open('yourfile.csv', 'w', newline='') as fout:
    csvout = csv.writer(fout)
    csvout.writerows(res)
If the sublists have unequal lengths, then you may wish to look at itertools.zip_longest (izip_longest in Python 2) and specify a suitable fillvalue= instead of using the builtin zip.
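For instance, a padded variant might look like this (a sketch; the empty-string fill value is an arbitrary choice):
from itertools import chain, zip_longest

x = [['a','b','c'], ['d','e','f','extra']]
y = [['j','k','l'], ['m','n']]

# zip_longest pads the shorter sublists so the transpose stays rectangular
res = zip_longest(*chain.from_iterable(zip(x, y)), fillvalue='')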

How to convert a file into a dictionary?

I have a file comprising two columns, i.e.,
1 a
2 b
3 c
I wish to read this file to a dictionary such that column 1 is the key and column 2 is the value, i.e.,
d = {1:'a', 2:'b', 3:'c'}
The file is small, so efficiency is not an issue.
d = {}
with open("file.txt") as f:
    for line in f:
        (key, val) = line.split()
        d[int(key)] = val
This will leave the key as a string:
with open('infile.txt') as f:
    d = dict(x.rstrip().split(None, 1) for x in f)
You can also use a dict comprehension like:
with open("infile.txt") as f:
d = {int(k): v for line in f for (k, v) in [line.strip().split(None, 1)]}
def get_pair(line):
    key, sep, value = line.strip().partition(" ")
    return int(key), value

with open("file.txt") as fd:
    d = dict(get_pair(line) for line in fd)
By dictionary comprehension
d = { line.split()[0] : line.split()[1] for line in open("file.txt") }
Or by pandas:
import pandas as pd
d = pd.read_csv("file.txt", delimiter=" ", header=None, index_col=0)[1].to_dict()
Simple Option
Most methods for storing a dictionary use JSON, Pickle, or line reading. Provided you're not editing the dictionary outside of Python, this simple method should suffice even for complex dictionaries, although Pickle will be better for larger ones.
x = {1: 'a', 2: 'b', 3: 'c'}
f = 'file.txt'
print(x, file=open(f, 'w'))  # file.txt >>> {1: 'a', 2: 'b', 3: 'c'}
y = eval(open(f, 'r').read())  # note: eval trusts the file's contents completely
print(x == y)  # >>> True
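For comparison, the Pickle route mentioned above might look like this (a minimal sketch):
import pickle

x = {1: 'a', 2: 'b', 3: 'c'}

# Binary mode is required for pickle files
with open('file.pkl', 'wb') as f:
    pickle.dump(x, f)
with open('file.pkl', 'rb') as f:
    y = pickle.load(f)

print(x == y)  # >>> True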
If you love one liners, try:
import re
d=eval('{'+re.sub('\'[\s]*?\'','\':\'',re.sub(r'([^'+input('SEP: ')+',]+)','\''+r'\1'+'\'',open(input('FILE: ')).read().rstrip('\n').replace('\n',',')))+'}')
Input FILE = Path to file, SEP = Key-Value separator character
Not the most elegant or efficient way of doing it, but quite interesting nonetheless :)
IMHO a bit more pythonic to use generators (probably you need 2.7+ for this):
with open('infile.txt') as fd:
    pairs = (line.split(None) for line in fd)
    res = {int(pair[0]): pair[1] for pair in pairs if len(pair) == 2 and pair[0].isdigit()}
This will also filter out lines not starting with an integer or not containing exactly two items.
I had a requirement to take values from a text file and use them as key/value pairs. The content of my text file has the form key = value, so I used the split method with "=" as the separator and wrote the code below:
d = {}
file = open("filename.txt")
for x in file:
    f = x.split("=")
    d.update({f[0].strip(): f[1].strip()})
Using the strip method removes any spaces before or after the "=" separator, so you end up with the expected data in dictionary format.
import re

my_file = open('file.txt', 'r')
d = {}
for i in my_file:
    g = re.search(r'(\d+)\s+(.*)', i)  # match a line containing an int and a string
    d[int(g.group(1))] = g.group(2)
Here's another option...
import csv
import os

events = {}
# assumes `path` is defined; "rb" is the Python 2 mode, use newline='' in Python 3
for line in csv.reader(open(os.path.join(path, 'events.txt'), "rb")):
    if line[0][0] == "#":  # skip comment lines
        continue
    events[line[0]] = line[1] if len(line) == 2 else line[1:]
