Python : Find Duplicate Items

Python : Find Duplicate Items - python

I have data in columns of csv .I have an array from two columns of it.Iam using a List of list . I have string list like this
[[A,Bcdef],[Z,Wexy]
I want to identify duplicate entries i.e [A,Bcdef] and [A,Bcdef]
import csv
import StringIO
import os, sys
import hashlib
from collections import Counter
from collections import defaultdict
from itertools import takewhile, count
columns = defaultdict(list)
with open('person.csv','rU') as f:
reader = csv.DictReader(f) # read rows into a dictionary format
listoflists = [];
for row in reader: # read a row as {column1: value1, column2: value2,...}
a_list = [];
for (c,n) in row.items():
if c =="firstName":
try:
a_list.append(n[0])
except IndexError:
pass
for (c,n) in row.items():
if c=="lastName":
try:
a_list.append(n);
except IndexError:
pass
#print list(a_list);
listoflists.append(a_list);
#i += 1
print len(listoflists);
I have tried a couple of solutions proposed here
Using set (listoflist) always returns :unhashable type: 'list'
Functions : returns : 'list' object has no attribute 'values'
For example:
results = list(filter(lambda x: len(x) > 1, dict1.values()))
if len(results) > 0:
print('Duplicates Found:')
print('The following files are identical. the content is identical')
print('___________________')
for result in results:
for subresult in result:
print('\t\t%s' % subresult)
print('___________________')
else:
print('No duplicate files found.')
Any suggestions are welcomed.

Rather than lists, you can use tuples which are hashable.

You could build a set of the string representations of you lists, which are quite hashable.

l = [ ['A', "BCE"], ["B", "CEF"], ['A', 'BCE'] ]
res = []
dups = []
s = sorted(l, key=lambda x: x[0]+x[1])
previous = None
while s:
i = s.pop()
if i == previous:
dups.append(i)
else:
res.append(i)
previous = i
print res
print dups

Assuming you just want to get rid of duplicates and don't care about the order, you could turn your lists into strings, throw them into a set, and then turn them back into a list of lists.
foostrings = [x[0] + x[1] for x in listoflists]
listoflists = [[x[0], x[1:]] for x in set(foostrings)]
Another option, if you're going to be dealing with a bunch of tabular data, is to use pandas.
import pandas as pd
df = pd.DataFrame(listoflists)
deduped_df = df.drop_duplicates()

Related

get the file names list by another list

I am new to Python and felt some kind of confused...
I have a List like:
List_all = ["aawoobbcc", "aawoobbca", "aabbcskindd","asakindsbbss","wooedakse","sdadakindwsd","xxxxsdsd"]
and also a keyword list:
Key = ["woo","kind"]
and I want to get something like this:
[
["aawoobbcc", "aawoobbca","wooedakse"],
["aabbcskindd","asakindsbbss","sdadakindwsd"]
]
I have tried list_sub = [file for file in List_all if Key in file]
or list_sub = [file for file in List_all if k for k in Key in file]
which were not right.
how could I go through the elements in Key for the substring of elements in List?
Thanks a lot!

One approach (O(n^2)), is the following:
res = [[e for e in List_all if k in e] for k in Key]
print(res)
Output
[['aawoobbcc', 'aawoobbca', 'wooedakse'], ['aabbcskindd', 'asakindsbbss', 'sdadakindwsd']]
A simpler to understand solution (for newbies) is to use nested for loops:
res = []
for k in Key:
filtered = []
for e in List_all:
if k in e:
filtered.append(e)
res.append(filtered)
A more advanced solution, albeit more performant (for really long lists), is to use a regular expression in conjunction with a defaultdict:
import re
from collections import defaultdict
List_all = ["aawoobbcc", "aawoobbca", "aabbcskindd", "asakindsbbss", "wooedakse", "sdadakindwsd", "xxxxsdsd"]
Key = ["woo", "kind"]
extract_key = re.compile(f"{'|'.join(Key)}")
table = defaultdict(list)
for word in List_all:
if match := extract_key.search(word):
table[match.group()].append(word)
res = [table[k] for k in Key if k in table]
print(res)
Output
[['aawoobbcc', 'aawoobbca', 'wooedakse'], ['aabbcskindd', 'asakindsbbss', 'sdadakindwsd']]
Note that this solution consider that each string contains only one key.

how to separate a dictionary item from a nested list

I am stuck in a project where I have to seperate all Dictionary item from a list and create a dataframe from that. Below is the json file link.
Link:- https://drive.google.com/file/d/1H76rjDEZweVGzPcziT5Z6zXqzOSmVZQd/view?usp=sharing
I had written this code which coverting the all list item into string. hence I am able to seperate them into a new list. However the collected item is not getting coverted into a dataframe. Your help will be highly appriciated.
read_cont = []
new_list1 = []
new_list2 = []
for i in rjson:
for j in rjson[i]:
read_cont.append(rjson[i][j])
data_filter = read_cont[1]
for item in data_filter:
for j in item:
new_list1.append(item[j])
new_list1 = map(str,new_list1)
for i in new_list1:
if len(i) > 100:
new_list2.append(i)
header_names = ["STRIKE PRICE","EXPIRY","underlying", "identifier","OPENINTEREST","changeinOpenInterest","pchangeinOpenInterest", "totalTradedVolume","impliedVolatility","lastPrice","change","pChange", "totalBuyQuantity","totalSellQuantity","bidQty","bidprice","askQty","askPrice","underlyingValue"]
df = pd.DataFrame(new_list2,columns=header_names)`
It should be looking something like this.........
Columns: [STRIKE PRICE, EXPIRY, underlying, identifier, OPENINTEREST, changeinOpenInterest, pchangeinOpenInterest, totalTradedVolume, impliedVolatility, lastPrice, change, pChange, totalBuyQuantity, totalSellQuantity, bidQty, bidprice, askQty, askPrice, underlyingValue]
Index: []

import json
import pandas as pd
h = json.load(open('scrap.json'))
mdf = pd.DataFrame()
for i in h['records']['data']:
for k in i:
if isinstance(i[k], dict):
df = pd.DataFrame(i[k], index=[0])
mdf = pd.concat([mdf, df])
continue
print(mdf)

While loop within for loop for list of lists

I'm trying to create a big list that will contain lists of strings. I iterate over the input list of strings and create a temporary list.
Input:
['Mike','Angela','Bill','\n','Robert','Pam','\n',...]
My desired output:
[['Mike','Angela','Bill'],['Robert','Pam']...]
What i get:
[['Mike','Angela','Bill'],['Angela','Bill'],['Bill']...]
Code:
for i in range(0,len(temp)):
temporary = []
while(temp[i] != '\n' and i<len(temp)-1):
temporary.append(temp[i])
i+=1
bigList.append(temporary)

Use itertools.groupby
from itertools import groupby
names = ['Mike','Angela','Bill','\n','Robert','Pam']
[list(g) for k,g in groupby(names, lambda x:x=='\n') if not k]
#[['Mike', 'Angela', 'Bill'], ['Robert', 'Pam']]

Fixing your code, I'd recommend iterating over each element directly, appending to a nested list -
r = [[]]
for i in temp:
if i.strip():
r[-1].append(i)
else:
r.append([])
Note that if temp ends with a newline, r will have a trailing empty [] list. You can get rid of that though:
if not r[-1]:
del r[-1]
Another option would be using itertools.groupby, which the other answerer has already mentioned. Although, your method is more performant.

Your for loop was scanning over the temp array just fine, but the while loop on the inside was advancing that index. And then your while loop would reduce the index. This caused the repitition.
temp = ['mike','angela','bill','\n','robert','pam','\n','liz','anya','\n']
# !make sure to include this '\n' at the end of temp!
bigList = []
temporary = []
for i in range(0,len(temp)):
if(temp[i] != '\n'):
temporary.append(temp[i])
print(temporary)
else:
print(temporary)
bigList.append(temporary)
temporary = []

You could try:
a_list = ['Mike','Angela','Bill','\n','Robert','Pam','\n']
result = []
start = 0
end = 0
for indx, name in enumerate(a_list):
if name == '\n':
end = indx
sublist = a_list[start:end]
if sublist:
result.append(sublist)
start = indx + 1
>>> result
[['Mike', 'Angela', 'Bill'], ['Robert', 'Pam']]

Compare two csv files

I am trying to compare two csv files to look for common values in column 1.
import csv
f_d1 = open('test1.csv')
f_d2 = open('test2.csv')
f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)
for x,y in zip(f1_csv,f2_csv):
print(x,y)
I am trying to compare x[0] with y[0]. I am fairly new to python and trying to find the most pythonic way to achieve the results. Here is the csv files.
test1.csv
Hadrosaurus,1.2
Struthiomimus,0.92
Velociraptor,1.0
Triceratops,0.87
Euoplocephalus,1.6
Stegosaurus,1.4
Tyrannosaurus Rex,2.5
test2.csv
Euoplocephalus,1.87
Stegosaurus,1.9
Tyrannosaurus Rex,5.76
Hadrosaurus,1.4
Deinonychus,1.21
Struthiomimus,1.34
Velociraptor,2.72

I believe you're looking for the set intersection:
import csv
f_d1 = open('test1.csv')
f_d2 = open('test2.csv')
f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)
x = set([item[0] for item in f1_csv])
y = set([item[0] for item in f2_csv])
print(x & y)

Assuming that the files are not prohibitively large, you can read both of them with a CSV reader, convert the first columns to sets, and calculate the set intersection:
with open('test1.csv') as f:
set1 = set(x[0] for x in csv.reader(f))
with open('test2.csv') as f:
set2 = set(x[0] for x in csv.reader(f))
print(set1 & set2)
#{'Hadrosaurus', 'Euoplocephalus', 'Tyrannosaurus Rex', 'Struthiomimus',
# 'Velociraptor', 'Stegosaurus'}

I added a line to test whether the numerical values in each row are the same. You can modify this to test whether, for instance, the values are within some distance of each other:
import csv
f_d1 = open('test1.csv')
f_d2 = open('test2.csv')
f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)
for x,y in zip(f1_csv,f2_csv):
if x[1] == y[1]:
print('they match!')

Take advantage of the defaultdict in Python and you can iterate both the files and maintain the count in a dictionary like this
from collections import defaultdict
d = defaultdict(list)
for row in f1_csv:
d[row[0]].append(row[1])
for row in f2_csv:
d[row[0]].append(row[1])
d = {k: d[k] for k in d if len(d[k]) > 1}
print(d)
Output:
{'Hadrosaurus': ['1.2', '1.4'], 'Struthiomimus': ['0.92', '1.34'], 'Velociraptor': ['1.0', '2.72'],
'Euoplocephalus': ['1.6', '1.87'], 'Stegosaurus': ['1.4', '1.9'], 'Tyrannosaurus Rex': ['2.5', '5.76']}

Delete empty value from (sorted) dictionary

I've got an arduino sending me serial data which is transposed into a dictionary.
However, not all entries have a value due data being sent at random.
Before sending the dictionary data to a CSV file I want to prune the empty values or values that are 0 from the dict.
Incoming data would look like this: (values only)
['','7','','49,'','173','158']
I want that to become
['7','49','173','158].
The script I currently use:
import serial
import time
def delete_Blanks(arrayName):
tempArray = array.copy()
for key, value in sorted(tempArray.items()):
if value == "":
del tempArray[key]
else:
print "Value is not nil"
return tempArray
array = {}
ser = serial.Serial('COM2', 9600, timeout=1)
key = 0
while 1:
length = len(array)
if len(array) in range(0,5):
array.update({key:ser.read(1000)})
key = key + 1
print "key is ", key
print array.values()
length = len(array)
else:
newArray = delete_Blanks(array)
print newArray.items()
break

from itertools import compress
l = ['','7','','49','','173','158']
ret = compress(l, map(lambda x: bool(x), l))
print(list(ret))
will output:
['7', '49', '173', '158']
if you have long arrays of data - it's better to work with iterators to avoid memory leaks. If you work with short lists - list comprehension is just fine

You can use a dictionary comprehension. This will remove all false values from a dictionary d:
d={key,d[key] for key in d if d[key]}

If it's just a plain list you can do something like this
Mylist = filter(None, Mylist)
Before creating the dictionary you can filter the two list, the list containing the keys and the list containing the values. Assuming both list are the same length you can then
mydict = dict(zip(l1, l2))
to create your new list

>>> li = ['','7','','49','','173','158']
>>> [e for e in li if e]
['7', '49', '173', '158']

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python : Find Duplicate Items - python

Rather than lists, you can use tuples which are hashable.

You could build a set of the string representations of you lists, which are quite hashable.

l = [ ['A', "BCE"], ["B", "CEF"], ['A', 'BCE'] ] res = [] dups = [] s = sorted(l, key=lambda x: x[0]+x[1]) previous = None while s: i = s.pop() if i == previous: dups.append(i) else: res.append(i) previous = i print res print dups

Related

get the file names list by another list

how to separate a dictionary item from a nested list

While loop within for loop for list of lists

Compare two csv files

Delete empty value from (sorted) dictionary

Categories

Resources