Get the two instances with the most items in common - Python

I'm trying to analyze a database of trucks and the items they carry to find out which two trucks are the most similar to one another (i.e., which two trucks share the greatest number of items). I have a CSV similar to this:
truck_id | item_id
13 | 85394 *
16 | 294 *
13 | 294 *
89 | 3115
89 | 85394
13 | 294
16 | 85394 *
13 | 3115
In the above example, 16 and 13 are the most similar trucks, as they both have the 294 and 85394 items.
The entire code is too long so I'll offer pseudo code for what I'm doing:
truck_items = {}
#1
loop over the csv:
    add to truck_items each truck_id and an ARRAY of the items that truck has
#2
go over each truck in the truck_items dictionary, and compare its array to all other arrays
to get the count of similar items
#3
create a 'most_similar' key in the dictionary
#4
check in most_similar which two trucks have the most similarity
So I would end up with something like this:
{
13: [16, 2] // truck_1_id: [truck_2_id, number_similar_items]
89: ...
}
I understand this is not the most efficient way, as I'm going over the lists far more times than necessary. Is there a more efficient way?

A non-pandas solution, using built-in tools such as collections.defaultdict (optional) and itertools.product (also optional, but it helps push some calculations/loops down to the C level, which is beneficial if the data set is large enough).
I think the logic itself is self-explanatory.
from collections import defaultdict
from itertools import product

trucks = [
    (13, 294),
    (13, 294),
    (13, 3115),
    (13, 85394),
    (16, 294),
    (16, 85394),
    (89, 3115),
    (89, 85394),
]

d = defaultdict(set)
for truck, load in trucks:
    d[truck].add(load)

li = [({'truck': k1, 'items': v1},
       {'truck': k2, 'items': v2})
      for (k1, v1), (k2, v2) in product(d.items(), repeat=2)
      if k1 != k2]

truck_1_data, truck_2_data = max(li, key=lambda e: len(e[0]['items'] & e[1]['items']))
print(truck_1_data['truck'], truck_2_data['truck'])
outputs
13 16
Arguably a more readable version:
...
li = [{k1: v1,
       k2: v2}
      for (k1, v1), (k2, v2) in product(d.items(), repeat=2)
      if k1 != k2]

def dict_values_intersection_len(d):
    values = list(d.values())
    return len(values[0] & values[1])

truck_1, truck_2 = max(li, key=dict_values_intersection_len)
print(truck_1, truck_2)
which also outputs
13 16
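One further tweak (a sketch, not part of the original answer): itertools.combinations evaluates each unordered pair only once, instead of twice as product(..., repeat=2) with the k1 != k2 filter does:

from itertools import combinations

# same defaultdict d as above; combinations yields each pair of trucks exactly once
(truck_1, items_1), (truck_2, items_2) = max(
    combinations(d.items(), 2),
    key=lambda pair: len(pair[0][1] & pair[1][1]))
print(truck_1, truck_2)  # 13 16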

Use groupby to gather all records for a given truck. For each group, make a set of part numbers. Make a new data frame of that data:
truck_id | items
13 | {85394, 294, 3115}
16 | {294, 85394}
89 | {3115, 85394}
Now you need to make a full cross-product of this DF with itself; filter to remove self-reference and duplicates (13-16 and 16-13, for example). If you make the product with
truck_id_left < truck_id_right (I'll leave the implementation syntax to you, dependent on the package you use), you'll get only the unique pairs.
On that series of truck pairs, simply take the set intersection of their items:
trucks | items
(13, 16) | {85394, 294}
(13, 89) | {3115}
(16, 89) | {85394}
Then find the row with the max value on that intersection.
Can you handle each of those steps? They're all covered in pandas tutorials.
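For reference, a minimal sketch of those steps in pandas, using the sample data from the question (assumptions: the columns are named truck_id and item_id, and how='cross' is available, which requires pandas 1.2+):

import pandas as pd

df = pd.DataFrame({'truck_id': [13, 16, 13, 89, 89, 13, 16, 13],
                   'item_id': [85394, 294, 294, 3115, 85394, 294, 85394, 3115]})

# one row per truck, with its items as a set
sets = df.groupby('truck_id')['item_id'].apply(set).reset_index(name='items')

# full cross-product of the table with itself, then keep only the unique pairs
pairs = sets.merge(sets, how='cross', suffixes=('_left', '_right'))
pairs = pairs[pairs['truck_id_left'] < pairs['truck_id_right']].copy()

# set intersection of the two item sets, then take the largest
pairs['shared'] = [len(a & b) for a, b in zip(pairs['items_left'], pairs['items_right'])]
best = pairs.loc[pairs['shared'].idxmax()]
print(best['truck_id_left'], best['truck_id_right'])  # 13 16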

Here's a solution that seems like it might work:
I'm using pandas as my main data container; it just makes stuff like this easier.
import pandas as pd
from collections import Counter
Here I'm creating a similar dataset
#creating toy data
df = pd.DataFrame({'truck_id':[1,1,2,2,2,3,3],'item_id':[1,7,1,7,5,2,2]})
that looks like this
item_id truck_id
0 1 1
1 7 1
2 1 2
3 7 2
4 5 2
5 2 3
6 2 3
I'm reformatting it to have a list of items for each truck
#making it so each row is a truck, and the value is a list of items
df = df.groupby('truck_id')['item_id'].apply(list)
which looks like this:
truck_id
1 [1, 7]
2 [1, 7, 5]
3 [2, 2]
Now I'm creating a function that, given a df that looks like the previous one, counts the number of items shared by two trucks.
def get_num_similar(df, id0, id1):
    # drop duplicates from each truck, so there's only one of each item per truck,
    # then combine the two lists into one list of items from both trucks
    comp = [*list(set(df.loc[id0])), *list(set(df.loc[id1]))]
    # count how many times each item appears (should be 1 or 2)
    quants = dict(Counter(comp))
    # items appearing more than once are carried by both trucks
    num_similar = len([quant for quant in quants.values() if quant > 1])
    return num_similar
running this:
print(get_num_similar(df, 1, 2))
results in an output of 2, which is accurate. Now just iterate over all groups of trucks you want to analyze, and you can calculate which trucks have the most shared things.
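To finish that last step, a small sketch (assuming the grouped df and get_num_similar defined above) that checks every pair of trucks and keeps the best one:

from itertools import combinations

# compare every pair of truck ids present in the grouped series
best_pair = max(combinations(df.index, 2),
                key=lambda pair: get_num_similar(df, *pair))
print(best_pair, get_num_similar(df, *best_pair))  # (1, 2) 2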

Related

Sorting on multiple keys from heterogenous tuple in values of a python dictionary [duplicate]

This question already has answers here:
How do I sort a dictionary by value?
(34 answers)
Closed 1 year ago.
Input:
{'Thiem': (3, 0, 10, 104, 11, 106),
'Medvedev': (1, 2, 11, 106, 10, 104),
'Barty': (0, 2, 8, 74, 9, 76),
'Osaka': (0, 4, 9, 76, 8, 74)}
The expected output should be sorted by the dictionary values, field by field: descending on the 1st field, breaking ties on the 2nd, 3rd and 4th fields (all descending), then ascending on the 5th and 6th fields. I tried using the sorted() method in a couple of ways.
output:
Thiem 3 0 10 104 11 106
Medvedev 1 2 11 106 10 104
Osaka 0 4 9 76 8 74
Barty 0 2 8 74 9 76
Kindly assist or suggest an approach.
Edit:
Updated description for more clarity. Below is the code I tried:
>>> results = []
>>> for (k, v) in d.items():
...     results.append(v)
>>> results.sort(key=lambda x: (x[4], x[5]))
>>> results.sort(key=lambda x: (x[0], x[1], x[2], x[3]), reverse=True)
I believe you are trying to compare the first number (element) of every tuple with one another, and the key with the greatest number should go on top (index 0). If the numbers are equal, you would instead compare the second number of every tuple, and so on, until you reach the final element in that tuple. If so, then:
def dic_sorted(dic):
    for r in range(len(dic) - 1):
        i = 0
        values = list(dic.values())
        while values[r][i] == values[r + 1][i]:
            i += 1
        if values[r][i] < values[r + 1][i]:
            key_shift(dic, values[r])
    return dic

def key_shift(dic, v1):
    keys = list(dic.keys())
    values = list(dic.values())
    temp_key = keys[values.index(v1)]
    del dic[temp_key]
    dic[temp_key] = v1

for i in range(5):  # number of iterations depends on the complexity of your dictionary
    dic_sorted(data)
print(data)
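For reference, a more compact alternative (a sketch, not part of the original answer, assuming the dictionary is named d as in the attempt above): sorted() with a composite key, negating the first four fields so they sort descending while the last two stay ascending. This also reproduces the expected output, including the ascending tie-break on fields 5 and 6:

d = {'Thiem': (3, 0, 10, 104, 11, 106),
     'Medvedev': (1, 2, 11, 106, 10, 104),
     'Barty': (0, 2, 8, 74, 9, 76),
     'Osaka': (0, 4, 9, 76, 8, 74)}

ordered = sorted(d.items(),
                 key=lambda kv: (-kv[1][0], -kv[1][1], -kv[1][2], -kv[1][3],
                                 kv[1][4], kv[1][5]))
for name, vals in ordered:
    print(name, *vals)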

Automating the process of identifying subgroups of a pandas dataframe that do not significantly differ on a value

I have the following dataframe, which, for the sake of this example, is full of random numbers:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
df = pd.DataFrame(np.random.randint(0,1000,size=(100, 4)), columns=list('ABCD'))
df['Category'] = np.random.randint(1, 3, df.shape[0])
df.head()
A B C D Category
0 417 88 924 844 2
1 647 136 57 680 2
2 223 883 837 56 2
3 346 94 19 80 1
4 635 863 405 29 1
I need to find a subset of n rows (say, 80 rows) in which the two category groups (categories 1 and 2) do not significantly differ (p > .05) on the value "C".
I perform the following t-test to test if the difference is significant:
# t-test
cat1 = df[df['Category']==1]
cat2 = df[df['Category']==2]
ttest_ind(cat1['C'], cat2['D'])
Output:
Ttest_indResult(statistic=-2.004339328381308, pvalue=0.047793084338372295)
Currently, I am doing this manually, by trial and error: I pick a subset, test it, and retest until I find the desired result. I am curious to hear if there is a way to automate this process.
Here is my suggestion: use combinations from itertools, as rightfully suggested by @rpanai, together with groupby and pipe, which lets you compare the different groups in a single operation. You store a Boolean for whether the p-value is above the 0.05 threshold and break the loop when the Boolean is True:
from itertools import combinations

np.random.seed(123)
df = pd.DataFrame(np.random.randint(0, 1000, size=(100, 4)), columns=list('ABCD'))
df['Category'] = np.random.randint(1, 3, df.shape[0])
df.head()

list_iter = [idx for idx in combinations(df.Category.unique(), 2)]
test = dict()
for i, j in list_iter:
    test[(i, j)] = df.groupby("Category").pipe(lambda g: ttest_ind(g["C"].get_group(i),
                                                                   g["C"].get_group(j))[1] > 0.05)
    if test[(i, j)]:
        break
Here, in this example, the dictionary test is:
{(2, 1): True}
It works with any number of groups; for instance, if Category has three groups, with df['Category'] = np.random.randint(1, 4, df.shape[0]), the output for test would look like:
{(2, 3): True}
EDIT: If you want the values of A for a successful test, you can do the following:
list_iter = [idx for idx in combinations(df.Category.unique(), 2)]
test = dict()
for i, j in list_iter:
    test[(i, j)] = df.groupby("Category").pipe(lambda g: ttest_ind(g["C"].get_group(i),
                                                                   g["C"].get_group(j))[1] > 0.05)
    if test[(i, j)]:
        output = df.loc[df["Category"].isin([i, j]), ["Category", "A"]]
        break
I replaced D with C because, re-reading your question, you say you want to compare C across different values of Category. If it is not C but both C and D, combinations won't iterate across all the groups you want.
I also changed the Boolean to being above 0.05, since you want the groups that are not significantly different.
Here I have the following result for test:
{(2, 3): True}
and for output:
Category A
0 2 510
1 3 988
2 2 595
You get the values of A for the two categories 2 and 3 where values of C were not significantly different.
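For the original goal of finding a fixed-size subset (say, 80 rows) on which the groups do not differ, here is a minimal sketch of the resampling loop the question describes (assumptions: simple random sampling, two categories, and stopping at the first subset with p > .05; this is not part of the original answer):

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

np.random.seed(123)
df = pd.DataFrame(np.random.randint(0, 1000, size=(100, 4)), columns=list('ABCD'))
df['Category'] = np.random.randint(1, 3, df.shape[0])

subset = None
for _ in range(10_000):                  # cap the number of attempts
    candidate = df.sample(n=80)          # random subset of 80 rows
    c1 = candidate.loc[candidate['Category'] == 1, 'C']
    c2 = candidate.loc[candidate['Category'] == 2, 'C']
    if len(c1) > 1 and len(c2) > 1 and ttest_ind(c1, c2)[1] > 0.05:
        subset = candidate
        break
# if subset is not None, it holds 80 rows whose category groups do not differ significantly on C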

Python, dataframes: How to make dictionary/series/dataframe for each value whose key is duplicate in python dataframe?

I have dataframes created like this:
+---------+--------+-----+-------------+
| VideoID | long | lat | viewerCount |
+---------+--------+-----+-------------+
| 123 | -1.1 | 1.1 | 25 |
+---------+--------+-----+-------------+
| 123 | -1.1 | 1.1 | 20 |
+---------+--------+-----+-------------+
The videoIDs are the IDs of videos live-streaming on Facebook, and viewerCount is the number of viewers watching them.
I add new values, refreshing every 30 seconds. The videoIDs will mostly be duplicates, but the viewerCount can change.
So what I am trying to do is store each new viewerCount without duplicating videoIDs
(i.e., viewerCount is no longer a single column but could be a dictionary or series). Something like this:
So, I have a comment before I answer your question. Unless you are dealing with "big data" (where the in-memory join operation costs more than the storage space and possible update costs), it is advisable to split your table into two (a rough sketch follows this list):
- The first will contain the video details: Video_id*, longitude, latitude, location
- The second will contain Video_id, refreshes and Views
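A rough sketch of that split, using the sample values from the question (the column names refresh_idx and viewer_count are illustrative, not from the original answer):

import pandas as pd

# table 1: one row per video, with its static details
videos = pd.DataFrame({'video_id': [123],
                       'long': [-1.1],
                       'lat': [1.1]})

# table 2: one row per refresh, recording the viewer count at that moment
views = pd.DataFrame({'video_id': [123, 123],
                      'refresh_idx': [0, 1],
                      'viewer_count': [25, 20]})

# join them back together only when needed
combined = views.merge(videos, on='video_id', how='left')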
With that being said, there are several options to reach this final representation. The one I would use myself is storing the viewer counts as a list. Lists are beneficial, since you can drop the refresh number altogether: it can be recalculated from the element's index.
Using dicts would be needlessly expensive and complex in this case, but I will add the syntax as well.
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': list("aabb"),
                   'location': list("xxyy"),
                   'views': [3, 4, 1, 2]})
#   id location  views
# 0  a        x      3
# 1  a        x      4
# 2  b        y      1
# 3  b        y      2

grouped_df = (df
              .groupby(["id", "location"])["views"]  # create a group for each [id, location] and select views
              .apply(np.hstack)                      # collect the grouped views into an array
              # .apply(lambda x: dict(zip(range(len(x)), x)))  # dict variant
              .reset_index())                        # move id and location back to regular columns
#   id location   views
# 0  a        x  [3, 4]
# 1  b        y  [1, 2]
Update:
You've mentioned in the comments the problem of nested lists during the iterations. You could replace list by np.hstack.
# second iteration
iter_2 = pd.DataFrame({'id': list("aabb"),
                       'location': list("xxyy"),
                       'views': [30, 40, 10, 20]})

grouped_df = (pd.concat([grouped_df, iter_2])        # add the rows of the new dataframe to grouped_df
              .groupby(["id", "location"])["views"]  # (pd.concat is used here because DataFrame.append was removed in pandas 2.0)
              .apply(np.hstack)
              .reset_index())
#   id location           views
# 0  a        x  [3, 4, 30, 40]
# 1  b        y  [1, 2, 10, 20]

Obtaining the element(s) that appears in 3 or more lists

Say I have 5 lists in total
# Sample data
a1 = [1,2,3,4,5,6,7]
a2= [1,21,35,45,58]
a3= [1,2,15,27,36]
a4=[2,3,1,45,85,51,105,147,201]
a5=[3,458,665]
I need to find the elements of a1 that also appear in a2, a3, a4 and a5, with at least 3 occurrences in total, counting the one in a1
or
I need the elements with a frequency greater than or equal to 3 across all the lists (a1-a5) combined, along with their frequency.
From the above example, the expected output would be:
1 with a frequency of 4
2 with a frequency of 3
3 with a frequency of 3
For my actual problem, the number of lists as well as their lengths are huge. Can anyone suggest a simple and fast approach?
Thanks,
Prithivi
As Patrick writes in the comments, chain and Counter are your friends here:
import itertools
import collections

targets = [1, 2, 3, 4, 5, 6, 7]
lists = [
    [1, 21, 35, 45, 58],
    [1, 2, 15, 27, 36],
    [2, 3, 1, 45, 85, 51, 105, 147, 201],
    [3, 458, 665],
]

chained = itertools.chain(*lists)
counter = collections.Counter(chained)
# counter only counts occurrences in a2..a5, so >= 2 here means >= 3 including the occurrence in a1
result = [(t, counter[t]) for t in targets if counter[t] >= 2]
such that
>>> result
[(1, 3), (2, 2), (3, 2)]
You say that you have a lot of lists, and each list is long. Try this solution and see how long it takes. If it needs to be sped up, then that's another question. It may be that collections.Counter is too slow for your application.
a1 = [1, 2, 3, 4, 5, 6, 7]
a2 = [1, 21, 35, 45, 58]
a3 = [1, 2, 15, 27, 36]
a4 = [2, 3, 1, 45, 85, 51, 105, 147, 201]
a5 = [3, 458, 665]

b = a1 + a2 + a3 + a4 + a5  # combine all lists into b
for x in set(b):            # iterate through the distinct values of b
    print(x, 'with a frequency of', b.count(x))  # print the count
will give you:
1 with a frequency of 4
2 with a frequency of 3
3 with a frequency of 3
4 with a frequency of 1
5 with a frequency of 1
6 with a frequency of 1
7 with a frequency of 1
35 with a frequency of 1
36 with a frequency of 1
...
Edit:
Using:
import random

for x in range(9000):
    a1.append(random.randint(1, 10000))
    a2.append(random.randint(1, 10000))
    a3.append(random.randint(1, 10000))
    a4.append(random.randint(1, 10000))
I made the lists much, much longer and, using time, checked how long the program took (with it saving the results instead of printing them); it took 4.9395 seconds. I hope that is fast enough.
This solution using pandas is quite fast
import pandas as pd

a1 = [1, 2, 3, 4, 5, 6, 7]
a2 = [1, 21, 35, 45, 58]
a3 = [1, 2, 15, 27, 36]
a4 = [2, 3, 1, 45, 85, 51, 105, 147, 201]
a5 = [3, 458, 665]

# convert each list to a DataFrame with an indicator column
A = [a1, a2, a3, a4, a5]
D = [pd.DataFrame({'A': a, 'ind{0}'.format(i): [1] * len(a)}) for i, a in enumerate(A)]

# left join each dataframe onto a1
# if you know the integers are distinct then you don't need drop_duplicates
df = pd.merge(D[0], D[1].drop_duplicates(['A']), how='left', on='A')
for d in D[2:]:
    df = pd.merge(df, d.drop_duplicates(['A']), how='left', on='A')

# sum across the indicators
df['freq'] = df[['ind{0}'.format(i) for i, d in enumerate(D)]].sum(axis=1)

# drop frequencies less than 3
print(df[['A', 'freq']].loc[df['freq'] >= 3])
A test using larger input below runs in well under 0.2 seconds on my machine
import numpy.random as npr
a1 = range(10000)
a2 = npr.randint(10000, size=100000)
a3 = npr.randint(10000, size=100000)
a4 = npr.randint(10000, size=100000)
a5 = npr.randint(10000, size=100000)

Most efficient way to sum huge 2D NumPy array, grouped by ID column?

I have a massive data array (500k rows) that looks like:
id value score
1 20 20
1 10 30
1 15 0
2 12 4
2 3 8
2 56 9
3 6 18
...
As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.
I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id
With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.
table_trunc = table[(table == id).any(1)]
score = sum(table_trunc[:,2])
Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?
You can use bincount():
import numpy as np

ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]

print(np.bincount(ids, weights=data))
The output is [ 0. 50. 21. 18.], which means the sum for id==0 is 0, the sum for id==1 is 50, and so on.
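Since bincount is indexed by id value (including ids that never occur), here is a small sketch of reading the per-id sums back out (an addition, not from the original answer):

import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])

sums = np.bincount(ids, weights=data)
present = np.unique(ids)
print(dict(zip(present.tolist(), sums[present].tolist())))  # {1: 50.0, 2: 21.0, 3: 18.0}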
I noticed the numpy tag, but in case you don't mind using pandas (or if you read in these data using this module), this task becomes a one-liner:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})
So your dataframe would look like this:
id score
0 1 20
1 1 30
2 1 0
3 2 4
4 2 8
5 2 9
6 3 18
Now you can use the functions groupby() and sum():
df.groupby(['id'], sort=False).sum()
which gives you the desired output:
score
id
1 50
2 21
3 18
By default, the dataframe would be sorted, therefore I use the flag sort=False which might improve speed for huge dataframes.
You can try using boolean operations:
import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])
[((ids == i) * data).sum() for i in np.unique(ids)]
This may be a bit more effective than using np.any, but will clearly have trouble if you have a very large number of unique ids to go along with large overall size of the data table.
If you're looking only for sum, you probably want to go with bincount. If you also need other grouping operations like product, mean, std etc., have a look at https://github.com/ml31415/numpy-groupies . It offers the fastest python/numpy grouping operations around; see the speed comparison there.
Your sum operation there would look like:
res = aggregate(id, score)
The numpy_indexed package has vectorized functionality to perform this operation efficiently, in addition to many related operations of this kind:
import numpy_indexed as npi
npi.group_by(id).sum(score)
You can use a for loop and numba:
import numpy as np
from numba import njit

@njit
def wbcnt(b, w, k):
    bins = np.arange(k)
    bins = bins * 0          # integer zeros, one bin per possible id
    for i in range(len(b)):
        bins[b[i]] += w[i]
    return bins
Using @HYRY's variables:
ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
Then:
wbcnt(ids, data, 4)
array([ 0, 50, 21, 18])
Timing
%timeit wbcnt(ids, data, 4)
%timeit np.bincount(ids, weights=data)
1000000 loops, best of 3: 1.99 µs per loop
100000 loops, best of 3: 2.57 µs per loop
Maybe use itertools.groupby: you can group on the ID and then iterate over the grouped data.
(The data must be sorted by the group-by key, in this case the ID.)
>>> import itertools
>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> groups = itertools.groupby(data, lambda x: x[0])
>>> for i in groups:
...     for y in i:
...         if isinstance(y, int):
...             print(y)
...         else:
...             for p in y:
...                 print('-', p)
Output:
1
- (1, 20, 20)
- (1, 10, 30)
- (1, 15, 0)
2
- (2, 12, 4)
- (2, 3, 0)
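To complete the summing step, a minimal sketch (assuming, as above, that the rows are already sorted by ID, which groupby requires):

import itertools

data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
sums = {key: sum(row[2] for row in rows)
        for key, rows in itertools.groupby(data, key=lambda x: x[0])}
print(sums)  # {1: 50, 2: 4}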
