find rows that share values - python

I have a pandas dataframe that looks like this:
df = pd.DataFrame({'name': ['bob', 'tim', 'jane', 'john', 'andy'],
                   'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'],
                                ['wendys', 'kfc'], ['tacobell', 'innout']]})
-------------------------------
name | favefood
-------------------------------
bob | ['kfc', 'mcd', 'wendys']
tim | ['mcd']
jane | ['mcd', 'popeyes']
john | ['wendys', 'kfc']
andy | ['tacobell', 'innout']
For each person, I want to find out how many other people's favefood lists overlap with their own.
I.e., for each person I want to count how many other people have a non-empty intersection with them.
The resulting dataframe would look like this:
------------------------------
name | overlap
------------------------------
bob | 3
tim | 2
jane | 2
john | 1
andy | 0
The problem is that I have about 2 million rows of data. The only way I can think of doing this would be a nested for-loop - i.e. for each person, go through the entire dataframe to see what overlaps (this would be extremely inefficient). Is there any way to do this more efficiently using pandas? Thanks!

Logic behind it: s is a person-by-food indicator matrix, so s.dot(s.T) counts the foods each pair of people shares; .ne(0) flags the pairs that share anything, summing along each row counts those pairs, and subtracting 1 removes each person's match with themselves.
s = df['favefood'].explode().str.get_dummies().groupby(level=0).sum()  # .sum(level=0) in the original; removed in pandas 2.0
s.dot(s.T).ne(0).sum(axis=1) - 1
Out[84]:
0    3
1    2
2    2
3    1
4    0
dtype: int64
df['overlap'] = s.dot(s.T).ne(0).sum(axis=1) - 1
Method from sklearn
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
s = pd.DataFrame(mlb.fit_transform(df['favefood']), columns=mlb.classes_, index=df.index)
s.dot(s.T).ne(0).sum(axis=1) - 1
0    3
1    2
2    2
3    1
4    0
dtype: int64
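With ~2 million rows, the dense person-by-person matrix from s.dot(s.T) will not fit in memory. A hedged sketch of the same idea using sparse matrices (still quadratic in the worst case, so treat it as a starting point rather than a guaranteed fit):
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)
m = mlb.fit_transform(df['favefood'])        # sparse person-by-food indicator matrix
co = m @ m.T                                 # person-by-person shared-food counts
df['overlap'] = (co > 0).sum(axis=1).A1 - 1  # non-zero pairs per row, minus self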

Related

Convert multiple rows into one row with multiple columns in pyspark?

I have something like this (I've simplified the number of columns for brevity; there are about 10 other attributes):
id  name  foods    foods_eaten  color  continent
1   john  apples   2            red    Europe
1   john  oranges  3            red    Europe
2   jack  apples   1            blue   North America
I want to convert it to:
id  name  apples  oranges  color  continent
1   john  2       3        red    Europe
2   jack  1       0        blue   North America
Edit:
(1) I updated the data to show a few more of the columns.
(2) I've done
df_piv = df.groupBy(['id', 'name', 'color', 'continent', ...]).pivot('foods').avg('foods_eaten')
Is there a simpler way to do this sort of thing? As far as I can tell, I'll need to groupby almost every attribute to get my result.
Extending from what you have done so far, and leveraging collect_list:
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import collect_list
>>> data = [{'id': 1, 'name': 'john', 'foods': "apples"},
...         {'id': 1, 'name': 'john', 'foods': "oranges"},
...         {'id': 2, 'name': 'jack', 'foods': "banana"}]
>>> dataframe = spark.createDataFrame(data)
>>> dataframe.show()
+-------+---+----+
|  foods| id|name|
+-------+---+----+
| apples|  1|john|
|oranges|  1|john|
| banana|  2|jack|
+-------+---+----+
>>> grouping_cols = ["id", "name"]
>>> other_cols = [c for c in dataframe.columns if c not in grouping_cols]
>>> df = dataframe.groupBy(grouping_cols).agg(*[collect_list(c).alias(c) for c in other_cols])
>>> df.show()
+---+----+-----------------+
| id|name|            foods|
+---+----+-----------------+
|  1|john|[apples, oranges]|
|  2|jack|         [banana]|
+---+----+-----------------+
>>> df_sizes = df.select(*[F.size(col).alias(col) for col in other_cols])
>>> df_max = df_sizes.agg(*[F.max(col).alias(col) for col in other_cols])
>>> max_dict = df_max.collect()[0].asDict()
>>> df_result = df.select('id', 'name', *[df[col][i] for col in other_cols for i in range(max_dict[col])])
>>> df_result.show()
+---+----+--------+--------+
| id|name|foods[0]|foods[1]|
+---+----+--------+--------+
| 1|john| apples| oranges|
| 2|jack| banana| null|
+---+----+--------+--------+
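Separately, if the goal is the wide numeric layout from the question (one column per food with zero-filled counts), the asker's own groupBy/pivot line is close; a hedged sketch, assuming the full input columns from the question (foods_eaten, color, continent), with .sum working as well as the .avg the asker used when each food appears once per group:
df_piv = (df.groupBy('id', 'name', 'color', 'continent')
            .pivot('foods')
            .sum('foods_eaten')
            .na.fill(0))   # missing foods become 0 instead of null
df_piv.show()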

Calculating the percentage of values based on the values in other columns

I am trying to create a column that includes a percentage of values based on the values in other columns in python. For example, let's assume that we have the following dataset.
+------------------------------------+------------+--------+
| Teacher | grades | counts |
+------------------------------------+------------+--------+
| Teacher1 | 1 | 1 |
| | 2 | 2 |
| | 3 | 1 |
| Teacher2 | 2 | 1 |
| Teacher3 | 3 | 2 |
| Teacher4 | 2 | 2 |
| | 3 | 2 |
+------------------------------------+------------+--------+
As you can see, we have teachers in the first column, the grades each teacher gives (1, 2 and 3) in the second column, and the number of times the corresponding grade was given in the third column. I am trying to get the percentage of grades 1 and 2 out of the total grades given by each teacher. For instance, teacher 1 gave one grade 1, two grade 2s, and one grade 3, so the percentage of grades 1 and 2 out of the total is 75%. Teacher 2 gave only one grade 2, so the percentage is 100%. Similarly, teacher 3 gave two grade 3s, so the percentage is 0% because he/she did not give any grades 1 or 2. These percentages should be added as a new column in the dataset. Honestly, I couldn't even think of anything to try, and I didn't find anything about it when I searched here. Could you please help me create this column?
I am not sure this is the most efficient way, but I find it quite readable and easy to follow.
percents = {}  # store Teacher: percent
for t, g in df.groupby('Teacher'):  # t, g is short for teacher, group
    total = g.counts.sum()
    one_two = g.loc[g.grades.isin([1, 2])].counts.sum()  # consider only 1 & 2
    percent = (one_two / total) * 100
    # print(t, percent)
    percents[t] = [percent]
xf = pd.DataFrame(percents).T.reset_index()  # make a df from the dict
xf.columns = ['Teacher', 'percent']  # rename columns
df = df.merge(xf)  # merge with initial df
print(df)
Teacher grades counts percent
0 Teacher1 1 1 75.0
1 Teacher1 2 2 75.0
2 Teacher1 3 1 75.0
3 Teacher2 2 1 100.0
4 Teacher3 3 2 0.0
5 Teacher4 2 2 50.0
6 Teacher4 3 2 50.0
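A more vectorized alternative, as a sketch assuming the same Teacher/grades/counts columns as above, replaces the Python loop with groupby-transform:
# zero out counts for grades other than 1 and 2, then sum per teacher
one_two = df['counts'].where(df['grades'].isin([1, 2]), 0)
totals = df.groupby('Teacher')['counts'].transform('sum')
df['percent'] = one_two.groupby(df['Teacher']).transform('sum') / totals * 100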
I believe this will solve your query
y = 0
data['Percentage'] = 'None'
teachers = data['Teachers'].unique()      # one pass per teacher
pct = data.columns.get_loc('Percentage')  # positional writes avoid chained assignment
for teacher in teachers:
    x = data[data['Teachers'] == teacher]
    total = sum(x['Counts'])
    condition1 = 1 in set(x['Grades'])
    condition2 = 2 in set(x['Grades'])
    if condition1 or condition2:
        for i in range(y, y + len(x)):
            data.iloc[i, pct] = (data['Counts'].iloc[i] / total) * 100
    else:
        for i in range(y, y + len(x)):
            data.iloc[i, pct] = 0
    y = y + len(x)
Output:
Teachers Grades Counts Percentage
0 Teacher1 1 1 25
1 Teacher1 2 2 50
2 Teacher1 3 1 25
3 Teacher2 2 1 100
4 Teacher3 3 2 0
5 Teacher4 2 2 50
6 Teacher4 3 2 50
I have made use of boolean conditions to segregate the data on the basis of each teacher. Most of the code is self-explanatory. For any other clarification please feel free to leave a comment.

How to enrich dataframe by adding columns in specific condition

I have two different datasets:
users:
+-------+---------+--------+
|user_id| movie_id|timestep|
+-------+---------+--------+
| 100 | 1000 |20200728|
| 101 | 1001 |20200727|
| 101 | 1002 |20200726|
+-------+---------+--------+
movies:
+--------+---------+--------------------------+
|movie_id| title | genre |
+--------+---------+--------------------------+
| 1000 |Toy Story|Adventure|Animation|Chil..|
| 1001 | Jumanji |Adventure|Children|Fantasy|
| 1002 | Iron Man|Action|Adventure|Sci-Fi |
+--------+---------+--------------------------+
How can I get a dataset in the following format, so that I have each user's taste profile and can compare different users by a similarity score?
+-------+------+---------+---------+--------+-----+
|user_id|Action|Adventure|Animation|Children|Drama|
+-------+------+---------+---------+--------+-----+
|    100|     0|        1|        1|       1|    0|
|    101|     1|        1|        0|       1|    0|
+-------+------+---------+---------+--------+-----+
Where df is the movies dataframe and dfu is the users dataframe
- Split the 'genre' column strings into lists with pandas.Series.str.split, then use pandas.DataFrame.explode to transform each list element into its own row, replicating index values.
- pandas.merge the two dataframes on 'movie_id'.
- Use pandas.DataFrame.groupby on 'user_id' and 'genre' and aggregate by count.
- Shape the final dataframe:
  - .unstack converts the groupby result from long to wide format
  - .fillna replaces NaN with 0
  - .astype converts the numeric values from float to int
Tested in python 3.10, pandas 1.4.3
import pandas as pd

# data
movies = {'movie_id': [1000, 1001, 1002],
          'title': ['Toy Story', 'Jumanji', 'Iron Man'],
          'genre': ['Adventure|Animation|Children', 'Adventure|Children|Fantasy', 'Action|Adventure|Sci-Fi']}
users = {'user_id': [100, 101, 101],
         'movie_id': [1000, 1001, 1002],
         'timestep': [20200728, 20200727, 20200726]}

# set up dataframes
df = pd.DataFrame(movies)
dfu = pd.DataFrame(users)
# split the genre column strings at '|' to make lists
df.genre = df.genre.str.split('|')
# explode the lists in genre
df = df.explode('genre', ignore_index=True)
# merge df with dfu
dfm = pd.merge(dfu, df, on='movie_id')
# groupby, count and unstack
final = dfm.groupby(['user_id', 'genre'])['genre'].count().unstack(level=1).fillna(0).astype(int)
# display(final)
genre Action Adventure Animation Children Fantasy Sci-Fi
user_id
100 0 1 1 1 0 0
101 1 2 0 1 1 1
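Note that the asker's desired table holds 0/1 indicators, while final holds counts (user 101 has Adventure = 2 above). If binary flags are wanted, a one-line follow-up on the frame above:
final = final.gt(0).astype(int)  # any positive count becomes 1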

Searching through data base for partial and full match integers

I'm trying to search through a dataframe with a column that can have one or more integer values, to match one or more given integers.
The integers in the database have a '-' in between. For example:
| Customer 1 | 1124           |
| Customer 2 | 1124-1123      |
| Customer 3 | 1124-1234-1642 |
| Customer 3 | 1213-1234-1642 |
The objective here is to do a partial and full match, and to find out how many integers didn't match.
So for example, if I search for all customers with 1124, the output would look like this (going off the example I provided):
| Customer 1 | 1124           | None |
| Customer 2 | 1124-1123      | 1    |
| Customer 3 | 1124-1234-1642 | 2    |
Thanks ahead of time!
Use set
- define x as the test set
- make s a series of sets
- s - x creates a series of set differences
- (s - x).str.len() gives the sizes of those differences
- s & x is a series of intersections; a non-empty set is truthy, so it can act as a row mask (cast it to bool explicitly for newer pandas)
x = {'1124'}
s = df['col2'].str.split('-').apply(set)
df.assign(col3=(s - x).str.len())[(s & x).astype(bool)]
col1 col2 col3
0 Customer 1 1124 0
1 Customer 2 1124-1123 1
2 Customer 3 1124-1234-1642 2
Setup
df = pd.DataFrame({
    'col1': ['Customer 1', 'Customer 2', 'Customer 3', 'Customer 3'],
    'col2': ['1124', '1124-1123', '1124-1234-1642', '1213-1234-1642']
})
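An alternative sketch using str.get_dummies (assuming the df from Setup): since exactly one code matches the query, the number of non-matching codes is the row's total code count minus 1.
d = df['col2'].str.get_dummies(sep='-')  # one indicator column per code
mask = d['1124'].astype(bool)            # rows containing the query code
df[mask].assign(col3=d.sum(axis=1)[mask] - 1)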

Pandas Pivot table, how to put a series of columns in the values attribute

First of all, I apologize! It's my first time using stack overflow so I hope I'm doing it right! I searched but can't find what I'm looking for.
I'm also quite new with pandas and python :)
I am going to try to use an example and I will try to be clear.
I have a dataframe with 30 columns that contains information about a shopping cart; one of the columns (order) has 2 values, either completed or in progress.
And I have like 20 columns with items, let's say apple, orange, bananas... I need to know how many times there is an apple in a completed order and how many in an in-progress order. I decided to use a pivot table with the aggregate function count.
This would be a small example of the dataframe:
Order | apple | orange | banana | pear | pineapple | ... |
-----------|-------|--------|--------|------|-----------|------|
completed | 2 | 4 | 10 | 5 | 1 | |
completed | 5 | 4 | 5 | 8 | 3 | |
iProgress | 3 | 7 | 6 | 5 | 2 | |
completed | 6 | 3 | 1 | 7 | 1 | |
iProgress | 10 | 2 | 2 | 2 | 2 | |
completed | 2 | 1 | 4 | 8 | 1 | |
I have the output I want but what I'm looking for is a more elegant way of selecting lots of columns without having to type them manually.
df.pivot_table(index=['Order'],
               values=['apple', 'bananas', 'orange', 'pear', 'strawberry', 'mango'],
               aggfunc='count')
But I want to select around 15 columns, so instead of typing them one by one, I'm sure there is an easy way of doing it using column numbers or something. Let's say I want to select columns 6 through 15.
I have tried things like values=[df.columns[6:15]], and I have also tried df.iloc, but as I said, I'm pretty new, so I'm probably using them wrong or making silly mistakes!
Is there also a way to keep the columns in the order they appear? In my output they seem to have been sorted alphabetically, but I want to keep the original column order, so it should be apple, orange, banana...
Order Completed In progress
apple 92 221
banana 102 144
mango 70 55
I'm just looking for a way of improving my code and I hope I have not made much mess. Thank you!
I think you can use:
# if you only need a few columns, select them with df.columns[1:3]
df = df.pivot_table(columns=['Order'], values=df.columns[1:3], aggfunc='count')
print (df)
Order completed iProgress
apple 4 2
orange 4 2
# if you need all columns, the values parameter can be omitted
df = df.pivot_table(columns=['Order'], aggfunc='count')
print (df)
Order completed iProgress
apple 4 2
banana 4 2
orange 4 2
pear 4 2
pineapple 4 2
See: What is the difference between size and count in pandas?
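In short, size includes NaN values while count excludes them; a minimal sketch:
s = pd.Series([1, None, 3])
s.size      # 3 - every row, NaN included
s.count()   # 2 - non-NaN values only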
df = df.pivot_table(columns=['Order'], aggfunc=len)
print (df)
Order completed iProgress
apple 4 2
banana 4 2
orange 4 2
pear 4 2
pineapple 4 2
#solution with groupby and transpose
df = df.groupby('Order').count().T
print (df)
Order completed iProgress
apple 4 2
orange 4 2
banana 4 2
pear 4 2
pineapple 4 2
Your example doesn't include an item that is not in the cart. I'm assuming it would come up as None or 0. If that's correct, then I fill na values and count how many are greater than 0:
df.set_index('Order').fillna(0).gt(0).groupby(level='Order').sum().T
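On the asker's follow-up about column order: pivot_table sorts the resulting rows alphabetically, so one hedged option (a sketch assuming the question's df) is to reindex against the original column order afterwards:
out = df.pivot_table(columns='Order', aggfunc='count')
out = out.reindex(df.columns.drop('Order'))  # restore apple, orange, banana, ... order
print(out)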
