pandas dataframe - group artists per unique user - python

To avoid duplicates of the same user, I want to neatly organize a dictionary mapping each user to their artists, i.e. {user: [artist1, artist2, artist3, ...]}, using pandas' groupby function. Here is sample data (my instinct tells me to chain an agg func?)
...something like df.groupby('users')?
users artist
0 00001411dc427966b17297bf4d69e7e193135d89 the most serene republic
1 00001411dc427966b17297bf4d69e7e193135d89 stars
2 00001411dc427966b17297bf4d69e7e193135d89 broken social scene
3 00001411dc427966b17297bf4d69e7e193135d89 have heart
4 00001411dc427966b17297bf4d69e7e193135d89 luminous orange
5 00001411dc427966b17297bf4d69e7e193135d89 boris
6 00001411dc427966b17297bf4d69e7e193135d89 arctic monkeys
7 00001411dc427966b17297bf4d69e7e193135d89 bright eyes
8 00001411dc427966b17297bf4d69e7e193135d89 coaltar of the deepers
9 00001411dc427966b17297bf4d69e7e193135d89 polar bear club
10 00001411dc427966b17297bf4d69e7e193135d89 the libertines
11 00001411dc427966b17297bf4d69e7e193135d89 death from above 1979
12 00001411dc427966b17297bf4d69e7e193135d89 owl city
13 00001411dc427966b17297bf4d69e7e193135d89 coldplay
14 00001411dc427966b17297bf4d69e7e193135d89 okkervil river
15 00001411dc427966b17297bf4d69e7e193135d89 jim sturgess
16 00001411dc427966b17297bf4d69e7e193135d89 deerhoof
17 00001411dc427966b17297bf4d69e7e193135d89 fear before the march of flames
18 00001411dc427966b17297bf4d69e7e193135d89 breathe carolina
19 00001411dc427966b17297bf4d69e7e193135d89 mstrkrft

I believe you're looking for groupby + apply here:
df.groupby('users').artist.apply(list).to_dict()
{'00001411dc427966b17297bf4d69e7e193135d89': ['the most serene republic',
'stars',
'broken social scene',
'have heart',
'luminous orange',
'boris',
...
]
}
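Since you mentioned agg: .agg(list) produces the same mapping and matches the chaining instinct (a sketch on the same frame):
df.groupby('users')['artist'].agg(list).to_dict()
If the same artist can repeat for a user, swapping list for set deduplicates the values.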

select random pairs from remaining unique values in a list

Updated: I'm not sure I explained it well the first time.
I have a scheduling problem, or more accurately a "first come, first served" problem. A list of available assets is assigned to a set of spaces, available in pairs (think cars:parking spots, diners:tables, teams:games). I need a rough (random) simulation that chooses the first two to arrive from the available pairs, then chooses the next two from the remaining available pairs, and so on, until all spaces are filled.
I started with teams:games to cut my teeth. The first pair is easy enough. How do I then whittle the pool down to fill the next two spots from among the remaining available entities? I've tried a bunch of different things but keep coming up short. Help appreciated.
import itertools
import numpy as np
import pandas as pd
a = ['Georgia','Oregon','Florida','Texas'], ['Georgia','Oregon','Florida','Texas']
b = [(x,y) for x,y in itertools.product(*a) if x != y]
c = pd.DataFrame(b)
c.columns = ['home', 'away']
print(c)
d = c.sample(n = 2, replace = False)
print(d)
The first result is all possible combinations. But once the first slots are filled, there can be no repeats. In the example below, once Oregon and Georgia are slated in, the only remaining options to choose from are Florida:Texas or Texas:Florida. Obviously the sample function alone frequently produces duplicates. I will need this to scale up to dozens, then hundreds of entities:slots. Many thanks in advance!
home away
0 Georgia Oregon
1 Georgia Florida
2 Georgia Texas
3 Oregon Georgia
4 Oregon Florida
5 Oregon Texas
6 Florida Georgia
7 Florida Oregon
8 Florida Texas
9 Texas Georgia
10 Texas Oregon
11 Texas Florida
home away
3 Oregon Georgia
5 Oregon Texas
Not exactly sure what you are trying to do, but if you want to randomly pair your unique entities you can simply shuffle them and then place them in a two-column dataframe. I wrote this with all the US states minus one (Wyoming):
import random
import pandas as pd

states = ['Alaska','Alabama','Arkansas','Arizona','California',
'Colorado','Connecticut','District of Columbia','Delaware',
'Florida','Georgia','Hawaii','Iowa','Idaho','Illinois',
'Indiana','Kansas','Kentucky','Louisiana','Massachusetts',
'Maryland','Maine','Michigan','Minnesota','Missouri',
'Mississippi','Montana','North Carolina','North Dakota',
'Nebraska','New Hampshire','New Jersey','New Mexico',
'Nevada','New York','Ohio','Oklahoma','Oregon',
'Pennsylvania','Rhode Island','South Carolina',
'South Dakota','Tennessee','Texas','Utah','Virginia',
'Vermont','Washington','Wisconsin','West Virginia']
a = states.copy()
random.shuffle(a)  # shuffle the copy in place
c = pd.DataFrame({'home': a[::2], 'away': a[1::2]})
print(c)
# Output
home away
0 West Virginia Minnesota
1 New Hampshire Louisiana
2 Nevada Florida
3 Alabama Indiana
4 Delaware North Dakota
5 Georgia Rhode Island
6 Oregon Pennsylvania
7 New York South Dakota
8 Maryland Kansas
9 Ohio Hawaii
10 Colorado Wisconsin
11 Iowa Idaho
12 Illinois Missouri
13 Arizona Mississippi
14 Connecticut Montana
15 District of Columbia Vermont
16 Tennessee Kentucky
17 Alaska Washington
18 California Michigan
19 Arkansas New Jersey
20 Massachusetts Utah
21 Oklahoma New Mexico
22 Virginia South Carolina
23 North Carolina Maine
24 Texas Nebraska
Not sure if this is exactly what you were asking for though.
If you need to schedule all the fixtures of the season, you can check this answer --> League fixture generator in python
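If the pairs really must be drawn sequentially ("first come, first served"), here is a minimal sketch under that reading: sample one pairing at a time, then drop every remaining pairing that reuses an already-seated entity, until nothing is left.
import itertools
import pandas as pd

teams = ['Georgia', 'Oregon', 'Florida', 'Texas']
pairs = pd.DataFrame([(x, y) for x, y in itertools.product(teams, repeat=2) if x != y],
                     columns=['home', 'away'])

schedule = []
remaining = pairs
while not remaining.empty:
    pick = remaining.sample(n=1).iloc[0]  # next pair to "arrive"
    schedule.append((pick['home'], pick['away']))
    used = [pick['home'], pick['away']]
    # drop any pairing that reuses an entity that is already slotted in
    remaining = remaining[~remaining['home'].isin(used) & ~remaining['away'].isin(used)]

print(schedule)  # e.g. [('Oregon', 'Georgia'), ('Texas', 'Florida')]
With four teams this reproduces the behaviour described in the question: once Oregon and Georgia are slated in, only Florida:Texas or Texas:Florida can be drawn next.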

Pandas: efficient way to replace entire string with a substring

I have a dataframe that looks like this:
df = pd.DataFrame({
'name': ['John','William', 'Nancy', 'Susan', 'Robert', 'Lucy', 'Blake', 'Sally', 'Bruce'],
'injury': ['right hand broken', 'lacerated left foot', 'foot broken', 'right foot fractured', '', 'sprained finger', 'chest pain', 'swelling in arm', 'laceration to arms, hands, and foot']
})
name injury
0 John right hand broken
1 William lacerated left foot
2 Nancy foot broken
3 Susan right foot fractured
4 Robert
5 Lucy sprained finger
6 Blake chest pain
7 Sally swelling in arm
8 Bruce laceration to arms, hands, and foot <-- this is a weird case, since there are multiple body parts
Notably, some of the values in the injury column are blank.
I want to replace the values in the injury column with only the affected body part; in my case, that would be hand, foot, finger, chest, and arm. There are dozens more... this is a small example.
The desired dataframe would look like this:
name injury
0 John hand
1 William foot
2 Nancy foot
3 Susan foot
4 Robert
5 Lucy finger
6 Blake chest
7 Sally arm
8 Bruce arm, hand, foot
I could do something like this:
df.loc[df['injury'].str.contains('hand'), 'injury'] = 'hand'
df.loc[df['injury'].str.contains('foot'), 'injury'] = 'foot'
df.loc[df['injury'].str.contains('finger'), 'injury'] = 'finger'
df.loc[df['injury'].str.contains('chest'), 'injury'] = 'chest'
df.loc[df['injury'].str.contains('arm'), 'injury'] = 'arm'
But, this might not be the most elegant way.
Is there a more elegant way to do this? (e.g. using a dictionary)
(any advice on that last case with multiple body parts would be appreciated)
Thank you!
I think you should maintain a list of body parts and use an apply function:
body_parts = ['hand', 'foot', 'finger', 'chest', 'arm']

def test(value):
    body_text = []
    for body_part in body_parts:
        if body_part in value:
            body_text.append(body_part)
    if body_text:
        return ', '.join(body_text)
    return value

df['injury'] = df['injury'].apply(test)
Output:
name injury
0 John hand
1 William foot
2 Nancy foot
3 Susan foot
4 Robert
5 Lucy finger
6 Blake chest
7 Sally arm
8 Bruce hand, foot, arm
The standard way to get the first match of a regex on a string column is to use .str.extract(); see the pandas docs (10 minutes to pandas and Working with text data).
df['injury'].str.extract('(arm|chest|finger|foot|hand)', expand=False)
0 hand
1 foot
2 foot
3 foot
4 NaN
5 finger
6 chest
7 arm
8 arm
Name: injury, dtype: object
Note that row 4 returned NaN rather than '' (it's trivial to apply .fillna('') to the result). More importantly, row 8 only returns the first match, not all matches; you need to decide how you want to handle this. See .extractall():
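A minimal sketch of the .extractall() route, reusing the pattern above; it returns one row per match, which we group back by the original row index and join:
matches = df['injury'].str.extractall('(arm|chest|finger|foot|hand)')[0]
df['injury'] = (matches.groupby(level=0)
                       .agg(lambda s: ', '.join(s.unique()))
                       .reindex(df.index, fill_value=''))
Row 8 then comes out as 'arm, hand, foot', and rows with no match (row 4) fall back to ''.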
Another option is to split each value into words and keep only the words of interest (note that plural forms have to be listed separately here):
selected_words = ["hand", "foot", "finger", "chest", "arms", "arm", "hands"]
df["injury"] = (
    df["injury"]
    .str.replace(",", "")
    .str.split(" ", expand=False)
    .apply(lambda x: ", ".join(set(i for i in x if i in selected_words)))
)
print(df)
name injury
0 John hand
1 William foot
2 Nancy foot
3 Susan foot
4 Robert
5 Lucy finger
6 Blake chest
7 Sally arm
8 Bruce arms, foot, hands

List of top ten values in a column, no duplicates, based on another column

Imagine I have a dataset that is like so:
ID prod_title total_sales
0 619040 All Veggie Yummies 72.99
1 619041 Ball and String 18.95
2 619042 Cat Cave 28.45
3 619043 Chewie Dental 24.95
4 619044 Chomp-a Plush 60.99
5 619045 Feline Fix Mix 65.99
6 619046 Fetch Blaster 9.95
7 619047 Foozy Mouse 45.99
8 619048 Kitty Climber 35.99
9 619049 Purr Mix 32.99
10 619050 Fetch Blaster 19.90
11 619051 Purr Mix 98.97
12 619052 Cat Cave 56.90
13 619053 Purrfect Puree 54.95
14 619054 Foozy Mouse 91.98
15 619055 Reddy Beddy 21.95
16 619056 Cat Cave 85.83
17 619057 Scratchy Post 48.95
18 619058 Snack-em Fish 15.99
19 619059 Snoozer Essentails 99.95
20 619060 Scratchy Post 48.95
21 619061 Purrfect Puree 219.80
22 619062 Chewie Dental 49.90
23 619063 Reddy Beddy 65.85
24 619064 The New Bone 71.96
25 619065 Reddy Beddy 109.75
What are the top ten product titles by total dollar amount made? Display in descending order. Store in variable top_tot_sales
The answer should be something like this; though this isn't the correct answer just an example: ['Purrfect Puree' 'Ball and String ' 'Fetch Blaster ' 'Reddy Beddy' 'Chomp-a Plush' 'Foozy Mouse ' 'Kitty Climber' 'Snack-em Fish' 'Snoozer Essentails' 'Cat Cave']
I have tried groupby, nlargest, apply, unique, and combinations of .loc with groupby and idxmax, among many others. I'm just struggling to figure out how to isolate these columns and get a list of the top ten prod_title values.
I'll add the code I've tried below:
top_tot_sales = df_cleaned.loc[df_cleaned.groupby('prod_title')['total_sales'].idxmax()]
df_cleaned.nlargest(10, 'total_sales')
df_cleaned['prod_title'].drop_duplicates()
df_cleaned['prod_title'].unique()
top_tot_sales = df_cleaned.groupby(['prod_title'])['total_sales'].transform(max) == df_cleaned['total_sales']
print(top_tot_sales)
df_cleaned['prod_title'].drop_duplicates()
df_cleaned['prod_title'].unique()
top_tot_sales = df_cleaned.groupby(['prod_title'])df_cleaned.nlargest(n=10, columns=['total_sales'])
print(top_tot_sales)
top_tot_sales = df_cleaned.groupby('prod_title')['total_sales'].nlargest(n=10)
print(top_tot_sales)
top_num_sales = df_cleaned.loc[df_cleaned.groupby('prod_title')['trans_quantity'].idxmax()]
df_cleaned.sort_values('total_sales', ascending=False).drop_duplicates(subset=['prod_title']).iloc[:10]['prod_title'].values
sort_values() sorts your dataframe; make sure to set ascending=False.
drop_duplicates() gets rid of the duplicate products.
iloc[:10] selects the first 10 rows; since you already sorted, they are the top 10.
['prod_title'].values returns an array of prod_titles from the resulting dataframe.
Is this what you are looking for?
df_cleaned.groupby('prod_title').sum().sort_values('total_sales', ascending=False).index[:10].values
Result
['Purrfect Puree' 'Reddy Beddy' 'Cat Cave' 'Foozy Mouse' 'Purr Mix'
'Snoozer Essentails' 'Scratchy Post' 'Chewie Dental' 'All Veggie Yummies'
'The New Bone']
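A slightly more direct variant of the same idea, staying on the total_sales Series (a sketch, using the question's df_cleaned):
top_tot_sales = (df_cleaned.groupby('prod_title')['total_sales']
                 .sum()
                 .nlargest(10)
                 .index
                 .values)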

add a new column based on a group without grouping

I have this reproducible data set, where I need to add a column containing the source of the 'best' usage for each year.
df_in = pd.DataFrame({
'year': [ 5, 5, 5,
10, 10,
15, 15,
30, 30, 30 ],
'usage': ['farm', 'best', '',
'manual', 'best',
'best', 'city',
'random', 'best', 'farm' ],
'value': [0.825, 0.83, 0.85,
0.935, 0.96,
1.12, 1.305,
1.34, 1.34, 1.455],
'source': ['wood', 'metal', 'water',
'metal', 'water',
'wood', 'water',
'wood', 'metal', 'water' ]})
desired outcome:
print(df)
year usage value source best
0 5 farm 0.825 wood metal
1 5 best 0.830 metal metal
2 5 0.850 water metal
3 10 manual 0.935 metal water
4 10 best 0.960 water water
5 15 best 1.120 wood wood
6 15 city 1.305 water wood
7 30 random 1.340 wood metal
8 30 best 1.340 metal metal
9 30 farm 1.455 water metal
Is there a way to do that without grouping? Currently, I'm using:
grouped = df_in.groupby('usage').get_group('best')
grouped = grouped.rename(columns={'source': 'best'})
df = df_in.merge(grouped[['year','best']],how='outer', on='year')
You could just query:
df_in.merge(df_in.query('usage=="best"')[['year','source']]
.drop_duplicates('year') # you might not need/want this line if `best` is unique per year (or doesn't need to be in the output)
.rename(columns={'source':'best'}),
on='year', how='left')
Output:
year usage value source best
0 5 farm 0.825 wood metal
1 5 best 0.830 metal metal
2 5 0.850 water metal
3 10 manual 0.935 metal water
4 10 best 0.960 water water
5 15 best 1.120 wood wood
6 15 city 1.305 water wood
7 30 random 1.340 wood metal
8 30 best 1.340 metal metal
9 30 farm 1.455 water metal
Here is a way using .loc and .map():
(df_in.assign(best=df_in['year']
              .map(df_in.loc[df_in['usage'].eq('best'), ['year', 'source']]
                   .set_index('year')
                   .squeeze())))
Output:
year usage value source best
0 5 farm 0.825 wood metal
1 5 best 0.830 metal metal
2 5 0.850 water metal
3 10 manual 0.935 metal water
4 10 best 0.960 water water
5 15 best 1.120 wood wood
6 15 city 1.305 water wood
7 30 random 1.340 wood metal
8 30 best 1.340 metal metal
9 30 farm 1.455 water metal
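For comparison, the grouped route the question was trying to avoid also collapses to a single transform (a sketch: keep only the 'best' rows' sources, then broadcast the first one within each year):
df_in['best'] = (df_in['source']
                 .where(df_in['usage'].eq('best'))
                 .groupby(df_in['year'])
                 .transform('first'))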

How to get top 3 sales in data frame after using group by and sorting in python?

Recently I have been working with this data set:
import pandas as pd
data = {'Product':['Box','Bottles','Pen','Markers','Bottles','Pen','Markers','Bottles','Box','Markers','Markers','Pen'],
'State':['Alaska','California','Texas','North Carolina','California','Texas','Alaska','Texas','North Carolina','Alaska','California','Texas'],
'Sales':[14,24,31,12,13,7,9,31,18,16,18,14]}
df1=pd.DataFrame(data, columns=['Product','State','Sales'])
df1
I want to find the 3 groups that have the highest sales
grouped_df1 = df1.groupby('State')
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False))
That gives me a dataframe sorted by Sales within each State.
Now, I want to find the top 3 State that have the highest sales.
I tried to use
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False)).head(3)
# It gives me the first three rows
grouped_df1.apply(lambda x: x.sort_values(by = 'Sales', ascending=False)).max()
#It only gives me the maximum value
The expected result should be:
Texas: 31
California: 24
North Carolina: 18
Thus, how can I fix it? Sometimes a single State can hold the top 3 sales overall (Alaska, for example, might have the three highest rows), so simply sorting and taking the first three could return the same State three times rather than three different groups.
Many thanks!
You could add a new column called Sales_Max_For_State and then use drop_duplicates and nlargest:
>>> df1['Sales_Max_For_State'] = df1.groupby(['State'])['Sales'].transform(max)
>>> df1
Product State Sales Sales_Max_For_State
0 Box Alaska 14 16
1 Bottles California 24 24
2 Pen Texas 31 31
3 Markers North Carolina 12 18
4 Bottles California 13 24
5 Pen Texas 7 31
6 Markers Alaska 9 16
7 Bottles Texas 31 31
8 Box North Carolina 18 18
9 Markers Alaska 16 16
10 Markers California 18 24
11 Pen Texas 14 31
>>> df2 = df1.drop_duplicates(['Sales_Max_For_State']).nlargest(3, 'Sales_Max_For_State')[['State', 'Sales_Max_For_State']]
>>> df2
State Sales_Max_For_State
2 Texas 31
1 California 24
3 North Carolina 18
I think there are a few ways to do this:
1. df1.groupby('State').agg({'Sales': 'max'}).sort_values(by='Sales', ascending=False).iloc[:3]
2. df1.groupby('State').agg({'Sales': 'max'})['Sales'].nlargest(3)
Output:
Sales
State
Texas 31
California 24
North Carolina 18
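Both variants also collapse to a one-liner on the Sales Series (a sketch), which prints in the expected State: value form:
df1.groupby('State')['Sales'].max().nlargest(3)
# State
# Texas             31
# California        24
# North Carolina    18
# Name: Sales, dtype: int64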
