Using Python 3
I have a dataframe sort of like this:
productCode productType storeCode salesAmount moreInfo
111 1 111 111 info
111 1 112 112 info
456 4 456 456 info
and so on for thousands of rows
I want to select the top X best-selling unique products for each store, and end up with a list of their codes.
How would I accomplish that?
Data:
df = pd.DataFrame({'productCode': [111,111,456,123,125],
'productType' : [1,1,4,3,3],
'storeCode' : [111,112,112,456,456],
'salesAmount' : [111,112,34,456,1235]})
productCode productType storeCode salesAmount
0 111 1 111 111
1 111 1 112 112
2 456 4 112 34
3 123 3 456 456
4 125 3 456 1235
It sounds like you want the best selling product at each storeCode? In which case:
df.sort_values('salesAmount', ascending=False).groupby('storeCode').head(1)
productCode productType storeCode salesAmount
4 125 3 456 1235
1 111 1 112 112
0 111 1 111 111
Instead, if you want the best selling of each productType at each storeCode, then:
df.sort_values('salesAmount', ascending=False).groupby(['storeCode', 'productType']).head(1)
productCode productType storeCode salesAmount
4 125 3 456 1235
1 111 1 112 112
0 111 1 111 111
2 456 4 112 34
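To generalize to the top X products per store and collect the codes, here is a minimal sketch (assuming top_n stands in for the X from the question, and summing sales first in case a product appears in several rows for the same store):
top_n = 2  # the X from the question (assumed)
# total sales per product per store, in case a product has several rows
totals = df.groupby(['storeCode', 'productCode'], as_index=False)['salesAmount'].sum()
# top X products per store
top = (totals.sort_values('salesAmount', ascending=False)
             .groupby('storeCode')
             .head(top_n))
# one list of product codes per store
codes = top.groupby('storeCode')['productCode'].apply(list).to_dict()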
I have a pandas DataFrame where I try to find the first ID for which left is less than the values of
lst = [0, 50, 100, 150, 200, 250, 500, 1000]
ID ST ... csum left
0 0 AK ... 4.293174e+05 760964.996900
1 1 AK ... 4.722491e+06 760535.679500
2 2 AK ... 8.586347e+06 760149.293900
3 3 AK ... 2.683233e+07 758324.695200
4 4 AK ... 2.962290e+07 758045.638900
.. ... ... ... ... ...
111 111 AK ... 7.609006e+09 107.329336
112 112 AK ... 7.609221e+09 85.863469
113 113 AK ... 7.609435e+09 64.397602
114 114 AK ... 7.609650e+09 42.931735
115 115 AK ... 7.610079e+09 0.000000
So I would end up with a list or dataframe looking like
threshold ID
0 115
50 114
100 112
150 100
200 100
250 99
500 78
1000 77
How can I achieve this?
If you want to match each target against the closest left value less than or equal to it, use a merge_asof:
lst = [0,50,100,150,200,250,500,1000]
pd.merge_asof(pd.Series(lst, name='threshold', dtype=df['left'].dtype),
df.sort_values(by='left').rename(columns={'left': 'threshold'})[['threshold', 'ID']],
              # uncomment to require a strictly smaller match
              #allow_exact_matches=False,
)
Output:
threshold ID
0 0.0 115
1 50.0 114
2 100.0 112
3 150.0 111 # due to truncated input
4 200.0 111 #
5 250.0 111 #
6 500.0 111 #
7 1000.0 111 #
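A vectorized alternative under the same assumption (each threshold matched to the largest left value that does not exceed it) is numpy.searchsorted on the ascending left values; this is a sketch, not part of the original answer:
import numpy as np
asc = df.sort_values('left')
# index of the last left value <= each threshold
pos = np.searchsorted(asc['left'].to_numpy(), lst, side='right') - 1
# note: a threshold below the smallest left would give index -1 (wraps around);
# here all thresholds are >= 0, the minimum of left
result = pd.DataFrame({'threshold': lst, 'ID': asc['ID'].to_numpy()[pos]})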
Another option is to map each threshold through a query (note that pandas query references local variables with @, and this assumes df is sorted by left in descending order, as shown):
lst = [0, 50, 100, 150, 200, 250, 500, 1000]
df11 = pd.DataFrame(dict(threshold=lst))
df11.assign(ID=df11.threshold.map(lambda x: df.query("left <= @x").iloc[0, 0]))
Output:
threshold ID
0 0.0 115
1 50.0 114
2 100.0 112
3 150.0 111 # due to truncated input
4 200.0 111 #
5 250.0 111 #
6 500.0 111 #
7 1000.0 111 #
Let's say I have a number A that calls several people B:
A B
123 987
123 987
123 124
435 567
435 789
653 876
653 876
999 654
999 654
999 654
999 123
I want to find whom each person in A called the most, along with the number of times.
OUTPUT:
A B Count
123 987 2
435 567 or 789 1
653 876 2
999 654 3
One way to think of it is:
A B
123 987 2
124 1
435 567 1
789 1
653 876 2
999 654 3
123 1
Can somebody help me out on how to do this?
Try this
# count occurrences of each (A, B) pair
df.value_counts(['A','B']).sort_index()
A B
123 124 1
987 2
435 567 1
789 1
653 876 2
999 123 1
654 3
dtype: int64
To get the highest values for each unique A:
v = df.value_counts(['A','B'])
# keep only the first (highest-count) row per A
v[~v.reset_index(level=0).duplicated('A').values]
A B
999 654 3
123 987 2
653 876 2
435 567 1
dtype: int64
Use SeriesGroupBy.value_counts, which sorts the counts in descending order by default, then take the first row per A with GroupBy.head:
df = df.groupby('A')['B'].value_counts().groupby(level=0).head(1).reset_index(name='Count')
print (df)
A B Count
0 123 987 2
1 435 567 1
2 653 876 2
3 999 654 3
Another idea:
df = df.value_counts(['A','B']).reset_index(name='Count').drop_duplicates('A')
print (df)
A B Count
0 999 654 3
1 123 987 2
2 653 876 2
4 435 567 1
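If you want to keep all tied maxima instead of an arbitrary one (the question's expected output shows "567 or 789" for A = 435), a small sketch that compares each count to its group maximum:
counts = df.value_counts(['A', 'B']).reset_index(name='Count')
# keep every (A, B) pair whose count equals the maximum count for that A
out = counts[counts['Count'] == counts.groupby('A')['Count'].transform('max')]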
I have a sorted data frame as shown below (Input DataFrame) and I need to iterate the rows and select & retrieve rows into an output data frame based on the conditions below.
• Condition 1: For a given R1, R2, W - if we have two records with TYPE 'A' and 'B':
a) If (amount1 & amount2) of TYPE 'A' is > (amount1 & amount2) of TYPE 'B', we need to bring the TYPE 'A' record into the output
b) If (amount1 & amount2) of TYPE 'B' is > (amount1 & amount2) of TYPE 'A', we need to bring the TYPE 'B' record into the output
c) If (amount1 & amount2) of TYPE 'A' is = (amount1 & amount2) of TYPE 'B', we need to bring the TYPE 'A' record into the output
• Condition 2: For a given R1, R2, W - if we have only a record with TYPE 'A', we need to bring the TYPE 'A' record into the output
• Condition 3: For a given R1, R2, W - if we have only a record with TYPE 'B', we need to bring the TYPE 'B' record into the output
Input Dataframe
R1 R2 W TYPE amount1 amount2
0 123 12 1 A 111 222
1 123 12 1 B 111 222
2 123 12 2 A 222 222
3 123 12 2 B 333 333
4 123 12 3 A 444 444
5 123 12 3 B 333 333
6 123 34 1 A 111 222
7 123 34 2 A 333 444
8 123 34 2 B 333 444
9 123 34 3 B 444 555
10 123 34 4 A 555 666
11 123 34 4 B 666 777
Output dataframe
R1 R2 W TYPE amount1 amount2
0 123 12 1 A 111 222
3 123 12 2 B 333 333
4 123 12 3 A 444 444
6 123 34 1 A 111 222
7 123 34 2 A 333 444
9 123 34 3 B 444 555
11 123 34 4 B 666 777
Selection based on your criteria:
def my_selection(idf):
    # if both 'A' and 'B' are present in 'TYPE', keep the 'A' row
    if idf['TYPE'].unique().shape[0] == 2:
        return idf[idf['TYPE'] == 'A']
    else:
        return idf

df2 = df.groupby(['R1', 'R2', 'W'], as_index=False).apply(my_selection)
df2.index = df2.index.droplevel(-1)
# R1 R2 W TYPE amount1 amount2
# 0 123 12 1 A 111 222
# 1 123 12 2 A 333 444
# 2 123 12 3 A 555 666
# 3 123 34 1 A 111 222
# 4 123 34 2 A 222 333
# 5 123 34 3 B 444 555
# 6 123 34 4 A 555 666
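The function above always prefers 'A' whenever both types exist; a sketch of the full amount comparison (assuming 'B' should win only when both of its amounts are strictly greater, with ties going to 'A' per condition 1c):
def pick(g):
    # both TYPEs present: compare amounts, ties go to 'A'
    if g['TYPE'].nunique() == 2:
        a = g.loc[g['TYPE'] == 'A'].iloc[0]
        b = g.loc[g['TYPE'] == 'B'].iloc[0]
        winner = 'B' if (b['amount1'] > a['amount1']) and (b['amount2'] > a['amount2']) else 'A'
        return g[g['TYPE'] == winner]
    # only one TYPE present: keep it
    return g

out = df.groupby(['R1', 'R2', 'W'], group_keys=False).apply(pick)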
All you have to do is groupby R1, R2, W and operate on the TYPE column as follows:
data.groupby(['R1','R2','W']).apply(lambda x: 'A' if 'A' in x['TYPE'].values else 'B').reset_index()
You can merge this output with the original DataFrame on the obtained columns to get the corresponding 'amount1' and 'amount2' values, as sketched below.
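A sketch of that merge step (assuming the frame is named data as above, and naming the grouped result winners):
winners = (data.groupby(['R1', 'R2', 'W'])
               .apply(lambda x: 'A' if 'A' in x['TYPE'].values else 'B')
               .reset_index(name='TYPE'))
out = winners.merge(data, on=['R1', 'R2', 'W', 'TYPE'])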
This is what I would do:
categories = ['B','A'] #create a list of categories in ascending order of precedence
d={i:e for e,i in enumerate(categories)} #create a dictionary:{'A': 0, 'B': 1}
s=df['TYPE'].map(d) #map to df['TYPE'] and create a helper series
Then assign this series to the dataframe, groupby and transform with 'max', and keep the rows where the helper series equals the transformed maximum:
out = df[s.eq(df.assign(TYPE=s).groupby(['R1','R2','W'])['TYPE'].transform('max'))]
print(out)
R1 R2 W TYPE amount1 amount2
0 123 12 1 A 111 222
2 123 12 2 A 333 444
4 123 12 3 A 555 666
6 123 34 1 A 111 222
7 123 34 2 A 222 333
9 123 34 3 B 444 555
10 123 34 4 A 555 666
I have been looking for a way to find the first occurrence in a series of rows based on a group.
First I went through and applied a 'group' counter to each group. Then I want to return the ID of the first occurrence of 'sold' under status as a new column and apply it to the whole group.
Example below. Final_ID is the new column to be created.
group ID status Final_ID
1 100 view 103
1 101 show 103
1 102 offer 103
1 103 sold 103
1 104 view 103
2 105 view 106
2 106 sold 106
2 107 sold 106
3 108 pending 109
3 109 sold 109
3 110 view 109
4 111 sold 111
4 112 sold 111
4 113 sold 111
4 114 sold 111
I have tried using
df = pd.DataFrame ({'group':['1','1','1','1','1','2','2','2','3','3','3','4','4','4','4'],
'ID':['100','101','102','103','104','105','106','107','108','109','110','111','112','113','114'],
'status':['view','show','offer','sold','view','view','sold','sold','pending','sold','view','sold','sold','sold','sold']
})
df2 = df[df.status=='sold'][['group','ID']].groupby('group')['ID'].apply(min).reset_index()
df2=df.merge(df2, on='group' , how='left')
but I am not sure that is the proper way to go about it. Any other thoughts?
Mask your ID series wherever status is not sold, then group by your groups and transform with first, which picks the first non-NaN value in each group, i.e. the first occurrence of sold:
df['ID'].mask(df['status'] != 'sold').groupby(df['group']).transform('first').astype(int)
0 103
1 103
2 103
3 103
4 103
5 106
6 106
7 106
8 109
9 109
10 109
11 111
12 111
13 111
14 111
Name: Final_ID, dtype: int32
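To attach this as the Final_ID column from the question, the same expression can simply be assigned back (a small usage sketch):
df['Final_ID'] = (df['ID'].mask(df['status'] != 'sold')
                          .groupby(df['group'])
                          .transform('first')
                          .astype(int))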
Assuming the ID column is already sorted, you can do:
(
df.set_index('group')
.assign(Final_ID=df.loc[df.status=='sold'].groupby(by='group').ID.first())
.reset_index()
)
group ID status Final_ID
0 1 100 view 103
1 1 101 show 103
2 1 102 offer 103
3 1 103 sold 103
4 1 104 view 103
5 2 105 view 106
6 2 106 sold 106
7 2 107 sold 106
8 3 108 pending 109
9 3 109 sold 109
10 3 110 view 109
11 4 111 sold 111
12 4 112 sold 111
13 4 113 sold 111
14 4 114 sold 111
You need to look at the sold rows, drop the status column, group by group (not by ID), and take the min:
df.merge(df.loc[df.status=='sold'].drop(columns='status').groupby(['group'], as_index=False).min()
         .rename(columns={'ID': 'Final_ID'}))
Output:
group ID status Final_ID
0 1 100 view 103
1 1 101 show 103
2 1 102 offer 103
3 1 103 sold 103
4 1 104 view 103
5 2 105 view 106
6 2 106 sold 106
7 2 107 sold 106
8 3 108 pending 109
9 3 109 sold 109
10 3 110 view 109
11 4 111 sold 111
12 4 112 sold 111
13 4 113 sold 111
14 4 114 sold 111
I have a dataframe:
ID url
111 vk.com
111 facebook.com
111 twitter.com
111 avito.ru
111 apple.com
111 tiffany.com
111 pikabu.ru
111 stackoverflow.com
222 vk.com
222 facebook.com
222 vc.ru
222 twitter.com
I need to add a new column part: group the dataframe by ID, then divide each group into 4 parts.
Desired output:
ID url part
111 vk.com 1
111 facebook.com 1
111 twitter.com 2
111 avito.ru 2
111 apple.com 3
111 tiffany.com 3
111 pikabu.ru 4
111 stackoverflow.com 4
222 vk.com 1
222 facebook.com 2
222 vc.ru 3
222 twitter.com 4
I tried
df.groupby(['ID']).agg({'ID': np.sum / 4}).rename(columns={'ID': 'part'}).reset_index()
But I don't get the desired output with it.
You can use groupby with numpy.repeat:
df['part'] = (df.groupby('ID')['ID']
                .apply(lambda x: pd.Series(np.repeat(np.arange(1, 5), len(x.index) // 4)))
                .reset_index(drop=True))
print (df)
ID url part
0 111 vk.com 1
1 111 facebook.com 1
2 111 twitter.com 2
3 111 avito.ru 2
4 111 apple.com 3
5 111 tiffany.com 3
6 111 pikabu.ru 4
7 111 stackoverflow.com 4
8 222 vk.com 1
9 222 facebook.com 2
10 222 vc.ru 3
11 222 twitter.com 4
Another solution with a custom function:
def f(x):
    x['part'] = np.repeat(np.arange(1, 5), len(x.index) // 4)
    return x
df = df.groupby('ID').apply(f)
print (df)
ID url part
0 111 vk.com 1
1 111 facebook.com 1
2 111 twitter.com 2
3 111 avito.ru 2
4 111 apple.com 3
5 111 tiffany.com 3
6 111 pikabu.ru 4
7 111 stackoverflow.com 4
8 222 vk.com 1
9 222 facebook.com 2
10 222 vc.ru 3
11 222 twitter.com 4
If the group sizes are not divisible by 4, this raises an error:
ValueError: Length of values does not match length of index
One possible solution is to pad each group so its length is divisible by 4, then remove the padding rows with dropna:
print (df)
ID url
0 111 vk.com
1 111 avito.ru
2 111 apple.com
3 111 tiffany.com
4 111 pikabu.ru
5 222 vk.com
6 222 facebook.com
7 222 twitter.com
def f(x):
    a = len(x.index) % 4
    if a != 0:
        # pad with empty rows so the group length is divisible by 4
        x = pd.concat([x, pd.DataFrame(index=np.arange(4 - a))])
    x['part'] = np.repeat(np.arange(1, 5), len(x.index) // 4)
    return x
df = df.groupby('ID').apply(f).dropna(subset=['ID']).reset_index(drop=True)
#if necessary convert to int
df.ID = df.ID.astype(int)
print (df)
ID url part
0 111 vk.com 1
1 111 avito.ru 1
2 111 apple.com 2
3 111 tiffany.com 2
4 111 pikabu.ru 3
5 222 vk.com 1
6 222 facebook.com 2
7 222 twitter.com 3
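An alternative that avoids the padding entirely is to compute the part directly from each row's position within its group; this is a sketch using cumcount and the group size, not part of the original answers:
# size of each row's ID group
sizes = df.groupby('ID')['ID'].transform('size')
# position within the group, scaled into 4 buckets
df['part'] = df.groupby('ID').cumcount() * 4 // sizes + 1
For groups whose size is not a multiple of 4, this still yields parts 1 through 4, with part sizes as even as integer division allows.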