How to group by and aggregate on multiple columns in pandas - python

I have the following dataframe in pandas:
ID  Balance  ATM_drawings  Value
1   100      50            345
1   150      33            233
2   100      100           333
2   100      100           234
I want the data in the following format:
ID  Balance_mean  Balance_sum  ATM_drawings_mean  ATM_drawings_sum
1   125           250          41.5               83
2   100           200          100                200
I am using the following command to do it in pandas:
df1 = df[['Balance','ATM_drawings']].groupby('ID', as_index=False).agg(['mean', 'sum']).reset_index()
But it does not give what I intended to get.

You can use a dictionary to specify aggregation functions for each series:
d = {'Balance': ['mean', 'sum'], 'ATM_drawings': ['mean', 'sum']}
res = df.groupby('ID').agg(d)
# flatten MultiIndex columns
res.columns = ['_'.join(col) for col in res.columns.values]
print(res)
    Balance_mean  Balance_sum  ATM_drawings_mean  ATM_drawings_sum
ID
1            125          250               41.5                83
2            100          200              100.0               200
Or you can define d via dict.fromkeys:
d = dict.fromkeys(('Balance', 'ATM_drawings'), ['mean', 'sum'])
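On pandas 0.25+ you can also get the flat column names directly with named aggregation, which skips the MultiIndex flattening step entirely (a minimal sketch of the same result):
res = df.groupby('ID', as_index=False).agg(
    Balance_mean=('Balance', 'mean'),
    Balance_sum=('Balance', 'sum'),
    ATM_drawings_mean=('ATM_drawings', 'mean'),
    ATM_drawings_sum=('ATM_drawings', 'sum'),
)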

Not sure how to achieve this using agg, but you could reuse the `groupby` object to avoid having to do the operation multiple times, and then use transformations:
import pandas as pd
df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Balance": [100, 150, 100, 100],
    "ATM_drawings": [50, 33, 100, 100],
    "Value": [345, 233, 333, 234],
})
gb = df.groupby("ID")
df["Balance_mean"] = gb["Balance"].transform("mean")
df["Balance_sum"] = gb["Balance"].transform("sum")
df["ATM_drawings_mean"] = gb["ATM_drawings"].transform("mean")
df["ATM_drawings_sum"] = gb["ATM_drawings"].transform("sum")
print(df)
Which yields:
   ID  Balance  Balance_mean  Balance_sum  ATM_drawings  ATM_drawings_mean  ATM_drawings_sum  Value
0   1      100           125          250            50               41.5                83    345
1   1      150           125          250            33               41.5                83    233
2   2      100           100          200           100              100.0               200    333
3   2      100           100          200           100              100.0               200    234
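If a single summary row per ID is wanted instead of the broadcast per-row columns, a small follow-up sketch (assuming the frame built above) is to select the new columns and deduplicate:
summary = df[["ID", "Balance_mean", "Balance_sum",
              "ATM_drawings_mean", "ATM_drawings_sum"]].drop_duplicates("ID")
print(summary)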

Related

group list of dataframes to another dataframe row by row

Frames is a list of dataframes with the following order and dimensions:
[ participant activity t phone_accel_x phone_accel_y \
0 1600 D 241598773279024 0.565436 1.049568
1 1600 D 241598823633028 0.502723 1.029012
2 1600 D 241598873987032 0.470794 1.002914
3 1600 D 241598924341036 0.490821 1.003417
4 1600 D 241598974695040 0.487980 1.033217
.. ... ... ... ... ...
195 1600 D 241608592309883 0.677391 0.918443
196 1600 D 241608642663887 0.673493 0.913030
197 1600 D 241608693017891 0.674655 0.913004
198 1600 D 241608743371894 0.679319 0.914433
199 1600 D 241608793725898 0.676576 0.913901
phone_accel_z phone_gyro_x phone_gyro_y phone_gyro_z
0 1.248711 -0.017212 -0.006581 -0.080116
1 1.197390 -0.121311 0.050491 -0.109368
2 1.224439 -0.324749 -0.007777 -0.148947
3 1.234429 -0.290535 -0.105310 -0.151757
4 1.223829 -0.100016 -0.093174 -0.112706
.. ... ... ... ...
195 1.250941 -0.008502 0.028063 0.019072
196 1.260808 -0.004811 0.027223 0.024403
197 1.266306 0.000024 0.022763 0.023875
198 1.258972 0.003954 0.012599 0.021185
199 1.259517 -0.006841 0.007218 0.012923
[200 rows x 9 columns],
participant activity t phone_accel_x phone_accel_y \
50 1600 D 241601290979316 0.667534 0.907823
51 1600 D 241601341333320 0.659705 0.917594
52 1600 D 241601391687324 0.650291 0.908096
53 1600 D 241601442041328 0.641641 0.901728
54 1600 D 241601492395332 0.659827 0.899954
.. ... ... ... ... ...
245 1600 D 241611110023497 0.673400 0.913214
246 1600 D 241611160377501 0.677467 0.912210
247 1600 D 241611210731505 0.681255 0.905807
248 1600 D 241611261085509 0.670614 0.904358
249 1600 D 241611311439513 0.668775 0.909658
phone_accel_z phone_gyro_x phone_gyro_y phone_gyro_z
50 1.277606 -0.031145 -0.012867 -0.057229
51 1.272129 -0.039413 0.005489 -0.044188
52 1.290153 -0.056169 0.004972 -0.065202
53 1.274855 -0.044967 -0.010766 -0.078963
54 1.290040 -0.046148 -0.010928 -0.075745
.. ... ... ... ...
245 1.246544 -0.006509 0.009480 0.009705
246 1.250491 -0.012193 0.010935 0.008721
247 1.256942 -0.006915 0.017657 0.008312
248 1.264303 -0.007985 0.019806 0.001612
249 1.265652 0.002644 0.007558 0.004734
[200 rows x 9 columns], etc
All the dataframes have the same dimensions, 200 rows x 9 columns, and len(frames) is 91999. I want to create a new dataframe that contains the values of all 200 rows of every dataframe in one row, but only for the columns phone_accel_x, phone_accel_y, phone_accel_z, phone_gyro_x, phone_gyro_y, phone_gyro_z and activity. The values of each dataframe will be added as a new row, so the new dataframe will have dimensions 91999 rows x 1201 columns (200 x 6 + 1).
sensors_frames = []
for i in range(0, len(frames)):
    t = frames[i][['phone_accel_x', 'phone_accel_y', 'phone_accel_z',
                   'phone_gyro_x', 'phone_gyro_y', 'phone_gyro_z', 'activity']].values
    sensors_frames.append(t)
I am trying something like this, but I am having difficulties stacking the values of each column into a single row and continuing on a new row for the next dataframe. The list sensors_frames will be converted to a dataframe afterwards.
Any ideas on how to make it happen with the pandas library?
Thanks in advance.
Here is a pandas-based solution - I timed it on my machine and it took around 2 minutes to run.
import pandas as pd
# simulate input data
df = pd.DataFrame(
    {
        "participant": [1600] * 200,
        "activity": ["D"] * 200,
        "t": [241598773279024] * 200,
        "phone_accel_x": [1.049568] * 200,
        "phone_accel_y": [1.049568] * 200,
        "phone_accel_z": [1.248711] * 200,
        "phone_gyro_x": [-0.017212] * 200,
        "phone_gyro_y": [-0.006581] * 200,
        "phone_gyro_z": [-0.080116] * 200,
    }
)
columns = [
    "phone_accel_x",
    "phone_accel_y",
    "phone_accel_z",
    "phone_gyro_x",
    "phone_gyro_y",
    "phone_gyro_z",
]
frames = [df] * 91_999
# suggested solution
df_frames = pd.concat(frames, axis=0, ignore_index=True)
df_frames["step"] = df_frames.index // 200
reshaped = df_frames.groupby(["step", "activity"]).apply(
    lambda grp: pd.DataFrame(
        grp[columns].values.reshape(1, -1),
        columns=[f"{col}_{i}" for i in range(200) for col in columns],
    )
).reset_index()
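A possibly faster alternative (a sketch under the same assumptions: every frame has exactly 200 rows and a single activity value) is to flatten each frame's values with NumPy and build the wide frame in one pass:
import numpy as np

data = np.stack([f[columns].to_numpy().reshape(-1) for f in frames])
wide = pd.DataFrame(data, columns=[f"{col}_{i}" for i in range(200) for col in columns])
wide["activity"] = [f["activity"].iloc[0] for f in frames]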

merge dataframes and add price data for each instance of an item ID

I am trying to merge two dataframes so that each instance of an item ID in DF3 displays the pricing data associated with the matching ID from DF1.
DF3 (what I am trying to accomplish):
recipeID  itemID_out  qty_out  buy_price  sell_price  buy_quantity  sell_quantity  id_1_in  qty_id1  buy_price  sell_price  buy_quantity  sell_quantity  id_2_in  qty_id2  buy_price  sell_price  buy_quantity  sell_quantity  id_3_in  qty_id3  buy_price  sell_price  buy_quantity  sell_quantity  id_4_in  qty_id4  buy_price  sell_price  buy_quantity  sell_quantity  id_5_in  qty_id5  buy_price  sell_price  buy_quantity  sell_quantity
1  1986  1  129  167  67267  21637  123  1  10  15  1500  3000  124  1  12  14  550  800  125  1  8  12  124  254  126  1  22  25  1251  890  127  1  64  72  12783  1251515
2  1987  1  1521  1675  654  1245  123  2  10  15  1500  3000
3  1988  1  128376  131429  47  23  123  10  10  15  1500  3000  124  3  12  14  550  800
These are the two dataframes I am trying to merge from.
DF1: Contains 26863 rows; master list of item names, IDs, and price data. Pulled from API, new items can be added and will appear as new rows after an update request from the user.
itemID  name  buy_price  sell_price  buy_quantity  sell_quantity
1986    XYZ   129        167         67267         21637
123     ABC   10         15          1500          3000
124     DEF   12         14          550           800
DF2 (contains 12784 rows; recipes that combine from items in the master list. Pulled from API, new recipes can be added and will appear as new rows after an update request from the user.)
recipeID  itemID_out  qty_out  id_1_in  qty_id1  id_2_in  qty_id2  id_3_in  qty_id3  id_4_in  qty_id4  id_5_in  qty_id5
1         1986        1        123      1        124      1        125      1        126      1        127      1
2         1987        1        123      2
3         1988        1        123      10       124      3
Recipes can contain a combination of 1 to 5 items (null values occur) that consist of IDs from DF1 and/or the itemID_out column in DF2.
The "id_#_in" columns in DF2 can contain item IDs from the "itemID_out" column, due to that recipe using the item that is being output from another recipe.
I have tried to merge it using:
pd.merge(itemlist_modified, recipelist_modified, left_on='itemID', right_on='itemID_out')
But this only ever results in a single column of IDs receiving the pricing data as intended.
I feel like I'm trying to use the wrong function for this, any help would be very much appreciated!
Thanks in advance!
Not a pretty approach, but it first melts the ingredient table into long form and then merges it onto the itemlist table:
import pandas as pd
import numpy as np
itemlist_modified = pd.DataFrame({
    'itemID': [1986, 123, 124],
    'name': ['XYZ', 'ABC', 'DEF'],
    'buy_price': [129, 10, 12],
    'sell_price': [167, 15, 14],
    'buy_quantity': [67267, 1500, 550],
    'sell_quantity': [21637, 3000, 800],
})
recipelist_modified = pd.DataFrame({
    'RecipeID': [1, 2, 3],
    'itemID_out': [1986, 1987, 1988],
    'qty_out': [1, 1, 1],
    'id_1_in': [123, 123, 123],
    'qty_id1': [1, 2, 10],
    'id_2_in': [124.0, np.nan, 124.0],
    'qty_id2': [1.0, np.nan, 3.0],
    'id_3_in': [125.0, np.nan, np.nan],
    'qty_id3': [1.0, np.nan, np.nan],
    'id_4_in': [126.0, np.nan, np.nan],
    'qty_id4': [1.0, np.nan, np.nan],
    'id_5_in': [127.0, np.nan, np.nan],
    'qty_id5': [1.0, np.nan, np.nan],
})
# columns which are not qty or input id cols
id_vars = ['RecipeID', 'itemID_out', 'qty_out']
# prepare dict to map column name to ingredient number
col_renames = {}
col_renames.update({'id_{}_in'.format(i+1): 'ingr_{}'.format(i+1) for i in range(5)})
col_renames.update({'qty_id{}'.format(i+1): 'ingr_{}'.format(i+1) for i in range(5)})
# melt recipelist into long form
long_recipelist = recipelist_modified.melt(
    id_vars=id_vars,
    var_name='ingredient',
).dropna()
# add a new column to specify whether each row is a qty or an id
long_recipelist['kind'] = np.where(long_recipelist['ingredient'].str.contains('qty'), 'qty_in', 'id_in')
# convert ingredient names
long_recipelist['ingredient'] = long_recipelist['ingredient'].map(col_renames)
# pivot on the new ingredient column
reshape_recipe_list = long_recipelist.pivot(
    index=['RecipeID', 'itemID_out', 'qty_out', 'ingredient'],
    columns='kind',
    values='value',
).reset_index()
# merge with the itemlist
priced_ingredients = pd.merge(reshape_recipe_list, itemlist_modified, left_on='id_in', right_on='itemID')
# pivot on the priced ingredients
priced_ingredients = priced_ingredients.pivot(
    index=['RecipeID', 'itemID_out', 'qty_out'],
    columns='ingredient',
)
# flatten the hierarchical columns
priced_ingredients.columns = ["_".join(a[::-1]) for a in priced_ingredients.columns.to_flat_index()]
priced_ingredients.columns.name = ''
priced_ingredients = priced_ingredients.reset_index()
priced_ingredients partial output:

Pandas DataFrame pivot (reshape?)

I can't seem to get this right... here's what I'm trying to do:
import pandas as pd
df = pd.DataFrame({
    'item_id': [1, 1, 3, 3, 3],
    'contributor_id': [1, 2, 1, 4, 5],
    'contributor_role': ['sing', 'laugh', 'laugh', 'sing', 'sing'],
    'metric_1': [80, 90, 100, 92, 50],
    'metric_2': [180, 190, 200, 192, 150],
})
--->
   item_id  contributor_id contributor_role  metric_1  metric_2
0        1               1             sing        80       180
1        1               2            laugh        90       190
2        3               1            laugh       100       200
3        3               4             sing        92       192
4        3               5             sing        50       150
And I want to reshape it into:
item_id SING_1_contributor_id SING_1_metric_1 SING_1_metric_2 SING_2_contributor_id SING_2_metric_1 SING_2_metric_2 ... LAUGH_1_contributor_id LAUGH_1_metric_1 LAUGH_1_metric_2 ... <LAUGH_2_...>
0 1 1 80 180 N/A N/A N/A ... 2 90 190 ... N/A..
1 3 4 92 192 5 50 150 ... 1 100 200 ... N/A..
Basically, for each item_id, I want to collect all relevant data into a single row. Each item could have multiple types of contributors, and there is a max for each type (e.g. max SING contributor = A per item, max LAUGH contributor = B per item). There are a set of metrics tied to each contributor (but for the same contributor, the values could be different across different items / contributor types).
I can probably achieve this through some seemingly inefficient methods (e.g. looping and matching then populating a template df), but I was wondering if there is a more efficient way to achieve this, potentially through cleverly specifying the index / values / columns in the pivot operation (or any other method..).
Thanks in advance for any suggestions!
EDIT:
Ended up adapting Ben's script below into the following:
df['role_count'] = df.groupby(['item_id', 'contributor_role']).cumcount().add(1).astype(str)
df['contributor_role'] = df.apply(lambda row: row['contributor_role'] + '_' + row['role_count'], axis=1)
df = df.set_index(['item_id','contributor_role']).unstack()
df.columns = ['_'.join(x) for x in df.columns.values]
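If the flattened names should instead match the SING_1_contributor_id pattern from the question exactly, the last line could join the parts in reverse and uppercase the role key (a hypothetical variant of the edit above, not from the original post):
df.columns = [f"{role.upper()}_{val}" for val, role in df.columns.values]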
You can create the additional key with cumcount, then unstack:
df['newkey']=df.groupby('item_id').cumcount().add(1).astype(str)
df['contributor_id']=df['contributor_id'].astype(str)
s = df.set_index(['item_id','newkey']).unstack().sort_index(level=1,axis=1)
s.columns=s.columns.map('_'.join)
s
Out[38]:
contributor_id_1 contributor_role_1 ... metric_1_3 metric_2_3
item_id ...
1 1 sing ... NaN NaN
3 1 laugh ... 50.0 150.0

Python DataFrames concat or append problem

I have a problem with dataframes in Python. I am trying to copy certain rows to a new dataframe but I can't figure it out.
There are 2 arrays:
pokemon_data
# HP Attack Defense Sp. Atk Sp. Def Speed
0 1 45 49 49 65 65 45
1 2 60 62 63 80 80 60
2 3 80 82 83 100 100 80
3 4 80 100 123 122 120 80
4 5 39 52 43 60 50 65
... ... ... ... ... ... ... ...
795 796 50 100 150 100 150 50
796 797 50 160 110 160 110 110
797 798 80 110 60 150 130 70
798 799 80 160 60 170 130 80
799 800 80 110 120 130 90 70
800 rows × 7 columns
combats_data
First_pokemon Second_pokemon Winner
0 266 298 1
1 702 701 1
2 191 668 1
3 237 683 1
4 151 231 0
... ... ... ...
49995 707 126 0
49996 589 664 0
49997 303 368 1
49998 109 89 0
49999 9 73 0
50000 rows × 3 columns
I created a third dataset with the following columns:
output1
HP0 Attack0 Defens0 Sp. Atk0 Sp. Def0 Speed0 HP1 Attack1 Defense1 Sp. Atk1 Sp. Def1 Speed1 Winner
What I'm trying to do is copy attributes from pokemon_data to output1 in order from combats_data.
HP0 and HP1 are respectively the HP of the first Pokemon and the HP of the second Pokemon.
I want to use that data in neural networks with TensorFlow to predict what Pokemon would win.
For this type of wrangling, you should first "melt" or "tidy" the combats_data so each ID has its own row, then do a "join" or "merge" of the two dataframes.
You didn't provide a minimal reproducible example, so here's mine:
import pandas as pd
df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                    'var1': [10, 20, 30, 40, 50],
                    'var2': [15, 25, 35, 45, 55]})
df2 = pd.DataFrame({'id1': [1, 2],
                    'id2': [3, 4],
                    'outcome': [1, 4]})
df2tidy = pd.melt(df2, id_vars=['outcome'], value_vars=['id1', 'id2'],
                  var_name='name', value_name='id')
df2tidy
# outcome name id
# 0 1 id1 1
# 1 4 id1 2
# 2 1 id2 3
# 3 4 id2 4
output = pd.merge(df2tidy, df1, on='id')
output
# outcome name id var1 var2
# 0 1 id1 1 10 15
# 1 4 id1 2 20 25
# 2 1 id2 3 30 35
# 3 4 id2 4 40 45
You could then train some sort of classifier on outcome.
(Btw, you should make outcome a 0 or 1 (for pokemon1 vs pokemon2) instead of the actual ID of the winner.)
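For that 0/1 label, here is a hedged sketch using the toy df2 above (applied before the melt, since outcome currently holds the winning ID):
# 1 if the second pokemon won, else 0; assumes outcome always matches id1 or id2
df2['outcome'] = (df2['outcome'] == df2['id2']).astype(int)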
So I would like to create a new array based on these two arrays. For example:
# ids represent pokemons and their attributes
pokemons = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                         'HP': [10, 20, 30, 40, 50],
                         'Attack': [15, 25, 35, 45, 55],
                         'Defense': [25, 15, 45, 15, 35]})
# here 0 or 1 represents whether the first or second pokemon won
combats = pd.DataFrame({'id1': [1, 2],
                        'id2': [3, 4],
                        'winner': [0, 1]})
# in the output data I want to replace ids with attributes; the order is based on the combats array
output = pd.DataFrame({'HP1': [10, 20],
                       'Attack1': [15, 25],
                       'Defense1': [25, 15],
                       'HP2': [30, 40],
                       'Attack2': [35, 45],
                       'Defense2': [45, 15],
                       'winner': [0, 1]})
I'm not sure if this is the correct way to think about it. I want to train a neural network to figure out which pokemon will win.
This is a solution from user part on the 4programmers.net forum:
import pandas as pd

if __name__ == "__main__":
    pokemon_data = pd.DataFrame({
        "Id": [1, 2, 3, 4, 5],
        "HP": [45, 60, 80, 80, 39],
        "Attack": [49, 62, 82, 100, 52],
        "Defense": [49, 63, 83, 123, 43],
        "Sp. Atk": [65, 80, 100, 122, 60],
        "Sp. Def": [65, 80, 100, 120, 50],
        "Speed": [45, 60, 80, 80, 65]})
    combats_data = pd.DataFrame({
        "First_pokemon": [1, 2, 3],
        "Second_pokemon": [2, 3, 4],
        "Winner": [1, 0, 1]})
    output = pokemon_data.merge(combats_data, left_on="Id", right_on="First_pokemon")
    output = output.merge(pokemon_data, left_on="Second_pokemon", right_on="Id",
                          suffixes=("_pokemon1", "_pokemon2"))
    print(output)
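As a possible follow-up (a sketch, assuming the merged column names produced above), you could drop the ID columns and split features from the label before handing the data to TensorFlow:
# the suffixed Id columns and the raw match-up IDs carry no stat information
features = output.drop(columns=["Id_pokemon1", "Id_pokemon2",
                                "First_pokemon", "Second_pokemon"])
X = features.drop(columns=["Winner"]).to_numpy()
y = features["Winner"].to_numpy()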

pandas, numpy round down to nearest 100

I created a dataframe column with the code below, and was trying to figure out how to round it down to the nearest hundred.
...
# This prints out my new value rounded to the nearest whole number.
df['new_values'] = (10000/df['old_values']).apply(numpy.floor)
# How do I get it to round down to the nearest hundred instead?
# i.e. 8450 rounded down to 8400
You need to divide by 100, convert to int, and finally multiply by 100:
df['new_values'] = (df['old_values'] / 100).astype(int) *100
Same as (for non-negative values, since astype(int) truncates toward zero while floor always rounds down):
df['new_values'] = (df['old_values'] / 100).apply(np.floor).astype(int) *100
Sample:
df = pd.DataFrame({'old_values':[8450, 8470, 343, 573, 34543, 23999]})
df['new_values'] = (df['old_values'] / 100).astype(int) *100
print (df)
   old_values  new_values
0        8450        8400
1        8470        8400
2         343         300
3         573         500
4       34543       34500
5       23999       23900
5 23999 23900
EDIT:
df = pd.DataFrame({'old_values':[3, 6, 89, 573, 34, 23]})
# show the intermediate division result for verification
df['new_values1'] = (10000/df['old_values'])
df['new_values'] = (10000/df['old_values']).div(100).astype(int).mul(100)
print (df)
   old_values  new_values1  new_values
0           3  3333.333333        3300
1           6  1666.666667        1600
2          89   112.359551         100
3         573    17.452007           0
4          34   294.117647         200
5          23   434.782609         400
Borrowing @jezrael's sample dataframe:
df = pd.DataFrame({'old_values':[8450, 8470, 343, 573, 34543, 23999]})
Use floordiv or //
df // 100 * 100
   old_values
0        8400
1        8400
2         300
3         500
4       34500
5       23900
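One caveat (a small illustrative check, not from the original answers): for negative values // floors toward negative infinity, while the astype(int) approach truncates toward zero, so the two methods diverge below zero:
import pandas as pd

s = pd.Series([-150, 150])
print(s // 100 * 100)               # -200, 100
print((s / 100).astype(int) * 100)  # -100, 100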
I've tried something similar using the math module:
import math

a = [123, 456, 789, 145]

def rdl(x):
    ls = []
    for i in x:
        ls.append(math.floor(i / 100) * 100)
    return ls

rdl(a)
# Output: [100, 400, 700, 100]
Hope this provides some idea. It's very similar to the solution provided by @jezrael.
