I would like to calculate portfolio weights with a pandas dataframe. Here is some dummy data for an example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': ['ann', 'bob'] * 3}).sort_values('name').reset_index(drop=True)
df2 = pd.DataFrame({'stock': list('ABC') * 2})
df3 = pd.DataFrame({'val': np.random.randint(10, 100, 6)})
df = pd.concat([df1, df2, df3], axis=1)
Each person owns 3 stocks with a value val. We can calculate portfolio weights like this:
df.groupby('name').apply(lambda x: x.val / x.val.sum())
which gives a Series of weights indexed by name and the original row index.
If I want to add a column wgt to df I need to merge this result back to df on name and index. This seems rather clunky.
Is there a way to do this in one step? Or what is the way to do this that best utilizes pandas features?
Use transform; this will return a Series with an index aligned to your original df:
In [114]:
df['wgt'] = df.groupby('name')['val'].transform(lambda x: x/x.sum())
df
Out[114]:
name stock val wgt
0 ann A 18 0.131387
1 ann B 43 0.313869
2 ann C 76 0.554745
3 bob A 16 0.142857
4 bob B 44 0.392857
5 bob C 52 0.464286
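As an aside, the same weights can be computed without a lambda by dividing by the broadcast group sums; a minimal sketch, assuming the same df:
# transform('sum') returns each group's total aligned to the original rows
df['wgt'] = df['val'] / df.groupby('name')['val'].transform('sum')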
I am using the code below to search a .csv file, match a column between the two files, grab a different column I want, and add it as a new column. However, I am trying to make the match based on two columns instead of one. Is there a way to do this?
import pandas as pd
df1 = pd.read_csv("matchone.csv")
df2 = pd.read_csv("comingfrom.csv")
def lookup_prod(ip):
    for row in df2.itertuples():
        if ip in row[1]:
            return row[3]
    else:
        return '0'
df1['want'] = df1['name'].apply(lookup_prod)
df1[df1.want != '0']
print(df1)
#df1.to_csv('file_name.csv')
The code above searches on the column 'name' (which has the same name in both files) and grabs the column I request ([3]) from df2. I want the code to match on both column 'name' and another column 'price', and only take the value in ([3]) if both columns match between df1 and df2.
df 1 :
name price value
a 10 35
b 10 21
c 10 33
d 10 20
e 10 88
df 2 :
name price want
a 10 123
b 5 222
c 10 944
d 10 104
e 5 213
When the code is run (asking for the want column from df2, based only on whether df1 name = df2 name), the produced result is:
name price value want
a 10 35 123
b 10 21 222
c 10 33 944
d 10 20 104
e 10 88 213
However, what I want is if both df1 name = df2 name and df1 price = df2 price, then take the column df2 want, so the desired result is:
name price value want
a 10 35 123
b 10 21 0
c 10 33 944
d 10 20 104
e 10 88 0
You need to use the pandas.DataFrame.merge() method with multiple keys:
df1.merge(df2, on=['name','price'], how='left').fillna(0)
The method represents missing values as NaN, so the want column's dtype changes to float64, but you can change it back after filling the missing values with 0.
Also please be aware that duplicated combinations of name and price in df2 will appear several times in the result.
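For instance, a minimal sketch of restoring the integer dtype after the fill (assuming the same df1 and df2) could be:
merged = df1.merge(df2, on=['name', 'price'], how='left')
merged['want'] = merged['want'].fillna(0).astype(int)  # back from float64 to int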
If you are matching the two dataframes based on name and price, you can use df.where and df.isin:
df1['want'] = df2['want'].where(df1[['name','price']].isin(df2).all(axis=1)).fillna('0')
df1
name price value want
0 a 10 35 123.0
1 b 10 21 0
2 c 10 33 944.0
3 d 10 20 104.0
4 e 10 88 0
Expanding on https://stackoverflow.com/a/73830294/20110802:
You can add the validate option to the merge in order to avoid duplication on one side (or both):
pd.merge(df1, df2, on=['name','price'], how='left', validate='1:1').fillna(0)
Also, if the float conversion is a problem for you, one option is to do an inner join first and then pd.concat the result with the "leftover" part of df1, to which you have already added a constant-valued column. It would look something like:
df_inner = pd.merge(df1, df2, on=['name', 'price'], how='inner', validate='1:1')
merged_pairs = set(zip(df_inner.name, df_inner.price))
df_anti = df1.loc[~pd.Series(zip(df1.name, df1.price)).isin(merged_pairs)].copy()  # .copy() avoids SettingWithCopyWarning
df_anti['want'] = 0
df_result = pd.concat([df_inner, df_anti])  # perhaps ignore_index=True ?
Looks complicated, but should be quite performant because it filters via a set. I think there might be a possibility to set name and price as the index, merge on the index, and then filter by index to avoid the zip-set shenanigans, but I'm no expert on MultiIndex handling.
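For what it's worth, a rough sketch of that index-based variant (an assumption, not part of the original answer) could be:
idx1 = df1.set_index(['name', 'price'])
idx2 = df2.set_index(['name', 'price'])
df_inner = idx1.join(idx2, how='inner')                          # rows present in both
df_anti = idx1.loc[~idx1.index.isin(idx2.index)].assign(want=0)  # rows only in df1
df_result = pd.concat([df_inner, df_anti]).reset_index()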
# Try this code; it will give you the expected results
import pandas as pd
df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 10, 10, 10, 10],
                    'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 5, 10, 10, 5],
                    'want': [123, 222, 944, 104, 213]})
new = pd.merge(df1, df2, how='left', left_on=['name', 'price'], right_on=['name', 'price'])
print(new.fillna(0))
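Given these inputs, the printed result should look something like:
  name  price  value   want
0    a     10     35  123.0
1    b     10     21    0.0
2    c     10     33  944.0
3    d     10     20  104.0
4    e     10     88    0.0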
I have a DataFrame that looks like this:
df:
amount info
12 {id:'1231232', type:'trade', amount:12}
14 {id:'4124124', info:{operation_type:'deposit'}}
What I want to achieve is this:
df:
amount type operation_type
12 trade Nan
14 Nan deposit
I have tried the df.explode('info') method but with no luck. Are there any other ways to do this?
We could do it in 2 steps: (i) build a DataFrame df from the data; (ii) use json_normalize on the "info" column and join the result back to df:
df = pd.DataFrame(data)
out = df.join(pd.json_normalize(df['info'].tolist())[['type', 'info.operation_type']]).drop(columns='info')
out.columns = out.columns.map(lambda x: x.split('.')[-1])
Output:
amount type operation_type
0 12 trade NaN
1 14 NaN deposit
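For reference, a minimal data matching the question (an assumption, since the original construction isn't shown) would be:
# json_normalize flattens the nested dict into the dotted column 'info.operation_type'
data = {
    'amount': [12, 14],
    'info': [
        {'id': '1231232', 'type': 'trade', 'amount': 12},
        {'id': '4124124', 'info': {'operation_type': 'deposit'}},
    ],
}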
I'm not sure if this question has been answered before; please help me solve it. I have tried my best to explain it.
My query, which I want solved in Python, is:
I need to left merge a dataframe with 3 other dataframes.
The tricky part is that all the dataframes have the same column headers, and I want each same-named column to overwrite the corresponding column from the preceding dataframe in my output.
But when I use a left merge in Python, the column headers of all the dataframes appear with the suffixes "_x" and "_y".
Below are my 4 dataframes:
df1 = pd.DataFrame({"Fruits":['apple','banana','mango','strawberry'],
"Price":[100,50,60,70],
"Count":[1,2,3,4],
"shop_id":['A','A','A','A']})
df2 = pd.DataFrame({"Fruits":['apple','banana','mango','chicku'],
"Price":[10,509,609,1],
"Count":[8,9,10,11],
"shop_id":['B','B','B','B']})
df3 = pd.DataFrame({"Fruits":['apple','banana','chicku'],
"Price":[1000,5090,10],
"Count":[5,6,7],
"shop_id":['C','C','C']})
df4 = pd.DataFrame({"Fruits":['apple','strawberry','mango','chicku'],
"Price":[50,51,52,53],
"Count":[11,12,13,14],
"shop_id":['D','D','D','D']})
Now I want to left join df1 with df2, df3 and df4.
from functools import reduce
data_frames = [df1, df2, df3, df4]
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['Fruits'], how='left'),
                   data_frames)
But this produces an output in which the same columns are repeated with the suffixes _x and _y.
I want only a single Price, Count and shop_id column.
It looks like what you want is combine_first, not merge:
from functools import reduce
data_frames = [df1, df2, df3, df4]
df_merged = reduce(lambda left, right: right.set_index('Fruits').combine_first(left.set_index('Fruits')).reset_index(),
                   data_frames)
output:
Fruits Price Count shop_id
0 apple 50 11 D
1 banana 5090 6 C
2 chicku 53 14 D
3 mango 52 13 D
4 strawberry 51 12 D
To filter the output to get only the keys from df1:
df_merged.set_index('Fruits').loc[df1['Fruits']].reset_index()
output:
Fruits Price Count shop_id
0 apple 50 11 D
1 banana 5090 6 C
2 mango 52 13 D
3 strawberry 51 12 D
NB: everything would actually be easier if you set Fruits as the index.
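A minimal sketch of that index-based version (same dataframes as above):
from functools import reduce

# keep Fruits as the index throughout; later frames overwrite earlier ones
dfs = [d.set_index('Fruits') for d in (df1, df2, df3, df4)]
df_merged = reduce(lambda left, right: right.combine_first(left), dfs)
# restrict to df1's fruits and restore the column, as above
df_merged = df_merged.loc[df1['Fruits']].reset_index()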
I have a dataframe:
ID 2016-01 2016-02 ... 2017-01 2017-02 ... 2017-10 2017-11 2017-12
111 12 34 0 12 3 0 0
222 0 32 5 5 0 0 0
I need to sum every 12 columns (one year each) and get
ID 2016 2017
111 46 15
222 32 10
I tried to use
(df.groupby((np.arange(len(df.columns)) // 31) + 1, axis=1).sum().add_prefix('s'))
But it applies to all of the columns, including ID.
But when I try to use
df.groupby['ID']((np.arange(len(df.columns)) // 31) + 1, axis=1).sum().add_prefix('s'))
It returns
TypeError: 'method' object is not subscriptable
How can I fix that?
First, set_index with all the columns that are not dates:
df = df.set_index('ID')
1. groupby by the columns split on '-', selecting the first part:
df = df.groupby(df.columns.str.split('-').str[0], axis=1).sum()
2. a lambda function for the split:
df = df.groupby(lambda x: x.split('-')[0], axis=1).sum()
3. convert the columns to datetimes and groupby the years:
df.columns = pd.to_datetime(df.columns)
df = df.groupby(df.columns.year, axis=1).sum()
4. resample by years:
df.columns = pd.to_datetime(df.columns)
df = df.resample('A', axis=1).sum()
df.columns = df.columns.year
print (df)
2016 2017
ID
111 46 15
222 32 10
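Note that axis=1 in groupby is deprecated in recent pandas; an equivalent that groups the transpose instead (a sketch, assuming the same df with ID already set as the index) is:
# group the transposed rows by their year prefix, sum, then transpose back
out = df.T.groupby(df.columns.str.split('-').str[0]).sum().T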
The above code has a slight syntax error and throws the following error:
ValueError: No axis named 1 for object type
Basically, the groupby condition needs to be wrapped in []. So I'm rewriting the code correctly for convenience:
new_df = df.groupby([[i//n for i in range(0, m)]], axis=1).sum()
If you don't mind losing the labels, you can try this instead:
new_df = df.groupby([i//n for i in range(0, m)], axis=1).sum()
In both cases, n is the number of columns you want to group together and m is the total number of columns being grouped; you have to rename the columns after that.
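As a concrete (hypothetical) usage with the 24 monthly columns from the question, after setting ID as the index:
n, m = 12, 24  # 12 months per group, 24 monthly columns in total
new_df = df.set_index('ID').groupby([i//n for i in range(0, m)], axis=1).sum()
new_df.columns = ['2016', '2017']  # rename the columns afterwards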
I get the following dataframe:
category_name amount
Blades & Razors & Foam 158
Diaper 486
Empty 193
Fem Care 2755
HairCare 3490
Irrelevant 1458
Laundry 889
Oral Care 2921
Others 69
Personal Cleaning Care 1543
Skin Care 645
I want to add it as a row to the following dataframe, which has an additional retailer column that is absent from the first dataframe.
categories_columns = ['retailer'] + self.product_list.category_name.unique().tolist()
categories_df = pd.DataFrame(columns=categories_columns)
And if some category is missing, I just want a zero value.
Any ideas?
Use set_index to move the category_name column into the index. Then taking the transpose (.T) will move the category_names into the column index:
In [35]: df1
Out[35]:
amount cat
0 0 A
1 1 B
2 2 C
In [36]: df1.set_index('cat').T
Out[36]:
cat A B C
amount 0 1 2
Once the category names (cat, above) are in the column index, you can concatenate the reshaped DataFrame with the second DataFrame using append or pd.concat.
pd.concat fills missing values with NaN; use fillna(0) to replace the NaNs with 0.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'amount': range(3), 'cat': list('ABC')})
df2 = pd.DataFrame(np.arange(2*4).reshape(2, 4), columns=list('ABCD'))
result = df2.append(df1.set_index('cat').T).fillna(0)
print(result)
yields
A B C D
0 0 1 2 3.0
1 4 5 6 7.0
amount 0 1 2 0.0
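As an aside, DataFrame.append was removed in pandas 2.0; the pd.concat equivalent of the line above is:
result = pd.concat([df2, df1.set_index('cat').T]).fillna(0)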
Just append and replace NaN (products here is your list of category columns):
pd.DataFrame(columns=products).append(df.T).fillna(0)