I have the following dataframe
df:
group people value value_50
1 5 100 1
2 2 90 1
1 10 80 1
2 20 40 0
1 7 10 0
2 23 30 0
I am trying to apply sklearn's MinMaxScaler to one column, restricted to the rows that satisfy a condition, and then join the result back onto my original data by its pandas index.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
After copying the above data
data = pd.read_clipboard()
minmax = MinMaxScaler(feature_range=(0,10))
''' Applying a filter on "group" and then apply minmax only on those values '''
val = pd.DataFrame(minmax.fit_transform(data[data['group'] == 1][['value']])
,columns = ['val_minmax'] )
But it looks like we lose the index after the minmax
val
val_minmax
0 10.000000
1 7.777778
2 0.000000
where index in my original dataset on this filter is
data[data['group'] == 1]['value']
output:
0 100
2 80
4 10
Desired dataset:
df_out:
group people value value_50 val_minmax
1 5 100 1 10
2 2 90 1 na
1 10 80 1 7.78
2 20 40 0 na
1 7 10 0 0
2 23 30 0 na
How can I join these values back to the matching rows of the original data, so that I get the output above?
You just need to assign it back using a boolean mask with .loc; rows outside the mask are left as NaN:
df.loc[df.group==1,'val_minmax']=minmax.fit_transform(df[df['group'] == 1][['value']])
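A minimal end-to-end sketch of that assignment, assuming the sample frame is built inline instead of being read from the clipboard. The index is lost in the question because fit_transform returns a plain NumPy array of shape (n, 1); the sketch flattens it with ravel() before assigning, which is an addition here and not part of the answer above. The names mask and scaled are only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample data from the question, built inline (assumption: no clipboard).
df = pd.DataFrame({
    "group": [1, 2, 1, 2, 1, 2],
    "people": [5, 2, 10, 20, 7, 23],
    "value": [100, 90, 80, 40, 10, 30],
    "value_50": [1, 1, 1, 0, 0, 0],
})

minmax = MinMaxScaler(feature_range=(0, 10))
mask = df["group"] == 1

# fit_transform returns a 2-D array without an index; flatten it to 1-D.
scaled = minmax.fit_transform(df.loc[mask, ["value"]]).ravel()

# Assigning through the same mask puts each value back on its original row;
# rows where the mask is False get NaN in the new column.
df.loc[mask, "val_minmax"] = scaled
print(df)
```

Because the same mask selects the rows for scaling and for assignment, the order of the scaled values matches the order of the target rows.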
I have missing values in one column that I would like to fill by random sampling from a source distribution:
import pandas as pd
import numpy as np
source = pd.DataFrame({'age':5*[21],
'location':[0,0,1,1,1],
'x':[1,2,3,4,4]})
source
age location x
0 21 0 1
1 21 0 2
2 21 1 3
3 21 1 4
4 21 1 4
target = pd.DataFrame({'age':5*[21],
'location':[0,0,0,1,2],
'x':5*[np.nan]})
target
age location x
0 21 0 NaN
1 21 0 NaN
2 21 0 NaN
3 21 1 NaN
4 21 2 NaN
Now I need to fill in the missing values of x in the target dataframe by choosing, with replacement, a random value of x from the source rows that have the same age and location as the row with the missing x. If no row in source has the same age and location, the value should be left as missing.
Expected output:
age location x
0 21 0 1 with probability 0.5 2 otherwise
1 21 0 1 with probability 0.5 2 otherwise
2 21 0 1 with probability 0.5 2 otherwise
3 21 1 3 with probability 0.33 4 otherwise
4 21 2 NaN
I can loop through all the missing combinations of age and location and slice the source dataframe and then take a random sample, but my dataset is large enough that it takes quite a while to do.
Is there a better way?
You can set a MultiIndex on both DataFrames and then, in a custom function passed to GroupBy.transform, replace the NaNs in each group with values drawn from the matching rows of the other DataFrame using numpy.random.choice:
source = pd.DataFrame({'age':5*[21],
'location':[0,0,1,1,1],
'x':[1,2,3,4,4]})
target = pd.DataFrame({'age':5*[21],
'location':[0,0,0,1,2],
'x':5*[np.nan]})
cols = ['age', 'location']
source1 = source.set_index(cols)['x']
target1 = target.set_index(cols)['x']
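def f(x):
    try:
        # x.name is the (age, location) key of the current group
        a = source1.loc[x.name].to_numpy()
        m = x.isna()
        # draw one value per missing entry, with replacement
        x[m] = np.random.choice(a, size=m.sum())
        return x
    except KeyError:
        # no matching (age, location) in source: leave as missing
        return np.nan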
target1 = target1.groupby(level=[0,1]).transform(f).reset_index()
print (target1)
age location x
0 21 0 1.0
1 21 0 2.0
2 21 0 2.0
3 21 1 3.0
4 21 2 NaN
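For anyone who wants repeatable draws, here is a hedged variant of the same groupby/transform idea. It swaps the MultiIndex lookup for a plain dict of candidate pools and uses a seeded NumPy Generator; the seed value and the names pools, fill_group and filled are assumptions made for this sketch, not part of the answer above.

```python
import numpy as np
import pandas as pd

source = pd.DataFrame({"age": 5 * [21],
                       "location": [0, 0, 1, 1, 1],
                       "x": [1, 2, 3, 4, 4]})
target = pd.DataFrame({"age": 5 * [21],
                       "location": [0, 0, 0, 1, 2],
                       "x": 5 * [np.nan]})

cols = ["age", "location"]
rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Map each (age, location) key to the array of candidate x values.
pools = {key: grp.to_numpy() for key, grp in source.groupby(cols)["x"]}

def fill_group(x):
    # x.name holds the (age, location) key of the current group.
    pool = pools.get(x.name)
    if pool is None:
        return x  # key absent from source: keep the NaNs
    out = x.copy()  # avoid mutating the group in place
    missing = out.isna()
    out[missing] = rng.choice(pool, size=missing.sum())  # with replacement
    return out

filled = target.copy()
filled["x"] = target.groupby(cols)["x"].transform(fill_group)
print(filled)
```

Duplicated values in a pool (the two 4s for location 1) are kept on purpose, so they are drawn proportionally more often, as the question asks.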
You can create a common grouper and perform a merge:
cols = ['age', 'location']
(target[cols]
.assign(group=target.groupby(cols).cumcount()) # compute subgroup for duplicates
.merge((# below: assigns a random row group
source.assign(group=source.sample(frac=1).groupby(cols, sort=False).cumcount())
.groupby(cols+['group'], as_index=False) # get one row per group
.first()
),
on=cols+['group'], how='left') # merge
#.drop('group', axis=1) # column kept for clarity, uncomment to remove
)
output (from a run on different sample data than shown above):
age location group x
0 20 0 0 0.339955
1 20 0 1 0.700506
2 21 0 0 0.777635
3 22 1 0 NaN
I have this dataframe:
id start end
1 1 2
1 13 27
1 30 35
1 36 40
2 2 5
2 8 10
2 25 30
I want to group by id and merge consecutive rows where the difference between the end of row n-1 and the start of row n is less than 10, for example. I already found a way using a loop, but it is far too slow with over a million rows.
So the expected outcome would be:
id start end
1 1 2
1 13 40
2 2 10
2 25 30
First, I can get the required difference with df['diff']=df['start'].shift(-1)-df['end']. How can I group rows on that condition separately for each id?
Thanks !
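I believe you can build group labels by subtracting the previous end within each id (via DataFrameGroupBy.shift) from start, flagging gaps greater than 10, and taking the cumulative sum, then pass those labels to GroupBy.agg (use ge(10) instead of gt(10) if a gap of exactly 10 should also start a new group):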
g = df['start'].sub(df.groupby('id')['end'].shift()).gt(10).cumsum()
df = (df.groupby(['id',g])
.agg({'start':'first', 'end': 'last'})
.reset_index(level=1, drop=True)
.reset_index())
print (df)
id start end
0 1 1 2
1 1 13 40
2 2 2 10
3 2 25 30
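To see why the cumulative sum works, here is a self-contained sketch on the question's data that keeps the intermediate steps visible. The names gap, block and merged are illustrative, and storing the label as a temporary column is just one way to write it.

```python
import pandas as pd

df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2, 2],
    "start": [1, 13, 30, 36, 2, 8, 25],
    "end":   [2, 27, 35, 40, 5, 10, 30],
})

# Gap between each row's start and the previous row's end within the same id.
# The first row of every id has no previous row, so its gap is NaN.
gap = df["start"] - df.groupby("id")["end"].shift()

# True marks the start of a new block; NaN > 10 evaluates to False.
# The running sum turns those flags into block labels.
block = gap.gt(10).cumsum()

merged = (
    df.assign(block=block)
      .groupby(["id", "block"], as_index=False)
      .agg({"start": "first", "end": "last"})
      .drop(columns="block")
)
print(merged)
```

The labels keep counting across ids, which is harmless because id is also a grouping key.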
I am joining two tables left_table and right_table on non-unique keys that results in row explosion. I then want to aggregate rows to match the number of rows in left_table. To do this I aggregate over left_table columns.
Weirdly, when I save the table the columns in left_table double. It seems like columns of left_table become an index for resulting dataframe...
Left table
k1 k2 s v c target
0 1 3 20 40 2 2
1 1 2 10 20 1 1
2 1 2 10 80 2 1
Right table
k11 k22 s2 v2
0 1 2 0 100
1 2 3 30 200
2 1 2 10 300
Left join
k1 k2 s v c target s2 v2
0 1 3 20 40 2 2 NaN NaN
1 1 2 10 20 1 1 0.0 100.0
2 1 2 10 20 1 1 10.0 300.0
3 1 2 10 80 2 1 0.0 100.0
4 1 2 10 80 2 1 10.0 300.0
Aggregation code
dic = {}
keys_to_agg_over = left_table_col_names
for col in numeric_cols:
    if col in all_cols:
        dic[col] = 'median'
left_join = left_join.groupby(keys_to_agg_over).aggregate(dic)
After aggregation (doubled number of left table cols)
k1 k2 s v c target s2 v2
k1 k2 s v c target
1 2 10 20 1 1 1 2 10 20 1 1 5.0 200.0
80 2 1 1 2 10 80 2 1 5.0 200.0
3 20 40 2 2 1 3 20 40 2 2 NaN NaN
Saved to csv file
k1,k2,s,v,c,target,k1,k2,s,v,c,target,s2,v2
1,2,10,20,1,1,1,2,10,20,1,1,5.0,200.0
1,2,10,80,2,1,1,2,10,80,2,1,5.0,200.0
1,3,20,40,2,2,1,3,20,40,2,2,,
I tried resetting index, as left_join.reset_index() but I get
ValueError: cannot insert target, already exists
How to fix the issue of column-doubling?
You have a couple of options:
Store the csv without the index: I guess you are using the to_csv method to write the result. By default it includes your index columns in the generated csv; you can call to_csv(index=False) to avoid storing them.
Reset the index and drop it: you can use left_join.reset_index(drop=True) to discard the index columns instead of adding them to the dataframe. By default reset_index inserts the current index levels as columns, and because columns with those names already exist, that raises the ValueError you get.
It seems like you are using:
left_join = left_table.merge(right_table, left_on=["k1", "k2"], right_on=["k11", "k22"], how="left")
This will result in a dataframe with repeated rows since indexes 1 and 2 from the left table both can be joined to indexes 0 and 2 of the right table. If that is the behavior you expected, and just want to get rid of duplicated rows you can try using:
left_join = left_join.drop_duplicates()
before aggregating. Note that this does not prevent the join from duplicating rows; it only removes rows that are identical in every column, so rows that differ in s2 or v2 are kept.
You can also pass as_index=False to groupby, like this:
left_join = left_join.groupby(keys_to_agg_over, as_index = False).aggregate(dic)
to stop getting the grouping columns as the index.
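A hedged sketch of the whole flow on the question's tables. It assumes the doubling comes from the grouping keys also being listed in the aggregation dict, so the dict is built only from the non-key columns; dropping k11 and k22 after the merge and the name result are likewise assumptions made for this example.

```python
import pandas as pd

left_table = pd.DataFrame({
    "k1": [1, 1, 1], "k2": [3, 2, 2], "s": [20, 10, 10],
    "v": [40, 20, 80], "c": [2, 1, 2], "target": [2, 1, 1],
})
right_table = pd.DataFrame({
    "k11": [1, 2, 1], "k22": [2, 3, 2],
    "s2": [0, 30, 10], "v2": [100, 200, 300],
})

# Left join on non-unique keys; this is where the row explosion happens.
left_join = left_table.merge(
    right_table, left_on=["k1", "k2"], right_on=["k11", "k22"], how="left"
).drop(columns=["k11", "k22"])

keys_to_agg_over = list(left_table.columns)

# Aggregate only the columns that are NOT grouping keys, so each key
# appears once in the result instead of as both index and column.
dic = {col: "median" for col in left_join.columns
       if col not in keys_to_agg_over}

result = left_join.groupby(keys_to_agg_over, as_index=False).aggregate(dic)
print(result)
```

With the keys kept out of the dict, the result has one row per left_table row and each column appears once, so writing it with to_csv(index=False) gives a clean file.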
Say I have a dataframe df and group it by a few columns into dfg, taking the median of one of its columns. How could I then take those median values and expand them out so that they are in a new column of the original df, associated with the respective conditions? This will mean there are duplicates, but I will next be using this column for a subsequent calculation, and having these in a column will make this possible.
Example data:
import pandas as pd
import numpy as np
data = {'idx':[1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
'condition1':[1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4],
'condition2':[1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2],
'values':np.random.normal(0,1,16)}
df = pd.DataFrame(data)
dfg = df.groupby(['idx', 'condition2'], as_index=False)['values'].median()
example of desired result (note duplicates corresponding to correct conditions):
idx condition1 condition2 values medians
0 1 1 1 0.35031 0.656355
1 1 1 2 -0.291736 -0.024304
2 1 2 1 1.593545 0.656355
3 1 2 2 -1.275154 -0.024304
4 1 3 1 0.075259 0.656355
5 1 3 2 1.054481 -0.024304
6 1 4 1 0.9624 0.656355
7 1 4 2 0.243128 -0.024304
8 2 1 1 1.717391 1.155406
9 2 1 2 0.788847 1.006583
10 2 2 1 1.145891 1.155406
11 2 2 2 -0.492063 1.006583
12 2 3 1 -0.157029 1.155406
13 2 3 2 1.224319 1.006583
14 2 4 1 1.164921 1.155406
15 2 4 2 2.042239 1.006583
I believe you need GroupBy.transform with 'median' to create the new column:
df['medians'] = df.groupby(['idx', 'condition2'])['values'].transform('median')
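As a rough check, the sketch below compares transform with the longer route of merging the aggregated dfg back onto df. The seeded generator and the names keys and merged are assumptions made for the example; the data layout follows the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed so the run is repeatable
df = pd.DataFrame({
    "idx": [1] * 8 + [2] * 8,
    "condition1": [1, 1, 2, 2, 3, 3, 4, 4] * 2,
    "condition2": [1, 2] * 8,
    "values": rng.normal(0, 1, 16),
})

keys = ["idx", "condition2"]

# transform returns a result aligned to df's index, one value per row.
df["medians"] = df.groupby(keys)["values"].transform("median")

# Equivalent but longer: aggregate first, then merge back on the keys.
dfg = df.groupby(keys, as_index=False)["values"].median()
merged = df.drop(columns="medians").merge(
    dfg.rename(columns={"values": "medians"}), on=keys, how="left"
)

print(np.allclose(df["medians"], merged["medians"]))
```

transform skips the intermediate frame and the merge, which is why it is usually the shorter choice when the result only needs to be broadcast back to the original rows.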
I have a dataframe consisting of a few columns of custom calculations for a trading strategy. I want to add a new column called 'Signals' to this dataframe, consisting of 0s and 1s (long-only strategy). The signals will be generated by the following logic; each item in this code is a separate column in the dataframe:
if open_price > low_sigma.shift(1) and high_price > high_sigma.shift(1):
    signal = 1
else:
    signal = 0
From my understanding, if statements are not efficient for dataframes. In addition, I haven't been able to get this to output as desired. How do you recommend I generate the signal and add it to the dataframe?
You could assign df['Signals'] to the boolean condition itself, then use astype to convert the booleans to 0s and 1s:
df['Signals'] = (((df['open_price'] > df['low_sigma'].shift(1))
& (df['high_price'] > df['high_sigma'].shift(1)))
.astype('int'))
for example,
import pandas as pd
df = pd.DataFrame({
'open_price': [1,2,3,4],
'low_sigma': [1,3,2,4],
'high_price': [10,20,30,40],
'high_sigma': [10,40,20,30]})
# high_price high_sigma low_sigma open_price
# 0 10 10 1 1
# 1 20 40 3 2
# 2 30 20 2 3
# 3 40 30 4 4
mask = ((df['open_price'] > df['low_sigma'].shift(1))
& (df['high_price'] > df['high_sigma'].shift(1)))
# 0 False
# 1 True
# 2 False
# 3 True
# dtype: bool
df['Signals'] = mask.astype('int')
print(df)
yields
high_price high_sigma low_sigma open_price Signals
0 10 10 1 1 0
1 20 40 3 2 1
2 30 20 2 3 0
3 40 30 4 4 1