Cumulative product by group without groups' last row in pandas - python

I have a simple dataframe as the following:
import pandas as pd

n_obs = 3
dd = pd.DataFrame({
    'WTL_exploded': [0, 1, 2] * n_obs,
    'hazard': [0.3, 0.4, 0.5, 0.2, 0.8, 0.9, 0.6, 0.6, 0.65],
}, index=[1, 1, 1, 2, 2, 2, 3, 3, 3])
dd
I want to group by the index and get the cumulative product of the hazard column. However, I want to multiply all but the last element of each group.
Desired output:
index  hazard
1      0.3
1      0.12
2      0.2
2      0.16
3      0.6
3      0.36
How can I do that?

You can use:
out = dd.groupby(level=0, group_keys=False).apply(lambda x: x.cumprod().iloc[:-1])
Or:
out = dd.groupby(level=0).apply(lambda x: x.cumprod().iloc[:-1]).droplevel(1)
output:
   WTL_exploded  hazard
1             0    0.30
1             0    0.12
2             0    0.20
2             0    0.16
3             0    0.60
3             0    0.36
NB. you can also use lambda x: x.cumprod().head(-1).
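Spelled out, the head(-1) variant is just a drop-in replacement for iloc[:-1] (a sketch; it behaves identically here):
out = dd.groupby(level=0, group_keys=False).apply(lambda x: x.cumprod().head(-1))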

The solution I found is a bit intricate but works for the test case.
First, get rid of the last row of each group:
ff = dd.groupby(lambda x:x, as_index=False).apply(lambda x: x.iloc[:-1])
ff
Then, restore the original index, group by again and use pandas cumprod:
ff.reset_index().set_index('level_1').groupby(lambda x:x).cumprod()
Is there a more direct way?
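One more direct possibility (a sketch, assuming the groups are defined by the index as above) is to compute the cumulative product per group first and then drop the last row of each group with a boolean mask:
# duplicated(keep='last') is True for every row except the last one of each index group
mask = dd.index.duplicated(keep='last')
out = dd.groupby(level=0).cumprod()[mask]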

Related

Filling dataframe with average of previous columns values

I have a dataframe with 5 columns that contain missing values.
How do I fill the missing values with the average of the previous two columns' values?
Here is the sample code for the same.
import pandas as pd

coh0 = [0.5, 0.3, 0.1, 0.2, 0.2]
coh1 = [0.4, 0.3, 0.6, 0.5]
coh2 = [0.2, 0.2, 0.3]
coh3 = [0.8, 0.8]
coh4 = [0.5]
df = pd.DataFrame({'coh0': pd.Series(coh0), 'coh1': pd.Series(coh1),
                   'coh2': pd.Series(coh2), 'coh3': pd.Series(coh3),
                   'coh4': pd.Series(coh4)})
df
Here is the sample output
   coh0  coh1  coh2  coh3  coh4
0   0.5   0.4   0.2   0.8   0.5
1   0.3   0.3   0.2   0.8   NaN
2   0.1   0.6   0.3   NaN   NaN
3   0.2   0.5   NaN   NaN   NaN
4   0.2   NaN   NaN   NaN   NaN
Here is the desired result I am looking for.
Each NaN value in a column should be replaced by the average of the previous two columns' values in the same row. However, the first NaN value in the second column should simply take the last value of the first column.
The sample desired output would be like below.
For the exception you named, the first NaN, you can do
df.iloc[1, -1] = df.iloc[0, -1]
though it doesn't make a difference in this case as the mean of .2 and .8 is .5, anyway.
Either way, the rest is something like a rolling window calculation, except it has to be computed incrementally. Normally, you want to vectorize your operations and avoid iterating over the dataframe, but IMHO this is one of the rarer cases where it's actually appropriate to loop over the columns (cf. this excellent post), i.e.,
compute the row-wise (axis=1) mean of up to two columns left of the current one (df.iloc[:, max(0, i-2):i]),
and fill its NaN values from the resulting series.
for i in range(1, df.shape[1]):
    mean_df = df.iloc[:, max(0, i-2):i].mean(axis=1)
    df.iloc[:, i] = df.iloc[:, i].fillna(mean_df)
which results in
coh0 coh1 coh2 coh3 coh4
0 0.5 0.4 0.20 0.800 0.5000
1 0.3 0.3 0.20 0.800 0.5000
2 0.1 0.6 0.30 0.450 0.3750
3 0.2 0.5 0.35 0.425 0.3875
4 0.2 0.2 0.20 0.200 0.2000

Get mean of numpy array using pandas groupby

I have a DataFrame where one column is a numpy array of numbers. For example,
import numpy as np
import pandas as pd
df = pd.DataFrame.from_dict({
    'id': [1, 1, 2, 2, 3, 3, 3, 4, 4],
    'data': [np.array([0.43, 0.32, 0.19]),
             np.array([0.41, 0.11, 0.21]),
             np.array([0.94, 0.35, 0.14]),
             np.array([0.78, 0.92, 0.45]),
             np.array([0.32, 0.63, 0.48]),
             np.array([0.17, 0.12, 0.15]),
             np.array([0.54, 0.12, 0.16]),
             np.array([0.48, 0.16, 0.19]),
             np.array([0.14, 0.47, 0.01])]
})
I want to group by the id column and aggregate by taking the element-wise average of the arrays. Splitting the array up first is not feasible since it is length 300 and I have 200,000+ rows. When I do df.groupby('id').mean(), I get the error "No numeric types to aggregate". I am able to get an element-wise mean of the lists using df['data'].mean(), so I think there should be a way to do a grouped mean.
To clarify, I want the output to be an array for each value of id. Each element in the resulting array should be the mean of the values at the corresponding position within each group. In the example, the result should be:
pd.DataFrame.from_dict({
    'id': [1, 2, 3, 4],
    'data': [np.array([0.42, 0.215, 0.2]),
             np.array([0.86, 0.635, 0.29500000000000004]),
             np.array([0.3433333333333333, 0.29, 0.26333333333333336]),
             np.array([0.31, 0.315, 0.1])]
})
Could someone suggest how I might do this? Thanks!
Mean it twice, once at the array level and once at the group level:
df['data'].map(np.mean).groupby(df['id']).mean().reset_index()
id data
0 1 0.278333
1 2 0.596667
2 3 0.298889
3 4 0.241667
Based on the comment, you can do:
pd.DataFrame(df['data'].tolist(), index=df['id']).mean(level=0).agg(np.array, 1)
id
1 [0.42, 0.215, 0.2]
2 [0.86, 0.635, 0.29500000000000004]
3 [0.3433333333333333, 0.29, 0.26333333333333336]
4 [0.31, 0.315, 0.1]
dtype: object
Or:
df.groupby("id")['data'].apply(np.mean)
First, splitting up the array is feasible, because your current storage keeps a complex object (the whole array) in each cell of the DataFrame. That takes a lot more space than simply storing the flat 2D array.
# Your current memory usage
df.memory_usage(deep=True).sum()
1352
# Create a new DataFrame (really just overwrite `df` but keep separate for illustration)
df1 = pd.concat([df['id'], pd.DataFrame(df['data'].tolist())], axis=1)
# id 0 1 2
#0 1 0.43 0.32 0.19
#1 1 0.41 0.11 0.21
#2 2 0.94 0.35 0.14
#3 2 0.78 0.92 0.45
#4 3 0.32 0.63 0.48
#5 3 0.17 0.12 0.15
#6 3 0.54 0.12 0.16
#7 4 0.48 0.16 0.19
#8 4 0.14 0.47 0.01
Yes, this looks bigger, but in terms of memory it is actually smaller. The roughly 3x saving here is a bit extreme; for larger DataFrames with long arrays the flat version will probably use more like 95% of the original memory, but it will still be less.
df1.memory_usage(deep=True).sum()
#416
And now your aggregation is a normal groupby + mean; the columns give the position within the array.
df1.groupby('id').mean()
# 0 1 2
#id
#1 0.420000 0.215 0.200000
#2 0.860000 0.635 0.295000
#3 0.343333 0.290 0.263333
#4 0.310000 0.315 0.100000
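If you need the result back as one array per id (as in the question's desired output), you can collapse the columns again, e.g. with the same np.array aggregation used above (a sketch):
df1.groupby('id').mean().agg(np.array, axis=1)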
Group-by mean for arrays, where the output is an array of element-wise means:
df['data'].map(np.array).groupby(df['id']).mean().reset_index()
Output:
id data
0 1 [0.42, 0.215, 0.2]
1 2 [0.86, 0.635, 0.29500000000000004]
2 3 [0.3433333333333333, 0.29, 0.26333333333333336]
3 4 [0.31, 0.315, 0.1]
You can always .apply the numpy mean.
df.groupby('id')['data'].apply(np.mean).apply(np.mean)
# returns:
id
1 0.278333
2 0.596667
3 0.298889
4 0.241667
Name: data, dtype: float64

Increase value of several rows based on condition fulfilling all rows

I have a pandas dataframe with three columns and want to multiply/increase the float numbers of each row by the same factor until the sum of all three cells (one row) fulfils the criterion (a value equal to or greater than 0.9).
import pandas as pd

df = pd.DataFrame({'A': [0.03, 0.0, 0.4],
                   'B': [0.1234, 0.4, 0.333],
                   'C': [0.5, 0.4, 0.0333]})
Outcome:
The cells in each row should be multiplied so that the sum of all three cells of each row is 0.9. (The sums below are not exactly 0.9 because I only approximated them with simple multiplication by hand; the actual outcome should reach 0.9.) It is important that cells which are 0 stay 0.
print (df)
A B C
0 0.0414 0.170292 0.690000
1 0.0000 0.452000 0.452000
2 0.4720 0.392940 0.039294
You can take the sum on axis=1, subtract it from 0.9, then divide by df.shape[1] and add the result back:
df.add((0.9 - df.sum(axis=1)) / df.shape[1], axis=0)
A B C
0 0.112200 0.205600 0.582200
1 0.033333 0.433333 0.433333
2 0.444567 0.377567 0.077867
You want to apply a scaling function along the rows:
def scale(xs, target=0.9):
    """Scale the features such that their sum equals the target."""
    xs_sum = xs.sum()
    if xs_sum < target:
        return xs * (target / xs_sum)
    else:
        return xs

df.apply(scale, axis=1)
For example:
df = pd.DataFrame({'A': [0.03, 0.0, 0.4],
                   'B': [0.1234, 0.4, 0.333],
                   'C': [0.5, 0.4, 0.0333]})
df.apply(scale, axis=1)
Should give:
A B C
0 0.041322 0.169972 0.688705
1 0.000000 0.450000 0.450000
2 0.469790 0.391100 0.039110
The rows of that dataframe all sum to 0.9:
df.apply(scale, axis=1).sum(axis=1)
0 0.9
1 0.9
2 0.9
dtype: float64
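The same logic can also be written without apply; a sketch where clip(lower=1) leaves rows that already reach the target untouched:
df.mul((0.9 / df.sum(axis=1)).clip(lower=1), axis=0)
Because rows are only multiplied by a factor, cells that are 0 stay 0, as required.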

Pandas DataFrame Conditional Groupby

I have this DF:
import pandas as pd

df = pd.DataFrame(data=[[-2.000000, -1.958010, 0.2],
                        [-1.958010, -1.916030, 0.4],
                        [-1.916030, -1.874040, 0.3],
                        [-1.874040, -1.832050, 0.6],
                        [-1.832050, -1.790070, 0.8],
                        [-1.790070, -1.748080, 0.2]],
                  columns=['egystart', 'egyend', 'fx'])
So I want to group by every two rows and get fx as the mean value of the two rows. egystart should be the egystart of the first row and egyend should be the egyend of the second row.
In this case I should obtain:
-2.000000 -1.916030 0.3
-1.916030 -1.832050 0.45
-1.832050 -1.748080 0.5
So I have tried something like this:
df.groupby((df.egystart == df.egyend.shift(1)).cumsum()).agg({'egystart':min, 'egyend':max, 'fx':HERE_THE_MEAN_VALUE})
But it doesn't work.
You could try this to get the mean of fx every 2 rows:
import numpy as np

result = df.groupby(np.arange(len(df)) // 2).mean()
print(result)
egystart egyend fx
0 -1.979005 -1.937020 0.30
1 -1.895035 -1.853045 0.45
2 -1.811060 -1.769075 0.50
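If you also need egystart from the first row and egyend from the last row of each pair (as in the desired output), the same grouping key can be combined with the agg dict from the question; a sketch (min/max work here because the values are ordered, 'first'/'last' would do equally well):
df.groupby(np.arange(len(df)) // 2).agg({'egystart': 'min', 'egyend': 'max', 'fx': 'mean'})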

Convert whole Pandas dataframe containing NaN values from string to float

I would like to convert all the values in a pandas dataframe from strings to floats. My dataframe contains various NaN values (e.g. NaN, NA, None). For example,
import pandas as pd
import numpy as np
my_data = np.array([[0.5, 0.2, 0.1], ["NA", 0.45, 0.2], [0.9, 0.02, "N/A"]])
df = pd.DataFrame(my_data, dtype=str)
I have found here and here (among other places) that convert_objects might be the way to go. However, I get a message that it is deprecated (I am using Pandas 0.17.1) and should instead use to_numeric.
df2 = df.convert_objects(convert_numeric=True)
Output:
FutureWarning: convert_objects is deprecated. Use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
But to_numeric doesn't seem to actually convert the strings.
df3 = pd.to_numeric(df, errors='coerce')
Output:
df2:
0 1 2
0 0.5 0.20 0.1
1 NaN 0.45 0.2
2 0.9 0.02 NaN
df2 dtypes:
0 float64
1 float64
2 float64
dtype: object
df3:
0 1 2
0 0.5 0.2 0.1
1 NA 0.45 0.2
2 0.9 0.02 N/A
df3 dtypes:
0 object
1 object
2 object
dtype: object
Should I use convert_objects and deal with the warning message, or is there a proper way to do what I want with to_numeric?
Strangely this works:
In [11]:
df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
Out[11]:
0 1 2
0 0.5 0.20 0.1
1 NaN 0.45 0.2
2 0.9 0.02 NaN
It seems that it's not able to coerce the entire df for some reason, which is a little surprising.
If you hate typing (thanks to @Zero) then you can just use:
df.apply(pd.to_numeric, errors='coerce')
You can try replace and astype:
import pandas as pd
import numpy as np
my_data = np.array([[0.5, 0.2, 0.1], ["NA", 0.45, 0.2], [0.9, 0.02, "N/A"]])
df = pd.DataFrame(my_data, dtype=str)
print(df.replace({r'N': np.nan}, regex=True).astype(float))
0 1 2
0 0.5 0.20 0.1
1 NaN 0.45 0.2
2 0.9 0.02 NaN
