Summarize a column in pandas data frame based on other columns - python

I have a small data frame tbl:
         CatAreaSqKm  CatMean  CatPctFull  CatCount       CatSum
COMID
1861888       0.2439   0.0000    0.000000         0     0.000000
1862004       0.4050  27.9765   18.222222        82  2294.072964
1862014       0.0720  27.9765   28.750000        23   643.459490

         UpCatAreaSqKm  UpCatMean  UpCatPctFull  UpCatCount      UpCatSum
COMID
1861888    105360.5349  29.177349     97.901832   114610993  3.344045e+09
1862004    105445.4517  29.174944     97.902537   114704191  3.346488e+09
1862014    105360.2127  29.177349     97.902093   114610948  3.344044e+09
I want to do the following operation:
tbl['WsMean'] = ((tbl.CatSum + tbl.UpCatSum)/(tbl.CatCount + tbl.UpCatCount))
However, if CatCount + UpCatCount is zero I will be dividing by zero, so for that particular row I want to set 'WsMean' to zero, while for the others I would like it to keep the value computed by the statement above. How can I do this? I can only think of a statement like:
tbl['WsMean'] = 0
but that would blanket all records in the table with 0.
Any ideas? Thanks

Dividing zero by zero results in a NaN value. You could use fillna(0) to replace the NaNs with zeros:
tbl['WsMean'] = ((tbl.CatSum + tbl.UpCatSum)/(tbl.CatCount + tbl.UpCatCount)).fillna(0)
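Strictly speaking, 0/0 gives NaN in pandas while a nonzero sum divided by a zero count gives inf; if both cases can occur in your data, a slightly more defensive sketch is the following (the small frame below is a made-up stand-in for tbl):
import numpy as np
import pandas as pd

# minimal stand-in for tbl, with one all-zero-count row
tbl = pd.DataFrame({'CatSum': [0.0, 2294.072964], 'CatCount': [0, 82],
                    'UpCatSum': [0.0, 3.346488e+09], 'UpCatCount': [0, 114704191]})

ratio = (tbl.CatSum + tbl.UpCatSum) / (tbl.CatCount + tbl.UpCatCount)
tbl['WsMean'] = ratio.replace([np.inf, -np.inf], 0).fillna(0)
print(tbl['WsMean'])  # row 0 is 0.0 instead of NaN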

Related

Iterate over rows of a dataframe based on index in python

I am trying to build a loop that iterates over the rows of several DataFrames in order to create two new columns. The original dataframes contain two columns (time, velocity), which can vary in length and are stored in nested dictionaries. Here is an example of one of them:
time velocity
0 0.000000 0.136731
1 0.020373 0.244889
2 0.040598 0.386443
3 0.060668 0.571861
4 0.080850 0.777680
5 0.101137 1.007287
6 0.121206 1.207533
7 0.141284 1.402833
8 0.161388 1.595385
9 0.181562 1.762003
10 0.201640 1.857233
11 0.221788 2.006104
12 0.241866 2.172649
The two new columns should be a normalization of the 'time' and 'velocity' columns, respectively. Each row of the new columns should therefore be equal to the following transformation:
t_norm = (time(n) - time(n-1)) / (time(max) - time(min))
vel_norm = (velocity(n) - velocity(n-1)) / (velocity(max) - velocity(min))
Also, the first value of the two new columns should be set to 0.
My problem is that I don't know how to properly tell Python to access the n and n-1 values to perform such operations, and I don't know whether that could be done using pd.DataFrame.iterrows() or .iloc.
I have come up with the following piece of code, but it is missing the crucial parts:
for nested_dict in dict_all_raw.values():
    for dflist in nested_dict.values():
        dflist['t_norm'] = ? / (dflist['time'].max() - dflist['time'].min())
        dflist['vel_norm'] = ? / (dflist['velocity'].max() - dflist['velocity'].min())
        dflist['acc_norm'] = dflist['vel_norm'] / dflist['t_norm']
Any help is welcome..! :)
If you just want to normalise, you can write the expression directly, using Series.min and Series.max:
m = df['time'].min()
df['normtime'] = (df['time'] - m) / (df['time'].max() - m)
However, if you want the difference between successive elements, you can use Series.diff:
df['difftime'] = df['time'].diff() / (df['time'].max() - df['time'].min())
Testing:
df = pd.DataFrame({'time': [0.000000, 0.020373, 0.040598], 'velocity': [0.136731, 0.244889, 0.386443]})
print(df)
# time velocity
# 0 0.000000 0.136731
# 1 0.020373 0.244889
# 2 0.040598 0.386443
m = df['time'].min()
df['normtime'] = (df['time'] - m) / (df['time'].max() - m)
df['difftime'] = df['time'].diff() / (df['time'].max() - df['time'].min())
print(df)
# time velocity normtime difftime
# 0 0.000000 0.136731 0.000000 NaN
# 1 0.020373 0.244889 0.501823 0.501823
# 2 0.040598 0.386443 1.000000 0.498177
You can use Series.shift to create lagged columns:
df['time_n-1'] = df['time'].shift(1)
As for "the first value of the two new columns should be set to 0": use df['column'] = df['column'].fillna(0) after your calculations.
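Putting both answers together for the nested-dict structure from the question, a sketch (the dict layout below is a made-up stand-in for dict_all_raw):
import pandas as pd

df = pd.DataFrame({'time': [0.0, 0.020373, 0.040598],
                   'velocity': [0.136731, 0.244889, 0.386443]})
dict_all_raw = {'subject_A': {'trial_1': df}}  # hypothetical layout

for nested_dict in dict_all_raw.values():
    for dflist in nested_dict.values():
        t_range = dflist['time'].max() - dflist['time'].min()
        v_range = dflist['velocity'].max() - dflist['velocity'].min()
        # diff() gives x(n) - x(n-1); the first row is NaN, so fill it with 0
        dflist['t_norm'] = (dflist['time'].diff() / t_range).fillna(0)
        dflist['vel_norm'] = (dflist['velocity'].diff() / v_range).fillna(0)
Note that the questioner's acc_norm = vel_norm / t_norm line would then divide 0 by 0 on the first row, which yields NaN again.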

Calculate weighted sum using two columns in pandas dataframe

I am trying to calculate a weighted sum using two columns in a pandas dataframe.
Dataframe structure:
unique_id weight value
1 0.061042375 20.16094523
1 0.3064548 19.50932003
1 0.008310739 18.76469039
1 0.624192086 21.25
2 0.061042375 20.23776924
2 0.3064548 19.63366165
2 0.008310739 18.76299395
2 0.624192086 21.25
.......
Output I desired is:
Weighted sum for each unique_id = sum((weight) * (value))
Example: Weighted sum for unique_id 1 = ( (0.061042375 * 20.16094523) + (0.3064548 * 19.50932003) + (0.008310739 * 18.76469039) + (0.624192086 * 21.25) )
I checked out this answer (Calculate weighted average using a pandas/dataframe) but could not figure out the correct way of applying it to my specific scenario.
This is what I am doing based on the above answer:
#Assume temp_weighted_sum_dataframe is the dataframe stated above
grouped_data = temp_weighted_sum_dataframe.groupby('unique_id') #I think this groups data based on unique_id values
weighted_sum_output = (grouped_data.weight * grouped_data.value).transform("sum") #This should allow me to multiple weight and value for every record within each group and sum it up to one value for that group.
# On above line I am getting the error > TypeError: unsupported operand type(s) for *: 'SeriesGroupBy' and 'SeriesGroupBy'
Any help is appreciated, thanks
The accepted answer in the linked question would indeed solve your problem. However, I would solve it differently with just one groupby:
u = (df.assign(s=df['weight'] * df['value'])
       .groupby('unique_id')
       [['s', 'weight']]
       .sum()
     )
u['s'] / u['weight']
Output:
unique_id
1 20.629427
2 20.672208
dtype: float64
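One caveat worth noting: u['s'] / u['weight'] is a weighted average (it divides by the per-group weight total). It coincides with the desired weighted sum here only because each unique_id's weights happen to sum to 1, which is easy to verify:
import pandas as pd

df = pd.DataFrame({
    'unique_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'weight': [0.061042375, 0.3064548, 0.008310739, 0.624192086] * 2,
    'value': [20.16094523, 19.50932003, 18.76469039, 21.25,
              20.23776924, 19.63366165, 18.76299395, 21.25],
})
print(df.groupby('unique_id')['weight'].sum())  # 1.0 for each group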
Alternatively, you could do it this way:
df['partial_sum'] = df['weight']*df['value']
out = df.groupby('unique_id')['partial_sum'].agg('sum')
output:
unique_id
1 20.629427
2 20.672208
Or:
df['weight'].mul(df['value']).groupby(df['unique_id']).sum()
which gives the same output.
You may also take advantage of agg together with @ (the matrix-multiplication operator, i.e. a dot product):
df.groupby('unique_id')[['weight']].agg(lambda x: x.weight @ x.value)
(This relies on the whole group being passed to the lambda, which can vary across pandas versions.)
Out[24]:
weight
unique_id
1 20.629427
2 20.672208

combine pandas apply results as multiple columns in a single dataframe

Summary
Suppose that you apply a function to a groupby object, so that g.apply for every group g in df.groupby(...) gives you a series/dataframe. How do I combine these results into a single dataframe, with the group names as columns?
Details
I have a dataframe event_df that looks like this:
index event note time
0 on C 0.5
1 on D 0.75
2 off C 1.0
...
I want to create a sampling of the event for every note, and the sampling is done at times as given by t_df:
index t
0 0
1 0.5
2 1.0
...
So that I'd get something like this.
t C D
0 off off
0.5 on off
1.0 off on
...
What I've done so far:
def get_t_note_series(notedata_row, t_arr):
    """Return the time index in the sampling that corresponds to the event."""
    t_idx = np.argwhere(t_arr >= notedata_row['time']).flatten()[0]
    return t_idx

def get_t_for_gb(group, **kwargs):
    t_idxs = group.apply(get_t_note_series, args=(t_arr,), axis=1)
    t_idxs.rename('t_arr_idx', inplace=True)
    group_with_t = pd.concat([group, t_idxs], axis=1).set_index('t_arr_idx')
    print(group_with_t)
    return group_with_t

t_arr = np.arange(0, 10, 0.5)
t_df = pd.DataFrame({'t': t_arr}).rename_axis('t_arr_idx')
gb = event_df.groupby('note')
gb.apply(get_t_for_gb, **kwargs)
So what I get is one dataframe per note, all of the same size (same as t_df):
t event
0 on
0.5 off
...
t event
0 off
0.5 on
...
How do I go from here to my desired dataframe, with each group corresponding to a column in a new dataframe, and the index being t?
EDIT: sorry, I didn't take into account that you rescale your time column, and I can't present a whole solution now because I have to leave. But I think you could do the rescaling by using pandas.merge_asof on your two dataframes to get the nearest "rescaled" time, and from the merged dataframe you could try the code below. I hope this is what you wanted.
import pandas as pd
import io

sio = io.StringIO("""index event note time
0 on C 0.5
1 on D 0.75
2 off C 1.0""")
df = pd.read_csv(sio, sep='\s+', index_col=0)
df.groupby(['time', 'note']).agg({'event': 'first'}).unstack(-1).fillna('off')
Take the first row in each time-note group via agg({'event': 'first'}), then unstack the note index level so the note values become columns. At the end, fill all cells for which no data point exists with 'off' via fillna.
This outputs:
Out[28]:
event
note C D
time
0.50 on off
0.75 off on
1.00 off off
You might also want to try min or max in case on/off is ambiguous for a time/note combination (i.e. there are multiple rows for the same time/note, some on and some off) and you prefer one of these values (say, if there is a single on, you want on no matter how many offs there are). If you want something like a majority vote, I would suggest adding a majority-vote column to the aggregated dataframe (before the unstack()).
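A sketch of the merge_asof idea mentioned above, assuming the three-row sample and a 0.5-spaced sampling grid (how an exact tie between two grid points snaps is left to pandas' 'nearest' rule):
import io
import numpy as np
import pandas as pd

event_df = pd.read_csv(io.StringIO("""index event note time
0 on C 0.5
1 on D 0.75
2 off C 1.0"""), sep=r'\s+', index_col=0)
t_df = pd.DataFrame({'t': np.arange(0, 2.0, 0.5)})

# snap each event's time to the nearest sampling time in t_df
merged = pd.merge_asof(event_df.sort_values('time'), t_df,
                       left_on='time', right_on='t', direction='nearest')
print(merged.pivot_table(index='t', columns='note', values='event',
                         aggfunc='first').fillna('off'))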
Oh so I found it! All I had to do was to unstack the groupby results. Going back to generating the groupby result:
def get_t_note_series(notedata_row, t_arr):
    """Return the time index in the sampling that corresponds to the event."""
    t_idx = np.argwhere(t_arr >= notedata_row['time']).flatten()[0]
    return t_idx

def get_t_for_gb(group, **kwargs):
    t_idxs = group.apply(get_t_note_series, args=(t_arr,), axis=1)
    t_idxs.rename('t_arr_idx', inplace=True)
    group_with_t = pd.concat([group, t_idxs], axis=1).set_index('t_arr_idx')
    # print(group_with_t)  # unnecessary!
    return group_with_t

t_arr = np.arange(0, 10, 0.5)
t_df = pd.DataFrame({'t': t_arr}).rename_axis('t_arr_idx')
gb = event_df.groupby('note')
result = gb.apply(get_t_for_gb, **kwargs)
At this point, result is a dataframe with note as an index:
>> print(result)
event
note t
C 0 off
0.5 on
1.0 off
....
D 0 off
0.5 off
1.0 on
....
Doing result = result.unstack('note') does the trick:
>> result = result.unstack('note')
>> print(result)
      event
note      C    D
t
0       off  off
0.5      on  off
1.0     off   on
....
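For reference, here is the core of the trick in isolation, as a minimal sketch: unstack moves one level of a MultiIndex into the columns.
import pandas as pd

result = pd.DataFrame(
    {'event': ['off', 'on', 'off', 'off', 'off', 'on']},
    index=pd.MultiIndex.from_product([['C', 'D'], [0.0, 0.5, 1.0]],
                                     names=['note', 't']))
print(result.unstack('note'))
#      event
# note     C    D
# t
# 0.0    off  off
# 0.5     on  off
# 1.0    off   on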

Python: cohorts analysis about calculating ARPU

I have a dataframe with one column: revenue_sum
revenue_sum
10000.0
12324.0
15534.0
26435.0
45623.0
56736.0
56353.0
And I want to write a function that creates all the new columns at once, each showing a sum of revenues.
For example, the first row of 'revenue_1' should show the sum of the first two floats in revenue_sum;
the second row of 'revenue_1' should show the sum of the 2nd and 3rd floats.
The first row of 'revenue_2' should show the sum of the first 3 floats in revenue_sum:
revenue_sum revenue_1 revenue_2
10000.0 22324.0 37858.0
12324.0 27858.0 54293.0
15534.0 41969.0 87592.0
26435.0 72058.0 128794.0
45623.0 102359.0 158712.0
56736.0 113089.0 NaN
56353.0 NaN NaN
Here is my code:
df_revenue_sum1 = df_revenue_sum1.iloc[::-1]
len_sum1 = len(df_revenue_sum1) + 1

def func(df_revenue_sum1):
    for i in range(1, len_sum1):
        df_revenue_sum1['revenue_' + str(i)] = df_revenue_sum1['revenue_sum'].rolling(i + 1).sum()
    return df_revenue_sum1

df_revenue_sum1 = df_revenue_sum1.applymap(func)
And it shows the error:
"'float' object is not subscriptable", 'occurred at index revenue_sum'
I think there might be an easier way to do this without a for loop. The pandas method rolling (http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rolling.html) might do what you need: it sums along a sliding window controlled by the min_periods and window parameters. min_periods is the minimum number of values the window must contain to produce a result; window is the maximum number of values it will sum. Applying this works as follows:
import pandas as pd
# The dataframe provided
d = {
'revenue_sum': [
10000.0,
12324.0,
15534.0,
26435.0,
45623.0,
56736.0,
56353.0
]
}
# Reverse the dataframe because rolling only looks backwards and
# we want a rolling window looking forward
d1 = pd.DataFrame(data=d)
df = d1[::-1].copy()  # copy so the new columns are not written onto a slice
# apply rolling summing 2 at a time
df['revenue_1'] = df['revenue_sum'].rolling(min_periods=2, window=2).sum()
# apply rolling window 3 at a time
df['revenue_2'] = df['revenue_sum'].rolling(min_periods=3, window=3).sum()
print(df[::-1])
This gave me the following dataframe:
revenue_sum revenue_1 revenue_2
0 10000.0 22324.0 37858.0
1 12324.0 27858.0 54293.0
2 15534.0 41969.0 87592.0
3 26435.0 72058.0 128794.0
4 45623.0 102359.0 158712.0
5 56736.0 113089.0 NaN
6 56353.0 NaN NaN
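To create every revenue_i column at once, as the question asked, the same rolling idea can simply be looped; a sketch:
import pandas as pd

df = pd.DataFrame({'revenue_sum': [10000.0, 12324.0, 15534.0, 26435.0,
                                   45623.0, 56736.0, 56353.0]})[::-1].copy()
# revenue_i sums i+1 consecutive values, looking forward in the original order
for i in range(1, len(df)):
    df[f'revenue_{i}'] = df['revenue_sum'].rolling(min_periods=i + 1, window=i + 1).sum()
print(df[::-1])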

Pandas dataframe values not changing

I am currently getting my toes wet with neural nets, using Colaboratory, pandas and Keras. To set up my data, I need to normalize it (I get all values between 0 and 1 by dividing each column by its largest value). However, I've run into 2 issues.
For some reason, the column "stroke_count" isn't being modified, or if it is, it's being rounded down to 0 no matter what.
I saw that
df.fillna(7)
supposedly replaces all Null or NaN values with the value inside the parentheses, but it isn't doing that.
# generating character dictionary & normalizing data
hanzi_dict = {}
hanzi_counter = 0
df.fillna(7)
for index, row in df.iterrows():
    hanzi_dict[str(hanzi_counter)] = row['charcter']
    hanzi_counter = hanzi_counter + 1
    df.at[index, 'radical_code'] = row['radical_code'] / 214.9  # max value of any radical
    df.at[index, 'stroke_count'] = row['stroke_count'] / 35.0   # max # of strokes
    df.at[index, 'hsk_levl'] = row['hsk_levl'] / 7              # max level + 1
print(hanzi_dict)
display(df)
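Two points may explain both issues (assumed from the code shown, since no column dtypes are given): DataFrame.fillna returns a new DataFrame unless you assign it back or pass inplace=True, and assigning a float like row['stroke_count'] / 35.0 into an integer-dtyped column via df.at can be silently truncated to 0 in some pandas versions. A vectorized sketch avoiding both pitfalls, with hypothetical sample values:
import pandas as pd

# hypothetical frame using the column names from the question
df = pd.DataFrame({'charcter': ['一', '人', '刀'],
                   'radical_code': [1.0, 9.0, 18.0],
                   'stroke_count': [1, 2, 2],      # integer dtype
                   'hsk_levl': [1, 1, None]})

df = df.fillna(7)                                  # fillna is not in-place: assign it back
df['radical_code'] = df['radical_code'] / 214.9
df['stroke_count'] = df['stroke_count'].astype(float) / 35.0  # cast before dividing
df['hsk_levl'] = df['hsk_levl'] / 7

hanzi_dict = {str(i): ch for i, ch in enumerate(df['charcter'])}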
