Aggregating with a custom function in pandas - python

I have a dataframe like the following:
Label    Indicator    Value1    Value2
A        77           50        50
A        776          60        70
A        771          70        40
A        7            80        50
A        7775         90        40
B        776          100       40
B        771          41        50
B        775          54        40
B        7775         55        50
What I want is an output like this:
Label    aggregation1            aggregation2
A        aggregation1_A_value    aggregation2_A_value
B        aggregation1_B_value    aggregation2_B_value
The way I want to aggregate the values is the following (example):
aggregation1 = Value1 of indicators starting with 77 (but not 776), minus Value2 of indicators 776 and 775.
What I am doing now is the following: I split the Indicator into several columns to get a new dataframe:
Label    Indicator0    Indicator1    Indicator2    ...
A        7             77            77            ...
A        7             77            776           ...
A        7             77            771           ...
...      ...           ...           ...           ...
B        7             77            777           ...
aggregation1_A = df.query("Label == 'A' and Indicator1 in ['77'] and Indicator2 not in ['776']")["Value1"].sum()
aggregation1_A -= df.query("Label == 'A' and Indicator2 in ['776', '775']")["Value2"].sum()
My issue is that I have more than 70,000 different labels and about 20 aggregations to run.
The dataframe is about 500 MB.
I am wondering if there is a better way. I had a look at pandas UDFs and at applying a custom aggregation function, but I haven't succeeded so far.
Thank you for your help.

You can use get_dummies to replace the step where you split your indicator into separate columns. Then you can use those boolean values to carry out your aggregations:
dummies = pd.get_dummies(df, columns=['Indicator'])

def agg_1(g):
    ret = g.apply(lambda x: x['Value1'] * x[['Indicator_77', 'Indicator_771', 'Indicator_7775']], axis=1).sum().sum()
    ret -= g.apply(lambda x: x['Value2'] * x[['Indicator_775', 'Indicator_776']], axis=1).sum().sum()
    return ret

dummies.groupby('Label').apply(agg_1)
The lambda functions are just multiplying the values by whether or not the relevant indicators are in that row. The sum().sum() flattens the result of that multiplication into a scalar.
To run all of your aggregations at once, have the function passed to apply return a pd.Series with one entry per aggregation.
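Row-wise apply can still be slow with 70,000 labels, since it builds a Series per row. As a sketch of a fully vectorized alternative (assuming Indicator is stored as a string), you can express each aggregation as boolean masks on the original column, combine them into a signed per-row contribution, and reduce everything with one groupby:
import pandas as pd

# df has columns: Label, Indicator (string), Value1, Value2
ind = df['Indicator'].astype(str)

# aggregation1: Value1 where the indicator starts with 77 (but is not 776),
# minus Value2 where the indicator is 776 or 775
mask_plus = ind.str.startswith('77') & (ind != '776')
mask_minus = ind.isin(['776', '775'])

contrib = df['Value1'].where(mask_plus, 0) - df['Value2'].where(mask_minus, 0)
aggregation1 = contrib.groupby(df['Label']).sum()
Each of the 20 aggregations becomes one such contribution Series, so the whole job is roughly 20 vectorized passes instead of one apply call per label.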

Related

Numpy Vectorized Window Operations

I'm interested in figuring out how to do vectorized computations in a numpy array / pandas dataframe where each new cell is updated with local information.
For example, let's say I'm a weatherman interested in making predictions about the weather. My prediction algorithm will be the mean of the past 3 days. While this prediction is simple, I'd like to be able to do this with an arbitrary function.
Example data:
day temp
1 70
2 72
3 68
4 67
...
After the transformation, this should become
day temp prediction
1 70 None (no previous data)
2 72 70 (only one data point)
3 68 71 (two data points)
4 67 70
5 70 69
...
I'm only interested in the prediction column, so there's no need to join the data back together after computing the prediction! Thanks!
Use rolling with a window of 3 and min_periods=1, then shift the result by one so each prediction uses only prior days:
df['prediction'] = df['temp'].rolling(window=3, min_periods=1).mean().shift()
df
day temp prediction
0 1 70 NaN
1 2 72 70
2 3 68 71
3 4 67 70
4 5 70 69
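Since the question asks for arbitrary functions, the same pattern works with rolling(...).apply(), which hands each window to a custom callable (slower than the built-in mean, but fully general). A sketch with a hypothetical recency-weighted prediction:
import numpy as np

def weighted_mean(window):
    # with raw=True the window arrives as a numpy array;
    # weight more recent days more heavily
    weights = np.arange(1, len(window) + 1)
    return np.average(window, weights=weights)

df['prediction'] = df['temp'].rolling(window=3, min_periods=1).apply(weighted_mean, raw=True).shift()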

Find occurrences where a column from one dataframe equals another, based on a condition

I have the following two dataframes, which have different sizes: df1 (966 rows x 2 cols) and df2 (36 rows x 2 cols), where
df1:
Video_# Selected Joint.1
484 1 Left_shoulder
778 1 Left_shoulder
418 1 Right_shoulder
964 1 Right_shoulder
193 1 Right_shoulder
... ... ...
285 36 Right_elbow
267 36 Left_hand
216 36 Shoulder_centre
139 36 Right_shoulder
df2:
Video_# Ann.1
0 1 Shoulder_center
1 2 Head
2 3 Right_hip
... ... ...
33 34 Left_knee
34 35 Right_knee
35 36 Right_shoulder
Video_# goes from 1-36. In df2 the Video_# column contains just a single occurrence of each value, so 1-36 appears exactly once, whereas in df1 there are multiple occurrences of each, and not the same number for each (I hope that makes sense).
What I want to count is the number of occurrences where df1['Selected Joint.1'] == df2['Ann.1'], grouped by Video_#. So the expected output is (e.g.):
Video_# Equality Occurrences
1 3
2 5
... ... ...
36 6
Is that possible?
Use DataFrame.merge with GroupBy.size:
df = (df1.merge(df2,
                left_on=['Video_#', 'Selected Joint.1'],
                right_on=['Video_#', 'Ann.1'])
         .groupby('Video_#')
         .size()
         .reset_index(name='Equality Occurrences'))
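One caveat (an assumption about the desired output): videos with zero matches drop out of the merge entirely. If every Video_# from df2 should appear in the result, reindex the grouped sizes before resetting the index:
sizes = (df1.merge(df2,
                   left_on=['Video_#', 'Selected Joint.1'],
                   right_on=['Video_#', 'Ann.1'])
            .groupby('Video_#')
            .size())

# videos with no matching joint keep a count of 0 instead of disappearing
df = (sizes.reindex(df2['Video_#'].unique(), fill_value=0)
           .rename_axis('Video_#')
           .reset_index(name='Equality Occurrences'))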

DataFrame: add a Total column that holds each row's sum

I am new to Python and I have a question. I have an exported .csv with values, and I want to sum each row's values and put the total in a new column.
I've tried the following, but it doesn't work.
import pandas as pd
wine = pd.read_csv('testelek.csv', 'rb', delimiter=';')
wine['Total'] = [wine[row].sum(axis=1) for row in wine]
I want to make my DataFrame like this.
101 102 103 104 .... Total
__________________________________________________________________________
0 80 84 86 78 .... 328
1 78 76 77 79 .... 310
2 79 81 88 83 .... 331
3 70 85 89 84 .... 328
4 78 84 88 85 .... 335
You can bypass the need for the list comprehension and just use the axis=1 parameter to get what you want.
wine['Total'] = wine.sum(axis=1)
A nice way to do this is by using .apply().
Suppose that you want to create a new column named Total by adding the values per row for columns named 101, 102, and 103; you can try the following:
wine['Total'] = wine.apply(lambda row: sum([row['101'], row['102'], row['103']]), axis=1)
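If only a subset of columns should count toward the total, the same row-wise sum works without apply by selecting those columns first (assuming the headers really are the strings '101', '102' and '103'):
wine['Total'] = wine[['101', '102', '103']].sum(axis=1)
This stays vectorized, so it scales better than the per-row lambda.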

I need help building a new dataframe from an old one, by applying a method to each row, keeping the same index and columns

I have a dataframe (df_input), and I'm trying to convert it to another dataframe (df_output) by applying a formula to each element in each row. The formula requires information about the whole row (min, max, median).
df_input:
A B C D E F G H I J
2011-01-01 60 48 26 29 41 91 93 87 39 65
2011-01-02 88 52 24 99 1 27 12 26 64 87
2011-01-03 13 1 38 60 8 50 59 1 3 76
df_output:
F(A) F(B) F(C) F(D) F(E) F(F) F(G) F(H) F(I) F(J)
2011-01-01 93 54 45 52 8 94 65 37 2 53
2011-01-02 60 44 94 62 78 77 37 97 98 76
2011-01-03 53 58 16 63 60 9 31 44 79 35
I'm trying to go from df_input to df_output, as above, by applying f(x) to each cell per row. The function foo maps element x to f(x) by doing an OLS regression of the min, median and max of the row against some coordinates; this is done for each period.
I know that I need to iterate over the rows and then, for each row, apply the function to each element. Where I am struggling is getting the output of foo into df_output.
for index, row in df_input.iterrows():
    row_min = row.min()
    row_max = row.max()
    row_mean = row.mean()
    # apply function to row
    new_row = row.apply(lambda x: foo(x, row_min, row_max, row_mean))
    # add this to df_output
Help!
My current thinking is to build up the new df row by row, but when I try that I get a lot of multi-index columns etc. Any pointers would be great.
Thanks so much... merry Xmas to you all.
Consider calculating the row-wise aggregates with DataFrame.* methods first, restricted to the original columns (otherwise each new aggregate column would feed into the next one's calculation), and then pass those series into a DataFrame.apply() across columns:
# ROW-WISE AGGREGATES (computed over the original columns only)
cols = list('ABCDEFGHIJ')
df['row_min'] = df[cols].min(axis=1)
df['row_max'] = df[cols].max(axis=1)
df['row_mean'] = df[cols].mean(axis=1)
# COLUMN-WISE CALCULATION (DEFAULT axis=0)
new_df = df[cols].apply(lambda col: foo(col,
                                        df['row_min'],
                                        df['row_max'],
                                        df['row_mean']))
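If you'd rather keep the explicit loop from the question, collecting each transformed row into a dict and building the frame once at the end preserves the original index and avoids the multi-index trouble. A sketch, with a hypothetical stand-in for foo (the real one does an OLS fit):
import pandas as pd

def foo(x, row_min, row_max, row_mean):
    # hypothetical stand-in: rescale x into [0, 100] using the row's range
    return 100 * (x - row_min) / (row_max - row_min)

rows = {}
for index, row in df_input.iterrows():
    row_min, row_max, row_mean = row.min(), row.max(), row.mean()
    rows[index] = row.apply(lambda x: foo(x, row_min, row_max, row_mean))

df_output = pd.DataFrame.from_dict(rows, orient='index')
df_output.columns = ['F(' + c + ')' for c in df_input.columns]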

Join dataframe with matrix output using pandas

I am trying to translate the input dataframe (inp_df) to the output dataframe (out_df) using the data from the cell-based intermediate dataframe (matrix_df), as shown below.
There are several cell-number-based files with the distance values shown in matrix_df.
The program iterates by cell and fetches data from the appropriate file, so each time matrix_df holds the data for all rows of the current cell# that we are iterating over in inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell#'s data is to:
use an apply function to go row by row,
then use a join on column B between inp_df and matrix_df, where matrix_df is somehow translated into a tuple of column name, distance and average distance.
But I am looking for an idiomatic pandas way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside the iteration that fetches the matches, since the number of columns in matrix_df varies per cell.
If it's any help, the matrix files are the distance-based outputs from sklearn.metrics.pairwise.pairwise_distances.
NB: in inp_df the values of column B are unique, while the values of column A may or may not be unique.
Also, the first column of each matrix_df had an empty header, and I renamed it with the following code for ease of understanding, since it is a header-less matrix output file.
dist_df = pd.read_csv(mypath,index_col=False)
dist_df.rename(columns={'Unnamed: 0':'B'}, inplace=True)
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
              [['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47
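If the per-row apply in step 2 becomes the bottleneck at millions of rows, one alternative sketch (using the same column layout, and relying on the stated fact that B values are unique) is to melt the matrices into long form first, so the distance lookup turns into a plain merge:
import pandas as pd

# wide matrix -> one (A, B, distance, avg_distance) row per entry
long_df = (pd.concat([matrix_df1, matrix_df2])
             .melt(id_vars=['B', 'avg_distance'],
                   var_name='A', value_name='distance')
             .dropna(subset=['distance']))
long_df['A'] = long_df['A'].astype(int)

out_df = inp_df.merge(long_df, on=['A', 'B'])[
    ['A', 'B', 'cell', 'distance', 'avg_distance']]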
