Subdivide rows into intervals of 50 and take mean - python

I want to subdivide my dataframe into intervals of 50 rows, take the mean of each interval, and then build a new dataframe from those means.
              2          3          4
0     20.229517  21.444166  19.528986
1     19.420929  21.029457  18.895041
2     19.214857  21.122784  19.228065
3     19.454653  21.148373  19.249720
4     20.152334  22.183264  20.149488
...         ...        ...        ...
9995  20.673738  22.252024  21.587578
9996  21.948563  24.904633  23.962317
9997  24.318361  27.220770  25.322933
9998  24.570177  26.371695  23.503048
9999  23.274368  25.145500  22.028172
10000 rows × 3 columns
The result should be a 200 x 3 dataframe containing those interval means. I would like to do this in pandas if possible. Thanks in advance!

You can integer-divide the index by 50 and pass the result to groupby, then aggregate with mean:
import numpy as np
import pandas as pd

np.random.seed(2020)
df = pd.DataFrame(np.random.random(size=(10000, 3)))
print(df)
             0         1         2
0     0.986277  0.873392  0.509746
1     0.271836  0.336919  0.216954
2     0.276477  0.343316  0.862159
3     0.156700  0.140887  0.757080
4     0.736325  0.355663  0.341093
...        ...       ...       ...
9995  0.524481  0.454835  0.108934
9996  0.816516  0.354442  0.224834
9997  0.090518  0.887463  0.444833
9998  0.413673  0.315459  0.691306
9999  0.656559  0.113400  0.063397
[10000 rows x 3 columns]
df = df.groupby(df.index // 50).mean()
print(df)
            0         1         2
0    0.537299  0.484187  0.512674
1    0.446181  0.503547  0.455493
2    0.504955  0.446041  0.464571
3    0.567661  0.494185  0.519785
4    0.485611  0.553636  0.396364
..        ...       ...       ...
195  0.516285  0.433178  0.526545
196  0.476906  0.474619  0.465957
197  0.497325  0.511659  0.490382
198  0.564468  0.453961  0.467758
199  0.520884  0.455529  0.479706
[200 rows x 3 columns]
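If the dataframe does not have the default RangeIndex 0..9999, dividing the index labels by 50 will not produce blocks of 50 rows. In that case the grouping key can be built from row positions instead; a small variation on the same idea (here df is the original 10000-row frame, before it is overwritten):

import numpy as np

# np.arange(len(df)) // 50 gives 0 for the first 50 rows, 1 for the next 50, and so on,
# regardless of what the index labels actually are
df_means = df.groupby(np.arange(len(df)) // 50).mean()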

Related

Find occurrences where a column from one dataframe equals another, based on condition

I have the following two dataframes, which have different sizes: df1 (966 rows x 2 cols) and df2 (36 rows x 2 cols), where
df1:
     Video_#  Selected Joint.1
484        1     Left_shoulder
778        1     Left_shoulder
418        1    Right_shoulder
964        1    Right_shoulder
193        1    Right_shoulder
...      ...               ...
285       36       Right_elbow
267       36         Left_hand
216       36   Shoulder_centre
139       36    Right_shoulder
df2:
    Video_#            Ann.1
0         1  Shoulder_center
1         2             Head
2         3        Right_hip
..      ...              ...
33       34        Left_knee
34       35       Right_knee
35       36   Right_shoulder
Video_# goes from 1 to 36. In df2 the Video_# column contains each value just once, so 1-36 appear exactly once each, whereas in df1 each of 1-36 occurs multiple times, and not the same number of times for each (I hope that makes sense).
What I want to count is the number of occurrences where df1['Selected Joint.1'] == df2['Ann.1'] for the same Video_#. So the expected output is something like:
Video_#  Equality Occurrences
1        3
2        5
...      ...
36       6
Is that possible?
Use DataFrame.merge with GroupBy.size:
df = (df1.merge(df2,
                left_on=['Video_#', 'Selected Joint.1'],
                right_on=['Video_#', 'Ann.1'])
         .groupby('Video_#')
         .size()
         .reset_index(name='Equality Occurrences'))
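A minimal, self-contained sketch of the same idea on tiny made-up frames (the values here are hypothetical, only the column names match the question):

import pandas as pd

# hypothetical miniature versions of df1 and df2
df1 = pd.DataFrame({'Video_#': [1, 1, 1, 2, 2],
                    'Selected Joint.1': ['Head', 'Left_knee', 'Head', 'Right_hip', 'Head']})
df2 = pd.DataFrame({'Video_#': [1, 2],
                    'Ann.1': ['Head', 'Right_hip']})

# the inner merge keeps only rows whose joint matches that video's annotation,
# then size() counts the surviving rows per video
counts = (df1.merge(df2,
                    left_on=['Video_#', 'Selected Joint.1'],
                    right_on=['Video_#', 'Ann.1'])
             .groupby('Video_#')
             .size()
             .reset_index(name='Equality Occurrences'))
print(counts)  # Video_# 1 -> 2 matches, Video_# 2 -> 1 match

Note that videos with no matching rows drop out of the inner merge; if zero counts should also appear, reindex the result against the full 1-36 range and fill with 0.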

Euclidean Distance over 2 dataframes

I have 2 Dataframes
DF1-
         Name      X      Y
0  Astonished  0.430  0.890
1     Excited  0.700  0.720
2   Expectant  0.320  0.067
3  Passionate  0.333  0.127
[47 rows x 3 columns]
DF2-
   Id         X         Y
0   1 -0.288453  0.076105
1   4 -0.563453 -0.498895
2   5 -0.788453 -0.673895
3   6 -0.063453 -0.373895
4   7  0.311547  0.376105
[767 rows x 3 columns]
Now what I want to achieve is:
Take the X,Y of each entry of DF2 and, iterating over DF1, calculate the Euclidean distance between that point and every X,Y in DF1.
Find the minimum of all the Euclidean distances obtained for that point and save it somewhere, along with the corresponding entry from the Name column.
Example:
Say that for some X,Y tuple in DF2 the minimum Euclidean distance corresponds to the X,Y value in row 0 of DF1; then the result should be that distance together with the name Astonished.
My Attempt-
import pandas as pd
import numpy as np
import csv
mood = pd.read_csv("C:/Users/Desktop/DF1.csv")
song_value = pd.read_csv("C:/Users/Desktop/DF2.csv")
df_temp = mood.loc[:, ['Arousal','Valence']]
df_temp1 = song_value.loc[:, ['Arousal','Valence']]
import scipy
from scipy import spatial
ary = scipy.spatial.distance.cdist(mood.loc[:, ['Arousal', 'Valence']],
                                   song_value.loc[:, ['Arousal', 'Valence']],
                                   metric='euclidean')
print(ary)
Result Obtained -
[[1.08563344 1.70762362 1.98252253 ... 0.64569366 0.47426051 0.83656989]
[1.17967807 1.75556794 2.03922435 ... 0.59326275 0.2469077 0.79334076]
[0.60852124 1.04915517 1.33326431 ... 0.1848471 0.53293637 0.08394834]
...
[1.26151359 1.5500629 1.81168766 ... 0.74070027 0.70209658 0.75277205]
[0.69085994 1.03764923 1.31608627 ... 0.33265268 0.61928227 0.21397822]
[0.84484398 1.11428893 1.38222899 ... 0.48330291 0.69288125 0.3886008 ]]
I have no clue how I should proceed now.
Please suggest something.
EDIT - 1
I converted the array into another dataframe using
new_series = pd.DataFrame(ary)
print(new_series)
Result -
0 1 2 ... 764 765 766
0 1.085633 1.707624 1.982523 ... 0.645694 0.474261 0.836570
1 1.179678 1.755568 2.039224 ... 0.593263 0.246908 0.793341
2 0.608521 1.049155 1.333264 ... 0.184847 0.532936 0.083948
3 0.623534 1.093331 1.378075 ... 0.124156 0.479393 0.109057
4 0.791926 1.352785 1.636748 ... 0.197403 0.245908 0.398619
5 0.740038 1.260768 1.545785 ... 0.092072 0.304926 0.281791
6 0.923284 1.523395 1.803676 ... 0.415540 0.293217 0.611312
7 1.202447 1.679660 1.962823 ... 0.554256 0.247391 0.703298
8 0.824898 1.343684 1.628727 ... 0.177560 0.222666 0.360980
9 1.191411 1.604942 1.883150 ... 0.570771 0.395957 0.668736
10 0.822236 1.456863 1.708469 ... 0.706252 0.787271 0.823542
11 0.741683 1.371996 1.618916 ... 0.704496 0.835235 0.798964
12 0.346244 0.967891 1.240839 ... 0.376504 0.715617 0.359700
13 0.526096 1.163209 1.421820 ... 0.520190 0.748265 0.579333
14 0.435992 0.890291 1.083229 ... 0.937048 1.254437 0.884499
15 0.600338 1.162469 1.375755 ... 0.876228 1.116301 0.891714
16 0.634254 1.059083 1.226407 ... 1.088393 1.373536 1.058550
17 0.712227 1.284502 1.498187 ... 0.917272 1.117806 0.956957
18 0.194387 0.799728 1.045745 ... 0.666713 1.013563 0.597524
19 0.456000 0.708741 0.865870 ... 1.068296 1.420654 0.973234
20 0.633776 0.632060 0.709202 ... 1.277083 1.645173 1.157765
21 0.192291 0.597749 0.826602 ... 0.831713 1.204117 0.716746
22 0.522033 0.526969 0.645998 ... 1.170316 1.546040 1.041762
23 0.668148 0.504480 0.547920 ... 1.316602 1.698041 1.176933
24 0.718440 0.285718 0.280984 ... 1.334008 1.727796 1.166364
25 0.759187 0.265412 0.217165 ... 1.362786 1.757580 1.190132
26 0.598326 0.113459 0.380513 ... 1.087573 1.479296 0.896239
27 0.676841 0.263613 0.474246 ... 1.074911 1.456515 0.875707
28 0.865641 0.365394 0.462742 ... 1.239941 1.612779 1.038790
29 0.463623 0.511737 0.786284 ... 0.719525 1.099122 0.519226
30 0.780386 0.550483 0.750532 ... 0.987863 1.336760 0.788449
31 1.077559 0.711697 0.814205 ... 1.274933 1.602953 1.079529
32 1.020408 0.497152 0.522999 ... 1.372444 1.736938 1.170889
33 0.963911 0.367018 0.336035 ... 1.398444 1.778496 1.198905
34 1.092763 0.759612 0.873457 ... 1.256086 1.574565 1.063570
35 0.903631 0.810449 1.018501 ... 0.921287 1.219046 0.740134
36 0.728728 0.795942 1.045868 ... 0.695317 1.009043 0.512147
37 0.738314 0.600405 0.822742 ... 0.895225 1.239125 0.697393
38 1.206901 1.151385 1.343654 ... 1.064721 1.273002 0.922962
39 1.248530 1.293525 1.508517 ... 0.988508 1.137608 0.880669
40 0.988777 1.205968 1.463036 ... 0.622495 0.776919 0.541414
41 0.941001 1.043940 1.285215 ... 0.732293 0.960420 0.595174
42 1.242508 1.321327 1.544222 ... 0.947970 1.080069 0.851396
43 1.262534 1.399453 1.633948 ... 0.900340 0.989603 0.830024
44 1.261514 1.550063 1.811688 ... 0.740700 0.702097 0.752772
45 0.690860 1.037649 1.316086 ... 0.332653 0.619282 0.213978
46 0.844844 1.114289 1.382229 ... 0.483303 0.692881 0.388601
[47 rows x 767 columns]
Moreover, is this the best approach? Sorry, I am not sure, which is why I am asking.
Say df_1 and df_2 are your dataframes; first extract your coordinate pairs as shown below:
pairs_1 = list(zip(df_1.X, df_1.Y))
pairs_2 = list(zip(df_2.X, df_2.Y))
Then iterate over the pairs, as per your use case, and get the index of the minimum distance for each iterated point:
from scipy.spatial import distance

min_distances = []
closest_pairs = []
names = []
for i in pairs_2:
    dists = distance.cdist([i], pairs_1, metric='euclidean')
    min_distances.append(dists.min())
    index_min = dists.argmin()
    closest_pairs.append(df_1.loc[index_min, ['X', 'Y']])
    names.append(df_1.loc[index_min, 'Name'])
Insert the results into df_2:
df_2['min_distance'] = min_distances
df_2['closest_pairs'] = [tuple(i.values) for i in closest_pairs]
df_2['name'] = names
df_2
Output:
   Id         X         Y  min_distance  closest_pairs        name
0   1 -0.288453  0.076105      0.608521  (0.32, 0.067)   Expectant
1   4 -0.563453 -0.498895      1.049155  (0.32, 0.067)   Expectant
2   5 -0.788453 -0.673895      1.333264  (0.32, 0.067)   Expectant
3   6 -0.063453 -0.373895      0.584316  (0.32, 0.067)   Expectant
4   7  0.311547  0.376105      0.250027  (0.33, 0.127)  Passionate
I have added min_distance and closest_pairs as well; you can drop those columns if you do not need them.
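If the loop becomes a bottleneck on 767 x 47 points, the whole computation can also be done with a single cdist call; a vectorized sketch, assuming both frames really do have numeric X and Y columns as in the example:

import numpy as np
from scipy.spatial import distance

# full 767 x 47 distance matrix: one row per DF2 point, one column per DF1 point
dist_matrix = distance.cdist(df_2[['X', 'Y']], df_1[['X', 'Y']], metric='euclidean')

# position of the nearest DF1 row for every DF2 row, plus the matching distance and name
nearest = dist_matrix.argmin(axis=1)
df_2['min_distance'] = dist_matrix[np.arange(len(df_2)), nearest]
df_2['name'] = df_1['Name'].to_numpy()[nearest]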

Why is pandas.join() not merging correctly along index?

I'm trying to merge two dataframes with identical indices into a single dataframe, but I can't seem to get it working. I expect the repeated values due to the resample function. The final dataframe seems to have sorted the indices in ascending order, which is fine, but why is it now about twice as long?
Here is the code:
Original dataframe:
     default student      balance        income
0         No      No   729.526495  44361.625074
1         No     Yes   817.180407  12106.134700
2         No      No  1073.549164  31767.138947
3         No      No   529.250605  35704.493935
4         No      No   785.655883  38463.495879
...      ...     ...          ...           ...
9995      No      No   711.555020  52992.378914
9996      No      No   757.962918  19660.721768
9997      No      No   845.411989  58636.156984
9998      No      No  1569.009053  36669.112365
9999      No     Yes   200.922183  16862.952321
10000 rows × 4 columns
# resample is presumably sklearn.utils.resample (the import is not shown in the question)
X = default[['balance', 'income']]
y = default['default']
boot = resample(X, y, replace=True, n_samples=len(X), random_state=1)
# convert to dataframe
boot = np.array(boot)
X = np.array(boot)[0]
y = np.array(boot)[1]
df = pd.DataFrame(X, index=X.index)
dfy = pd.DataFrame(y, index=y.index)
df = df.join(dfy)
X dataframe:
          balance        income
235    964.820253  34390.746035
5192     0.000000  29322.631394
905   1234.476479  31313.374575
7813  1598.020831  39163.361056
2895  1270.092810  16809.006452
...           ...           ...
7920   761.988491  39172.945235
1525   916.536937  20130.915258
4981  1037.573018  18769.579024
8104   912.065531  62142.061061
6990  1341.615739  26319.015588
[10000 rows x 2 columns]
Y dataframe:
     default
235       No
5192      No
905       No
7813     Yes
2895      No
...      ...
7920      No
1525      No
4981      No
8104      No
6990      No
[10000 rows x 1 columns]
They combine to give this, for some reason:
          balance        income default
0      729.526495  44361.625074      No
0      729.526495  44361.625074      No
0      729.526495  44361.625074      No
0      729.526495  44361.625074      No
1      817.180407  12106.134700      No
...           ...           ...     ...
9998  1569.009053  36669.112365      No
9999   200.922183  16862.952321      No
9999   200.922183  16862.952321      No
9999   200.922183  16862.952321      No
9999   200.922183  16862.952321      No
20334 rows × 3 columns
Can someone explain where I'm going wrong?
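The effect described above can be reproduced in isolation: when the same index label appears several times on both sides (which bootstrap resampling with replacement produces), join pairs every left occurrence of a label with every right occurrence, so duplicated labels multiply instead of lining up one-to-one. A minimal sketch:

import pandas as pd

# index label 0 appears twice on each side, label 1 once
left = pd.DataFrame({'balance': [729.5, 729.5, 817.2]}, index=[0, 0, 1])
right = pd.DataFrame({'default': ['No', 'No', 'No']}, index=[0, 0, 1])

joined = left.join(right)
print(len(joined))  # 5 rows: label 0 contributes 2 x 2 = 4, label 1 contributes 1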

How to calculate the expanding mean of all the columns across the DataFrame and add to DataFrame

I am trying to calculate, for each column of the DataFrame, the mean of all previous rows, and add the calculated mean columns to the DataFrame.
I am using a set of NBA games data that contains 20+ feature columns that I am trying to calculate the means for. An example of the dataset is below. (Note: "...." represents the rest of the feature columns.)
Team  TeamPoints  OpponentPoints  ....  TeamPoints_mean  OpponentPoints_mean
ATL          102             109  ....              nan                  nan
ATL          102              92  ....              102                  109
ATL           92              94  ....              102                100.5
BOS          119             122  ....            98.67                98.33
BOS          103              96  ....           103.75               104.25
Example for calculating two of the columns:
dataset = pd.read_csv('nba.games.stats.csv')
df = dataset
df['TeamPoints_mean'] = df.groupby('Team')['TeamPoints'].apply(lambda x: x.shift().expanding().mean())
df['OpponentPoints_mean'] = df.groupby('Team')['OpponentPoints'].apply(lambda x: x.shift().expanding().mean())
Again, this code only calculates the mean and adds the column to the DataFrame one column at a time. Is there a way to get all of the column means and add them to the DataFrame without doing them one at a time? A for loop? An example of what I am looking for is below.
Team  TeamPoints  OpponentPoints  ....  TeamPoints_mean  OpponentPoints_mean  ...
ATL          102             109  ....              nan                  nan
ATL          102              92  ....              102                  109
ATL           92              94  ....              102                100.5
BOS          119             122  ....            98.67                98.33
BOS          103              96  ....           103.75               104.25
("..." = the mean columns of the rest of the feature columns)
Try this one:
(0) Sample input:
>>> df
       col1      col2      col3
0  1.490977  1.784433  0.852842
1  3.726663  2.845369  7.766797
2  0.042541  1.196383  6.568839
3  4.784911  0.444671  8.019933
4  3.831556  0.902672  0.198920
5  3.672763  2.236639  1.528215
6  0.792616  2.604049  0.373296
7  2.281992  2.563639  1.500008
8  4.096861  0.598854  4.934116
9  3.632607  1.502801  0.241920
Then processing:
(1) Side table to get all the means on the side (I didn't find a cumulative mean function, so went with cumsum + count):
>>> df_side = df.assign(col_temp=1).cumsum()
>>> df_side
        col1       col2       col3  col_temp
0   1.490977   1.784433   0.852842       1.0
1   5.217640   4.629801   8.619638       2.0
2   5.260182   5.826184  15.188477       3.0
3  10.045093   6.270855  23.208410       4.0
4  13.876649   7.173527  23.407330       5.0
5  17.549412   9.410166  24.935545       6.0
6  18.342028  12.014215  25.308841       7.0
7  20.624021  14.577855  26.808849       8.0
8  24.720882  15.176708  31.742965       9.0
9  28.353489  16.679509  31.984885      10.0
>>> for el in df.columns:
...     df_side["{}_mean".format(el)] = df_side[el] / df_side.col_temp
>>> df_side = df_side.drop([el for el in df.columns] + ["col_temp"], axis=1)
>>> df_side
   col1_mean  col2_mean  col3_mean
0   1.490977   1.784433   0.852842
1   2.608820   2.314901   4.309819
2   1.753394   1.942061   5.062826
3   2.511273   1.567714   5.802103
4   2.775330   1.434705   4.681466
5   2.924902   1.568361   4.155924
6   2.620290   1.716316   3.615549
7   2.578003   1.822232   3.351106
8   2.746765   1.686301   3.526996
9   2.835349   1.667951   3.198489
(2) joining back, on index:
>>> df_final = df.join(df_side)
>>> df_final
       col1      col2      col3  col1_mean  col2_mean  col3_mean
0  1.490977  1.784433  0.852842   1.490977   1.784433   0.852842
1  3.726663  2.845369  7.766797   2.608820   2.314901   4.309819
2  0.042541  1.196383  6.568839   1.753394   1.942061   5.062826
3  4.784911  0.444671  8.019933   2.511273   1.567714   5.802103
4  3.831556  0.902672  0.198920   2.775330   1.434705   4.681466
5  3.672763  2.236639  1.528215   2.924902   1.568361   4.155924
6  0.792616  2.604049  0.373296   2.620290   1.716316   3.615549
7  2.281992  2.563639  1.500008   2.578003   1.822232   3.351106
8  4.096861  0.598854  4.934116   2.746765   1.686301   3.526996
9  3.632607  1.502801  0.241920   2.835349   1.667951   3.198489
I am trying to calculate the means of all previous rows for each column of the DataFrame
To get all of the columns, you can do:
df_means = df.join(df.cumsum() /
                   df.applymap(lambda x: 1).cumsum(),
                   rsuffix="_mean")
However, if Team is a column rather than the index, you'd want to get rid of it first:
df_data = df.drop('Team', axis=1)
df_means = df.join(df_data.cumsum() /
                   df_data.applymap(lambda x: 1).cumsum(),
                   rsuffix="_mean")
You could also do
import numpy as np
df_data = df[[col for col in df.columns
              if np.issubdtype(df[col].dtype, np.number)]]
Or manually define a list of columns that you want to take the mean of, cols_for_mean, and then do
df_data = df[cols_for_mean]
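For what it's worth, pandas also has a built-in expanding mean, which gives the same side table without the cumsum / running-count trick; a sketch assuming, as above, that Team is a regular column and the remaining columns are numeric:

import pandas as pd

numeric = df.drop('Team', axis=1)

# expanding (cumulative) mean of every numeric column, including the current row --
# the same values as the cumsum / col_temp side table above
df_means = df.join(numeric.expanding().mean().add_suffix('_mean'))

# per-team mean of all *previous* rows, i.e. the question's own
# shift().expanding().mean() idea applied to every column at once
prev = (df.groupby('Team')[list(numeric.columns)]
          .transform(lambda s: s.shift().expanding().mean())
          .add_suffix('_prevmean'))
df_means = df_means.join(prev)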

How to add a repeated column using pandas

I am doing my homework and I have run into a problem. I have a large matrix whose first column, Y002, is a nominal variable with 3 levels, encoded as 1, 2 and 3 respectively. The other two columns, V96 and V97, are just numeric.
Now I want to get the group means corresponding to the variable Y002. I wrote code like this:
group = data2.groupby(by=["Y002"]).mean()
Then I index into it to get each group's mean using:
group1 = group["V96"]
group2 = group["V97"]
Now I want to append these group means as a new column of the original dataframe, in which each mean matches the corresponding Y002 code (1, 2 or 3). I tried this code, but it only shows NaN:
data2["group1"] = pd.Series(group1, index=data2.index)
Hope someone can help me with this, many thanks :)
PS: I hope this makes sense. In R, we can do the same thing with
data2$group1 = with(data2, tapply(V97, Y002, mean))[data2$Y002]
But how can we implement this in Python and pandas?
You can use .transform()
import pandas as pd
import numpy as np
# your data
# ============================
np.random.seed(0)
df = pd.DataFrame({'Y002': np.random.randint(1,4,100), 'V96': np.random.randn(100), 'V97': np.random.randn(100)})
print(df)
        V96     V97  Y002
0   -0.6866 -0.1478     1
1    0.0149  1.6838     2
2   -0.3757  0.9718     1
3   -0.0382  1.6077     2
4    0.3680 -0.2571     2
5   -0.0447  1.8098     3
6   -0.3024  0.8923     1
7   -2.2244 -0.0966     3
8    0.7240 -0.3772     1
9    0.3590 -0.5053     1
..      ...     ...   ...
90  -0.6906  1.5567     2
91  -0.6815 -0.4189     3
92  -1.5122 -0.4097     1
93   2.1969  1.1164     2
94   1.0412 -0.2510     3
95  -0.0332 -0.4152     1
96   0.0656 -0.6391     3
97   0.2658  2.4978     1
98   1.1518 -3.0051     2
99   0.1380 -0.8740     3
# processing
# ===========================
df['V96_mean'] = df.groupby('Y002')['V96'].transform(np.mean)
df['V97_mean'] = df.groupby('Y002')['V97'].transform(np.mean)
df
        V96     V97  Y002  V96_mean  V97_mean
0   -0.6866 -0.1478     1   -0.1944    0.0837
1    0.0149  1.6838     2    0.0497   -0.0496
2   -0.3757  0.9718     1   -0.1944    0.0837
3   -0.0382  1.6077     2    0.0497   -0.0496
4    0.3680 -0.2571     2    0.0497   -0.0496
5   -0.0447  1.8098     3    0.0053   -0.0707
6   -0.3024  0.8923     1   -0.1944    0.0837
7   -2.2244 -0.0966     3    0.0053   -0.0707
8    0.7240 -0.3772     1   -0.1944    0.0837
9    0.3590 -0.5053     1   -0.1944    0.0837
..      ...     ...   ...       ...       ...
90  -0.6906  1.5567     2    0.0497   -0.0496
91  -0.6815 -0.4189     3    0.0053   -0.0707
92  -1.5122 -0.4097     1   -0.1944    0.0837
93   2.1969  1.1164     2    0.0497   -0.0496
94   1.0412 -0.2510     3    0.0053   -0.0707
95  -0.0332 -0.4152     1   -0.1944    0.0837
96   0.0656 -0.6391     3    0.0053   -0.0707
97   0.2658  2.4978     1   -0.1944    0.0837
98   1.1518 -3.0051     2    0.0497   -0.0496
99   0.1380 -0.8740     3    0.0053   -0.0707
[100 rows x 5 columns]
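As a small extension (a sketch, not part of the original answer): both mean columns can be added in one step by passing the list of columns to the grouped transform:

means = df.groupby('Y002')[['V96', 'V97']].transform('mean').add_suffix('_mean')
df = df.join(means)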
