Compare two dataframes and find rows based on a value with condition - python

I have two DataFrames with a common column: "video_path" in df1 and "Video_path" in df2. I need to pull the matching details from df1 into df2, and mark each df2 row with yes/no depending on whether its path is present in df1.
df1 and df2 (shown as screenshots in the original post; the dicts to recreate them are below)
Expected result: df2 with an isPresent yes/no column plus the matching frame details from df1 (screenshot omitted).
What I tried:
newdf = df1.merge(df2, left_on='video_path', right_on='Video_path', how='inner')
But I'm sure it's not correct.
Code to create the DataFrames:
import pandas as pd
import numpy as np
df1 = {'video_path': {0: 'video/file_path/1.mp4', 1: 'video/file_path/1.mp4', 2: 'video/file_path/1.mp4', 3: 'video/file_path/2.mp4', 4: 'video/file_path/2.mp4', 5: 'video/file_path/2.mp4', 6: 'video/file_path/2.mp4', 7: 'video/file_path/2.mp4', 8: 'video/file_path/3.mp4', 9: 'video/file_path/3.mp4', 10: 'video/file_path/3.mp4', 11: 'video/file_path/4.mp4', 12: 'video/file_path/4.mp4', 13: 'video/file_path/4.mp4', 14: 'video/file_path/4.mp4', 15: 'video/file_path/5.mp4', 16: 'video/file_path/5.mp4', 17: 'video/file_path/5.mp4', 18: 'video/file_path/5.mp4', 19: 'video/file_path/6.mp4', 20: 'video/file_path/6.mp4', 21: 'video/file_path/6.mp4', 22: 'video/file_path/6.mp4', 23: 'video/file_path/6.mp4'}, 'frame_details': {0: 'frame_1.jpg', 1: 'frame_2.jpg', 2: 'frame_3.jpg', 3: 'frame_1.jpg', 4: 'frame_2.jpg', 5: 'frame_3.jpg', 6: 'frame_4.jpg', 7: 'frame_5.jpg', 8: 'frame_1.jpg', 9: 'frame_2.jpg', 10: 'frame_3.jpg', 11: 'frame_1.jpg', 12: 'frame_2.jpg', 13: 'frame_3.jpg', 14: 'frame_4.jpg', 15: 'frame_1.jpg', 16: 'frame_2.jpg', 17: 'frame_3.jpg', 18: 'frame_4.jpg', 19: 'frame_1.jpg', 20: 'frame_2.jpg', 21: 'frame_3.jpg', 22: 'frame_4.jpg', 23: 'frame_5.jpg'}, 'width': {0: 520, 1: 520, 2: 520, 3: 120, 4: 120, 5: 120, 6: 120, 7: 120, 8: 720, 9: 720, 10: 720, 11: 1080, 12: 1080, 13: 1080, 14: 1080, 15: 480, 16: 480, 17: 480, 18: 480, 19: 640, 20: 640, 21: 640, 22: 640, 23: 640}, 'height': {0: 225, 1: 225, 2: 225, 3: 120, 4: 120, 5: 120, 6: 120, 7: 120, 8: 480, 9: 480, 10: 480, 11: 1920, 12: 1920, 13: 1920, 14: 1920, 15: 640, 16: 640, 17: 640, 18: 640, 19: 480, 20: 480, 21: 480, 22: 480, 23: 480}, 'hasAudio': {0: 'yes', 1: 'yes', 2: 'yes', 3: 'yes', 4: 'yes', 5: 'yes', 6: 'yes', 7: 'yes', 8: 'yes', 9: 'yes', 10: 'yes', 11: 'no', 12: 'no', 13: 'no', 14: 'no', 15: 'no', 16: 'no', 17: 'no', 18: 'no', 19: 'yes', 20: 'yes', 21: 'yes', 22: 'yes', 23: 'yes'}}
df2 = {'Video_path': {0: 'video/file_path/1.mp4',
                      1: 'video/file_path/2.mp4',
                      2: 'video/file_path/4.mp4',
                      3: 'video/file_path/6.mp4',
                      4: 'video/file_path/7.mp4',
                      5: 'video/file_path/8.mp4',
                      6: 'video/file_path/9.mp4'},
       'isPresent': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan, 5: np.nan, 6: np.nan}}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)

Swap df1 and df2, use a left join with the indicator parameter, and finally set the isPresent column with Series.map:
newdf = df2.merge(df1.rename(columns={'video_path': 'Video_path'}),
                  on='Video_path',
                  how='left',
                  indicator=True)
newdf['isPresent'] = newdf.pop('_merge').map({'both': 'yes', 'left_only': 'no'})
print(newdf)
Video_path isPresent frame_details width height hasAudio
0 video/file_path/1.mp4 yes frame_1.jpg 520.0 225.0 yes
1 video/file_path/1.mp4 yes frame_2.jpg 520.0 225.0 yes
2 video/file_path/1.mp4 yes frame_3.jpg 520.0 225.0 yes
3 video/file_path/2.mp4 yes frame_1.jpg 120.0 120.0 yes
4 video/file_path/2.mp4 yes frame_2.jpg 120.0 120.0 yes
5 video/file_path/2.mp4 yes frame_3.jpg 120.0 120.0 yes
6 video/file_path/2.mp4 yes frame_4.jpg 120.0 120.0 yes
7 video/file_path/2.mp4 yes frame_5.jpg 120.0 120.0 yes
8 video/file_path/4.mp4 yes frame_1.jpg 1080.0 1920.0 no
9 video/file_path/4.mp4 yes frame_2.jpg 1080.0 1920.0 no
10 video/file_path/4.mp4 yes frame_3.jpg 1080.0 1920.0 no
11 video/file_path/4.mp4 yes frame_4.jpg 1080.0 1920.0 no
12 video/file_path/6.mp4 yes frame_1.jpg 640.0 480.0 yes
13 video/file_path/6.mp4 yes frame_2.jpg 640.0 480.0 yes
14 video/file_path/6.mp4 yes frame_3.jpg 640.0 480.0 yes
15 video/file_path/6.mp4 yes frame_4.jpg 640.0 480.0 yes
16 video/file_path/6.mp4 yes frame_5.jpg 640.0 480.0 yes
17 video/file_path/7.mp4 no NaN NaN NaN NaN
18 video/file_path/8.mp4 no NaN NaN NaN NaN
19 video/file_path/9.mp4 no NaN NaN NaN NaN
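If you only need the yes/no flag, without pulling in the frame details, a minimal alternative sketch (assuming the same df1 and df2 as above) uses Series.isin instead of a merge:
df2['isPresent'] = df2['Video_path'].isin(df1['video_path']).map({True: 'yes', False: 'no'})
print(df2)
This also avoids duplicating df2 rows for videos that have several frames.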

Related

How to put values on a single row from multiple columns in Pandas

I have been scratching my head for days about this problem. Please find below the structure of my input data and the output that I want.
In the original screenshots I color-coded each group, defined by ID, Plot, Survey, Trial and the 3 estimation methods.
In the output, I want all the scores for each group on the same row. Doing that should get rid of the Estimation Method column in the output; I kept it for the sake of clarity.
This is my code. Thank you in advance for your time.
import pandas as pd
import numpy as np
import functools
data_dict = {'ID': {0: 'id1',
1: 'id1',
2: 'id1',
3: 'id1',
4: 'id1',
5: 'id1',
6: 'id1',
7: 'id1',
8: 'id1',
9: 'id1',
10: 'id1',
11: 'id1',
12: 'id1',
13: 'id1',
14: 'id1',
15: 'id1',
16: 'id1',
17: 'id1',
18: 'id1',
19: 'id1',
20: 'id1',
21: 'id1',
22: 'id1',
23: 'id1'},
'Plot': {0: 'p1',
1: 'p1',
2: 'p1',
3: 'p1',
4: 'p1',
5: 'p1',
6: 'p1',
7: 'p1',
8: 'p1',
9: 'p1',
10: 'p1',
11: 'p1',
12: 'p1',
13: 'p1',
14: 'p1',
15: 'p1',
16: 'p1',
17: 'p1',
18: 'p1',
19: 'p1',
20: 'p1',
21: 'p1',
22: 'p1',
23: 'p1'},
'Survey': {0: 'Sv1',
1: 'Sv1',
2: 'Sv1',
3: 'Sv1',
4: 'Sv1',
5: 'Sv1',
6: 'Sv2',
7: 'Sv2',
8: 'Sv2',
9: 'Sv2',
10: 'Sv2',
11: 'Sv2',
12: 'Sv1',
13: 'Sv1',
14: 'Sv1',
15: 'Sv1',
16: 'Sv1',
17: 'Sv1',
18: 'Sv2',
19: 'Sv2',
20: 'Sv2',
21: 'Sv2',
22: 'Sv2',
23: 'Sv2'},
'Trial': {0: 't1',
1: 't1',
2: 't1',
3: 't2',
4: 't2',
5: 't2',
6: 't1',
7: 't1',
8: 't1',
9: 't2',
10: 't2',
11: 't2',
12: 't1',
13: 't1',
14: 't1',
15: 't2',
16: 't2',
17: 't2',
18: 't1',
19: 't1',
20: 't1',
21: 't2',
22: 't2',
23: 't2'},
'Mission': {0: 'mission1',
1: 'mission1',
2: 'mission1',
3: 'mission1',
4: 'mission1',
5: 'mission1',
6: 'mission1',
7: 'mission1',
8: 'mission1',
9: 'mission1',
10: 'mission1',
11: 'mission2',
12: 'mission2',
13: 'mission2',
14: 'mission2',
15: 'mission2',
16: 'mission2',
17: 'mission2',
18: 'mission2',
19: 'mission2',
20: 'mission2',
21: 'mission2',
22: 'mission2',
23: 'mission2'},
'Estimation Method': {0: 'MCARI2',
1: 'NDVI',
2: 'NDRE',
3: 'MCARI2',
4: 'NDVI',
5: 'NDRE',
6: 'MCARI2',
7: 'NDVI',
8: 'NDRE',
9: 'MCARI2',
10: 'NDVI',
11: 'NDRE',
12: 'MCARI2',
13: 'NDVI',
14: 'NDRE',
15: 'MCARI2',
16: 'NDVI',
17: 'NDRE',
18: 'MCARI2',
19: 'NDVI',
20: 'NDRE',
21: 'MCARI2',
22: 'NDVI',
23: 'NDRE'},
'MCARI2_sd': {0: 1.5,
1: np.nan,
2: np.nan,
3: 10.0,
4: np.nan,
5: np.nan,
6: 1.5,
7: np.nan,
8: np.nan,
9: 10.0,
10: np.nan,
11: np.nan,
12: 101.0,
13: np.nan,
14: np.nan,
15: 23.5,
16: np.nan,
17: np.nan,
18: 111.0,
19: np.nan,
20: np.nan,
21: 72.0,
22: np.nan,
23: np.nan},
'MACRI2_50': {0: 12.4,
1: np.nan,
2: np.nan,
3: 11.0,
4: np.nan,
5: np.nan,
6: 12.4,
7: np.nan,
8: np.nan,
9: 11.0,
10: np.nan,
11: np.nan,
12: 102.0,
13: np.nan,
14: np.nan,
15: 2.1,
16: np.nan,
17: np.nan,
18: 112.0,
19: np.nan,
20: np.nan,
21: 74.0,
22: np.nan,
23: np.nan},
'MACRI2_AVG': {0: 15.0,
1: np.nan,
2: np.nan,
3: 12.0,
4: np.nan,
5: np.nan,
6: 15.0,
7: np.nan,
8: np.nan,
9: 12.0,
10: np.nan,
11: np.nan,
12: 103.0,
13: np.nan,
14: np.nan,
15: 24.0,
16: np.nan,
17: np.nan,
18: 113.0,
19: np.nan,
20: np.nan,
21: 77.0,
22: np.nan,
23: np.nan},
'NDVI_sd': {0: np.nan,
1: 2.9,
2: np.nan,
3: np.nan,
4: 20.0,
5: np.nan,
6: np.nan,
7: 2.9,
8: np.nan,
9: np.nan,
10: 20.0,
11: np.nan,
12: np.nan,
13: 201.0,
14: np.nan,
15: np.nan,
16: 11.0,
17: np.nan,
18: np.nan,
19: 200.0,
20: np.nan,
21: np.nan,
22: 32.0,
23: np.nan},
'NDVI_50': {0: np.nan,
1: 21.0,
2: np.nan,
3: np.nan,
4: 21.0,
5: np.nan,
6: np.nan,
7: 21.0,
8: np.nan,
9: np.nan,
10: 21.0,
11: np.nan,
12: np.nan,
13: 201.0,
14: np.nan,
15: np.nan,
16: 12.0,
17: np.nan,
18: np.nan,
19: 300.0,
20: np.nan,
21: np.nan,
22: 39.0,
23: np.nan},
'NDVI_AVG': {0: np.nan,
1: 27.0,
2: np.nan,
3: np.nan,
4: 22.0,
5: np.nan,
6: np.nan,
7: 27.0,
8: np.nan,
9: np.nan,
10: 22.0,
11: np.nan,
12: np.nan,
13: 203.0,
14: np.nan,
15: np.nan,
16: 13.0,
17: np.nan,
18: np.nan,
19: 400.0,
20: np.nan,
21: np.nan,
22: 40.0,
23: np.nan},
'NDRE_sd': {0: np.nan,
1: np.nan,
2: 3.1,
3: np.nan,
4: np.nan,
5: 31.0,
6: np.nan,
7: np.nan,
8: 3.1,
9: np.nan,
10: np.nan,
11: 31.0,
12: np.nan,
13: np.nan,
14: 301.0,
15: np.nan,
16: np.nan,
17: 15.0,
18: np.nan,
19: np.nan,
20: 57.0,
21: np.nan,
22: np.nan,
23: 21.0},
'NDRE_50': {0: np.nan,
1: np.nan,
2: 33.0,
3: np.nan,
4: np.nan,
5: 32.0,
6: np.nan,
7: np.nan,
8: 33.0,
9: np.nan,
10: np.nan,
11: 32.0,
12: np.nan,
13: np.nan,
14: 302.0,
15: np.nan,
16: np.nan,
17: 16.0,
18: np.nan,
19: np.nan,
20: 58.0,
21: np.nan,
22: np.nan,
23: 22.0},
'NDRE_AVG': {0: np.nan,
1: np.nan,
2: 330.0,
3: np.nan,
4: np.nan,
5: 33.0,
6: np.nan,
7: np.nan,
8: 330.0,
9: np.nan,
10: np.nan,
11: 33.0,
12: np.nan,
13: np.nan,
14: 303.0,
15: np.nan,
16: np.nan,
17: 17.0,
18: np.nan,
19: np.nan,
20: 59.0,
21: np.nan,
22: np.nan,
23: 32.0}}
df_test = pd.DataFrame(data_dict)
def generate_data_per_EM(df):
    data_survey = []
    for (survey, mission, trial, em), data in df.groupby(['Survey', 'Mission', 'Trial', 'Estimation Method']):
        df_em = data.set_index('ID').dropna(axis=1)
        df_em.to_csv(f'tmp_data_{survey}_{mission}_{trial}_{em}.csv')  # This generates 74 files, but not sure how to join/merge them
        data_survey.append(df_em)
    # Merge the df_em column-wise
    df_final = functools.reduce(lambda left, right: pd.merge(left, right, on=['ID', 'Survey', 'Mission', 'Trial']), data_survey)
    df_final.to_csv(f'final_{survey}_{mission}_{em}.csv')  # Output is not what I expected

generate_data_per_EM(df_test)
You need a groupby; first(numeric_only=True) takes the first non-NaN value in each numeric column per group, which collapses the per-method rows into one:
(df_test
 .groupby(['ID', 'Plot', 'Survey', 'Trial', 'Mission'], as_index=False, sort=False)
 .first(numeric_only=True)
)
ID Plot Survey Trial Mission MCARI2_sd MACRI2_50 MACRI2_AVG NDVI_sd NDVI_50 NDVI_AVG NDRE_sd NDRE_50 NDRE_AVG
0 id1 p1 Sv1 t1 mission1 1.5 12.4 15.0 2.9 21.0 27.0 3.1 33.0 330.0
1 id1 p1 Sv1 t2 mission1 10.0 11.0 12.0 20.0 21.0 22.0 31.0 32.0 33.0
2 id1 p1 Sv2 t1 mission1 1.5 12.4 15.0 2.9 21.0 27.0 3.1 33.0 330.0
3 id1 p1 Sv2 t2 mission1 10.0 11.0 12.0 20.0 21.0 22.0 NaN NaN NaN
4 id1 p1 Sv2 t2 mission2 72.0 74.0 77.0 32.0 39.0 40.0 31.0 32.0 33.0
5 id1 p1 Sv1 t1 mission2 101.0 102.0 103.0 201.0 201.0 203.0 301.0 302.0 303.0
6 id1 p1 Sv1 t2 mission2 23.5 2.1 24.0 11.0 12.0 13.0 15.0 16.0 17.0
7 id1 p1 Sv2 t1 mission2 111.0 112.0 113.0 200.0 300.0 400.0 57.0 58.0 59.0
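Each group holds exactly one non-NaN value per score column (one row per estimation method), which is why first works here; max collapses the rows identically. A minimal sketch, assuming df_test as defined above:
out = (df_test
       .groupby(['ID', 'Plot', 'Survey', 'Trial', 'Mission'], as_index=False, sort=False)
       .max(numeric_only=True))
print(out)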

How to get cumulative sum over index and columns in pandas? [duplicate]

This question already has an answer here:
How can I use cumsum within a group in Pandas?
(1 answer)
Closed 6 months ago.
I have a table of monthly premiums in different categories over a year for different companies. The dataframe looks like the below:
   Company  Type             Month  Year  Ferdi Grup  Premium
1  Allianz  Birikimli Hayat      1  2022  Ferdi           325
2  Allianz  Birikimli Hayat      2  2022  Ferdi           476
3  Axa      Birikimli Hayat      3  2022  Ferdi           687
I want a table where premiums are accumulated over 'Company' and 'Year': for each month, the premium cumulated from the beginning of that year.
This is the regular sum operation which works well in this case.
data.pivot_table(
    columns='Company',
    index='Month',
    values='Premium',
    aggfunc=np.sum
)
However, when I change the aggfunc to np.cumsum, the result is a Series. I want a cumulative pivot table for each year, adding each month's value to the previous months'. How can I do that?
Expected output:
   Company  Month  Year  Premium
1  Allianz      1  2022      325
2  Allianz      2  2022      801
3  Axa          3  2022      687
So, this is the original data I am working with:
{'Company': {0: 'AgeSA',
1: 'Türkiye',
2: 'Türkiye',
3: 'AgeSA',
4: 'AgeSA',
5: 'Türkiye',
6: 'AgeSA',
7: 'Türkiye',
8: 'Türkiye',
9: 'AgeSA',
10: 'Türkiye',
11: 'Türkiye',
12: 'AgeSA',
13: 'Türkiye',
14: 'Türkiye',
15: 'AgeSA',
16: 'AgeSA',
17: 'Türkiye',
18: 'AgeSA',
19: 'Türkiye',
20: 'Türkiye',
21: 'AgeSA',
22: 'Türkiye',
23: 'Türkiye'},
'Type': {0: 'Birikimli Hayat',
1: 'Birikimli Hayat',
2: 'Sadece Yaşam Teminatlı',
3: 'Karma Sigorta',
4: 'Yıllık Vefat',
5: 'Yıllık Vefat',
6: 'Uzun Süreli Vefat',
7: 'Uzun Süreli Vefat',
8: 'Birikimli Hayat',
9: 'Yıllık Vefat',
10: 'Yıllık Vefat',
11: 'Uzun Süreli Vefat',
12: 'Birikimli Hayat',
13: 'Birikimli Hayat',
14: 'Sadece Yaşam Teminatlı',
15: 'Karma Sigorta',
16: 'Yıllık Vefat',
17: 'Yıllık Vefat',
18: 'Uzun Süreli Vefat',
19: 'Uzun Süreli Vefat',
20: 'Birikimli Hayat',
21: 'Yıllık Vefat',
22: 'Yıllık Vefat',
23: 'Uzun Süreli Vefat'},
'Month': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 1,
7: 1,
8: 1,
9: 1,
10: 1,
11: 1,
12: 2,
13: 2,
14: 2,
15: 2,
16: 2,
17: 2,
18: 2,
19: 2,
20: 2,
21: 2,
22: 2,
23: 2},
'Year': {0: 2022,
1: 2022,
2: 2022,
3: 2022,
4: 2022,
5: 2022,
6: 2022,
7: 2022,
8: 2022,
9: 2022,
10: 2022,
11: 2022,
12: 2022,
13: 2022,
14: 2022,
15: 2022,
16: 2022,
17: 2022,
18: 2022,
19: 2022,
20: 2022,
21: 2022,
22: 2022,
23: 2022},
'Ferdi Grup': {0: 'Ferdi',
1: 'Ferdi',
2: 'Ferdi',
3: 'Ferdi',
4: 'Ferdi',
5: 'Ferdi',
6: 'Ferdi',
7: 'Ferdi',
8: 'Grup',
9: 'Grup',
10: 'Grup',
11: 'Grup',
12: 'Ferdi',
13: 'Ferdi',
14: 'Ferdi',
15: 'Ferdi',
16: 'Ferdi',
17: 'Ferdi',
18: 'Ferdi',
19: 'Ferdi',
20: 'Grup',
21: 'Grup',
22: 'Grup',
23: 'Grup'},
'Premium': {0: 936622.43,
1: 14655.67,
2: 8496.0,
3: 124768619.29,
4: 6651019.24,
5: 11055383.530005993,
6: 54273212.457471885,
7: 22163192.66,
8: 81000.95,
9: 9338009.52,
10: 251790130.54997802,
11: 140949274.79999998,
12: 910808.77,
13: 8754.71,
14: 7128.0,
15: 129753498.31,
16: 8015974.454128993,
17: 16776490.000003006,
18: 67607915.34000003,
19: 24683694.700000003,
20: 60887.56,
21: 1497105.2458709963,
22: 195019190.297756,
23: 167424048.43},
'cumsum': {0: 936622.43,
1: 14655.67,
2: 23151.67,
3: 125705241.72000001,
4: 132356260.96000001,
5: 11078535.200005993,
6: 186629473.4174719,
7: 33241727.860005993,
8: 33322728.810005993,
9: 195967482.9374719,
10: 285112859.35998404,
11: 426062134.159984,
12: 196878291.7074719,
13: 426070888.869984,
14: 426078016.869984,
15: 326631790.0174719,
16: 334647764.4716009,
17: 442854506.869987,
18: 402255679.8116009,
19: 467538201.569987,
20: 467599089.129987,
21: 403752785.05747193,
22: 662618279.427743,
23: 830042327.857743}}
This is the result of a regular sum pivot:
Company               AgeSA             Türkiye
Month
1         195967482.9374719    426062134.159984
2        207785302.12000003  403980193.69775903
When I use the suggested code as below:
df_2 = data.copy()
df_2['cumsum'] = df_2.groupby(['Company', 'Year'])[['Premium']].cumsum()
df_2.sort_values(['Company', 'Year', 'cumsum']).reset_index(drop = True)
It seems each line gets a cumsum value from the lines above it within its group (intermediate table shown as a screenshot in the original post).
To get the table I need, I have to take the max in each group again with a pivot_table:
df_2.pivot_table(
    index=['Year', 'Month'],
    values=['Premium', 'cumsum'],
    columns='Company',
    aggfunc={'Premium': 'sum', 'cumsum': 'max'}
)
which finally gets me to the result I want (screenshot omitted).
Is it really that difficult to get a cumsum table in pandas, or am I just doing it the hard way?
Your dataframe is already in the right format; why do you want to pivot it again?
I think what you are searching for is a pandas groupby:
df['cumsum_by_group'] = df.groupby(['Company', 'Year'])['Premium'].cumsum()
Output:
Company Type Month Year Ferdi Grup Premium cumsum_by_group
1 Allianz Birikimli Hayat 1 2022 Ferdi 325 325
2 Allianz Birikimli Hayat 2 2022 Ferdi 476 801
3 Axa Birikimli Hayat 3 2022 Ferdi 687 687
To calculate the cumulative sum over multiple columns of a dataframe, you can use pandas.DataFrame.groupby and pandas.DataFrame.cumsum combined.
Assuming that data is the dataframe that holds the original dataset, use the code below:
data['Premium'] = data.groupby(['Company', 'Year'])['Premium'].cumsum()
out = data[['Company', 'Month', 'Year', 'Premium']] #to select the specific columns
>>> print(out)
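If the goal is the wide, per-company cumulative table directly, one option (a minimal sketch, assuming data holds the original dataset) is to build the regular sum pivot first and then take the cumulative sum down the rows within each year:
pivot = data.pivot_table(index=['Year', 'Month'],
                         columns='Company',
                         values='Premium',
                         aggfunc='sum')
# cumsum runs down the rows; grouping on the Year index level keeps
# the accumulation from carrying over between years.
cumulative = pivot.groupby(level='Year').cumsum()
print(cumulative)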

Computing the average

In the following dataset:
import pandas as pd
df = pd.DataFrame({'globalid': {0: '4388064', 1: '4388200', 2: '4399344', 3: '4400638', 4: '4401765', 5: '4401831', 6: '4402098', 7: '4406997', 8: '4407331', 9: '4417043', 10: '4437380', 11: '4442467', 12: '4401955', 13: '4425140', 14: '4426164', 15: '4405473', 16: '4411249', 17: '4388584', 18: '4400483', 19: '4433927', 20: '4413441', 21: '4436355', 22: '4443361', 23: '4443375', 24: '4388176'}, 'postcode': {0: '1774PG', 1: '7481LK', 2: '1068MS', 3: '5628EN', 4: '7731TV', 5: '5971CR', 6: '9571BM', 7: '1031KA', 8: '9076BK', 9: '4465AL', 10: '1096AC', 11: '3601', 12: '2563PT', 13: '2341HN', 14: '2553DM', 15: '2403EM', 16: '1051AN', 17: '4525AB', 18: '4542BA', 19: '1096AC', 20: '5508AE', 21: '1096AC', 22: '3543GC', 23: '4105TA', 24: '7742EH'}, 'koopprijs': {0: '139000', 1: '209000', 2: '267500', 3: '349000', 4: '495000', 5: '162500', 6: '217500', 7: '655000', 8: '180000', 9: '495000', 10: '2395000', 11: '355000', 12: '150000', 13: '167500', 14: '710000', 15: '275000', 16: '498000', 17: '324500', 18: '174500', 19: '610000', 20: '300000', 21: '2230000', 22: '749000', 23: '504475', 24: '239000'}, 'place_name': {0: 'Slootdorp', 1: 'Haaksbergen', 2: 'Amsterdam', 3: 'Eindhoven', 4: 'Ommen', 5: 'Grubbenvorst', 6: '2e Exloërmond', 7: 'Amsterdam', 8: 'St.-Annaparochie', 9: 'Goes', 10: 'Amsterdam', 11: 'Maarssen', 12: "'s-Gravenhage", 13: 'Oegstgeest', 14: "'s-Gravenhage", 15: 'Alphen aan den Rijn', 16: 'Amsterdam', 17: 'Retranchement', 18: 'Hoek', 19: 'Amsterdam', 20: 'Veldhoven', 21: 'Amsterdam', 22: 'Utrecht', 23: 'Culemborg', 24: 'Coevorden'}})
print(df)
I would like to compute the average asking price, indicated by 'koopprijs', per place_name. Can someone provide the code or explain how this can be computed? There are multiple 'koopprijs' values for some place names, such as Amsterdam, and I am looking for the average price per place name.
You can try the following:
df['koopprijs'] = df['koopprijs'].astype(int) # just make sure the values are int.
df2 = df.groupby('place_name')['koopprijs'].mean()
print(df2)
You will get the output as:
place_name
's-Gravenhage           430000.0
2e Exloërmond           217500.0
Alphen aan den Rijn     275000.0
Amsterdam              1109250.0
Coevorden               239000.0
Culemborg               504475.0
Eindhoven               349000.0
Goes                    495000.0
Grubbenvorst            162500.0
Haaksbergen             209000.0
Hoek                    174500.0
Maarssen                355000.0
Oegstgeest              167500.0
Ommen                   495000.0
Retranchement           324500.0
Slootdorp               139000.0
St.-Annaparochie        180000.0
Utrecht                 749000.0
Veldhoven               300000.0
Name: koopprijs, dtype: float64
First change the data type for koopprijs and then use groupby-agg:
df['koopprijs'] = df['koopprijs'].astype('int')
df = df.groupby(['place_name'])['koopprijs'].agg('mean')
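If the column may ever contain malformed strings, a slightly more defensive sketch uses pd.to_numeric with errors='coerce', which turns bad values into NaN so that mean() simply skips them:
df['koopprijs'] = pd.to_numeric(df['koopprijs'], errors='coerce')
print(df.groupby('place_name')['koopprijs'].mean())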

Typeerror in pandas when idxmax()

I have the following dataframe:
OI CHNG IN OI VOLUME IV LTP CHNG BID QTY BID PRICE ASK PRICE ASK QTY
STRIKE PRICE
17,450.00 NaN NaN 26 19.45 1.40 -4.05 600 1.15 2.10 500
17,500.00 351 351 772 20.06 1.35 -3.15 2,050 1.35 1.65 450
17,550.00 4 4 13 21.81 2.00 -1.65 600 1.25 2.45 300
17,600.00 1 1 1 21.91 1.60 -1.40 600 1.25 1.95 300
17,650.00 NaN NaN 7 22.15 1.35 -1.05 2,000 1.05 1.95 300
When I do put['OI'].idxmax() it throws an error:
TypeError: reduction operation 'argmax' not allowed for this dtype
Earlier I did put.replace('-', np.nan, inplace=True) to replace the dashes, but I am still getting the same error.
I did some looking, and it seems it's because idxmax expects a numeric dtype; yet as you can see from the df below, the values look like numbers:
df['OI']
STRIKE PRICE
15,200.00 39
15,250.00 14
15,300.00 60
15,350.00 10
15,400.00 199
15,450.00 25
15,500.00 925
15,550.00 131
15,600.00 634
15,650.00 120
15,700.00 1,290
15,750.00 887
15,800.00 4,039
15,850.00 1,207
15,900.00 6,504
15,950.00 1,503
16,000.00 10,704
16,050.00 2,366
16,100.00 9,328
16,150.00 3,348
16,200.00 17,240
16,250.00 9,100
16,300.00 18,938
16,350.00 3,685
16,400.00 15,145
16,450.00 3,654
16,500.00 16,496
16,550.00 2,053
16,600.00 8,982
16,650.00 1,156
16,700.00 6,872
16,750.00 849
16,800.00 4,026
16,850.00 339
16,900.00 3,167
16,950.00 13
17,000.00 6,160
17,050.00 197
17,100.00 641
17,150.00 1
17,200.00 373
17,250.00 NaN
17,300.00 66
17,350.00 236
17,400.00 551
17,450.00 NaN
17,500.00 351
17,550.00 4
17,600.00 1
17,650.00 NaN
Name: OI, dtype: object
I am not sure why I am getting this error.
Perhaps it's because there are , thousands separators in your column, so the values are actually strings. Try removing the commas and converting to float, then re-run idxmax():
>>> df['col'].str.replace(',', '').astype(float).idxmax()
22
Data used
>>> df.to_dict()
{'STRIKE PRICE': {0: '15,200.00',
1: '15,250.00',
2: '15,300.00',
3: '15,350.00',
4: '15,400.00',
5: '15,450.00',
6: '15,500.00',
7: '15,550.00',
8: '15,600.00',
9: '15,650.00',
10: '15,700.00',
11: '15,750.00',
12: '15,800.00',
13: '15,850.00',
14: '15,900.00',
15: '15,950.00',
16: '16,000.00',
17: '16,050.00',
18: '16,100.00',
19: '16,150.00',
20: '16,200.00',
21: '16,250.00',
22: '16,300.00',
23: '16,350.00',
24: '16,400.00',
25: '16,450.00',
26: '16,500.00',
27: '16,550.00',
28: '16,600.00',
29: '16,650.00',
30: '16,700.00',
31: '16,750.00',
32: '16,800.00',
33: '16,850.00',
34: '16,900.00',
35: '16,950.00',
36: '17,000.00',
37: '17,050.00',
38: '17,100.00',
39: '17,150.00',
40: '17,200.00',
41: '17,250.00',
42: '17,300.00',
43: '17,350.00',
44: '17,400.00',
45: '17,450.00',
46: '17,500.00',
47: '17,550.00',
48: '17,600.00',
49: '17,650.00'},
'col': {0: '39',
1: '14',
2: '60',
3: '10',
4: '199',
5: '25',
6: '925',
7: '131',
8: '634',
9: '120',
10: '1,290',
11: '887',
12: '4,039',
13: '1,207',
14: '6,504',
15: '1,503',
16: '10,704',
17: '2,366',
18: '9,328',
19: '3,348',
20: '17,240',
21: '9,100',
22: '18,938',
23: '3,685',
24: '15,145',
25: '3,654',
26: '16,496',
27: '2,053',
28: '8,982',
29: '1,156',
30: '6,872',
31: '849',
32: '4,026',
33: '339',
34: '3,167',
35: '13',
36: '6,160',
37: '197',
38: '641',
39: '1',
40: '373',
41: 'NaN',
42: '66',
43: '236',
44: '551',
45: 'NaN',
46: '351',
47: '4',
48: '1',
49: 'NaN'}}
# Check dtypes
>>> df.dtypes
STRIKE PRICE object
col object
# Run idxmax()
>>> df.idxmax()
TypeError: reduction operation 'argmax' not allowed for this dtype
Your column is dtype object; .idxmax() operates on numeric dtypes.
Try:
put['OI'].apply(lambda x: float(str(x).replace(',', ''))).idxmax()
This strips the comma from each numeric string and converts it to float (the str() guards against real NaN values already in the column).
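A more robust alternative (a sketch, assuming put is the DataFrame from the question) is pd.to_numeric with errors='coerce', which handles commas, 'NaN' strings, and any leftover dashes in one pass:
import pandas as pd

# Strip thousands separators, then coerce anything unparseable to NaN,
# which idxmax skips automatically.
cleaned = pd.to_numeric(put['OI'].astype(str).str.replace(',', ''), errors='coerce')
print(cleaned.idxmax())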

Pandas Compare rows in Dataframe

I have the following data frame (represented by the dictionary below):
{'Name': {0: '204',
1: '110838',
2: '110999',
3: '110998',
4: '111155',
5: '111710',
6: '111157',
7: '111156',
8: '111144',
9: '118972',
10: '111289',
11: '111288',
12: '111145',
13: '121131',
14: '118990',
15: '110653',
16: '110693',
17: '110694',
18: '111577',
19: '111702',
20: '115424',
21: '115127',
22: '115178',
23: '111578',
24: '115409',
25: '115468',
26: '111711',
27: '115163',
28: '115149',
29: '115251'},
'Sequence_new': {0: 1.0,
1: 2.0,
2: 3.0,
3: 4.0,
4: 5.0,
5: 6.0,
6: 7.0,
7: 8.0,
8: 9.0,
9: 10.0,
10: 11.0,
11: 12.0,
12: np.nan,
13: 13.0,
14: 14.0,
15: 15.0,
16: 16.0,
17: 17.0,
18: 18.0,
19: 19.0,
20: 20.0,
21: 21.0,
22: 22.0,
23: 23.0,
24: 24.0,
25: 25.0,
26: 26.0,
27: 27.0,
28: 28.0,
29: 29.0},
'Sequence_old': {0: 1,
1: 2,
2: 3,
3: 4,
4: 5,
5: 6,
6: 7,
7: 8,
8: 9,
9: 10,
10: 11,
11: 12,
12: 13,
13: 14,
14: 15,
15: 16,
16: 17,
17: 18,
18: 19,
19: 20,
20: 21,
21: 22,
22: 23,
23: 24,
24: 25,
25: 26,
26: 27,
27: 28,
28: 29,
29: 30}}
I am trying to understand what changed between the old and new sequences. If, for a Name, Sequence_old equals Sequence_new, nothing changed. If Sequence_new is NaN, the Name was removed. Can you please help implement this in pandas?
What I tried so far, without success:
for i in range(0, len(Merge)):
    if Merge.iloc[i]['Sequence_x'] == Merge.iloc[i]['Sequence_y']:
        Merge.iloc[i]['New'] = 'N'
    else:
        Merge.iloc[i]['New'] = 'Y'
Thank you
You can use a double numpy.where with an isnull condition:
import numpy as np

mask = df.Sequence_old == df.Sequence_new
df['New'] = np.where(df.Sequence_new.isnull(), 'Removed',
                     np.where(mask, 'N', 'Y'))
print(df)
Name Sequence_new Sequence_old New
0 204 1.0 1 N
1 110838 2.0 2 N
2 110999 3.0 3 N
3 110998 4.0 4 N
4 111155 5.0 5 N
5 111710 6.0 6 N
6 111157 7.0 7 N
7 111156 8.0 8 N
8 111144 9.0 9 N
9 118972 10.0 10 N
10 111289 11.0 11 N
11 111288 12.0 12 N
12 111145 NaN 13 Removed
13 121131 13.0 14 Y
14 118990 14.0 15 Y
15 110653 15.0 16 Y
16 110693 16.0 17 Y
17 110694 17.0 18 Y
18 111577 18.0 19 Y
19 111702 19.0 20 Y
20 115424 20.0 21 Y
21 115127 21.0 22 Y
22 115178 22.0 23 Y
23 111578 23.0 24 Y
24 115409 24.0 25 Y
25 115468 25.0 26 Y
26 111711 26.0 27 Y
27 115163 27.0 28 Y
28 115149 28.0 29 Y
29 115251 29.0 30 Y
dic_new = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0, 4: 5.0, 5: 6.0, 6: 7.0, 7: 8.0, 8: 9.0, 9: 10.0, 10: 11.0, 11: 12.0,
12: 'Nan', 13: 13.0, 14: 14.0, 15: 15.0, 16: 16.0, 17: 17.0, 18: 18.0, 19: 19.0, 20: 20.0, 21: 21.0,
22: 22.0, 23: 23.0, 24: 24.0, 25: 25.0, 26: 26.0, 27: 27.0, 28: 28.0, 29: 29.0}
dic_old = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15, 15: 16,
16: 17, 17: 18, 18: 19, 19: 20, 20: 21, 21: 22, 22: 23, 23: 24, 24: 25, 25: 26, 26: 27, 27: 28, 28: 29,
29: 30}
# Does the same thing as the list comprehension below
for a, b in zip(dic_new.items(), dic_old.items()):
    if str(a[1]).lower() != 'nan':  # skip entries removed in the new sequence
        # You can add whatever print statement you want here
        print(a[1] == b[1])

# Does the same thing as the loop above
[print(a[1] == b[1]) for a, b in zip(dic_new.items(), dic_old.items()) if str(a[1]).lower() != 'nan']
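For completeness, the same three-way classification can be done in one vectorized step with numpy.select (a sketch, assuming df is the DataFrame built from the dictionary at the top of the question):
import numpy as np

conditions = [df['Sequence_new'].isna(),
              df['Sequence_old'] == df['Sequence_new']]
df['New'] = np.select(conditions, ['Removed', 'N'], default='Y')
print(df)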
