I have been scratching my head for days over this problem. Please find below the structure of my input data and the output that I want.
I color-coded each group by ID, Plot, Survey, Trial and the 3 estimation methods.
In the output, I want to get all the scorings for each group, represented by a color, on the same row. Doing that should get rid of the Estimation Method column in the output; I kept it for the sake of clarity.
This is my code. Thank you in advance for your time.
import numpy as np  # the sample data below uses np.nan
import pandas as pd
import functools
data_dict = {'ID': {0: 'id1',
1: 'id1',
2: 'id1',
3: 'id1',
4: 'id1',
5: 'id1',
6: 'id1',
7: 'id1',
8: 'id1',
9: 'id1',
10: 'id1',
11: 'id1',
12: 'id1',
13: 'id1',
14: 'id1',
15: 'id1',
16: 'id1',
17: 'id1',
18: 'id1',
19: 'id1',
20: 'id1',
21: 'id1',
22: 'id1',
23: 'id1'},
'Plot': {0: 'p1',
1: 'p1',
2: 'p1',
3: 'p1',
4: 'p1',
5: 'p1',
6: 'p1',
7: 'p1',
8: 'p1',
9: 'p1',
10: 'p1',
11: 'p1',
12: 'p1',
13: 'p1',
14: 'p1',
15: 'p1',
16: 'p1',
17: 'p1',
18: 'p1',
19: 'p1',
20: 'p1',
21: 'p1',
22: 'p1',
23: 'p1'},
'Survey': {0: 'Sv1',
1: 'Sv1',
2: 'Sv1',
3: 'Sv1',
4: 'Sv1',
5: 'Sv1',
6: 'Sv2',
7: 'Sv2',
8: 'Sv2',
9: 'Sv2',
10: 'Sv2',
11: 'Sv2',
12: 'Sv1',
13: 'Sv1',
14: 'Sv1',
15: 'Sv1',
16: 'Sv1',
17: 'Sv1',
18: 'Sv2',
19: 'Sv2',
20: 'Sv2',
21: 'Sv2',
22: 'Sv2',
23: 'Sv2'},
'Trial': {0: 't1',
1: 't1',
2: 't1',
3: 't2',
4: 't2',
5: 't2',
6: 't1',
7: 't1',
8: 't1',
9: 't2',
10: 't2',
11: 't2',
12: 't1',
13: 't1',
14: 't1',
15: 't2',
16: 't2',
17: 't2',
18: 't1',
19: 't1',
20: 't1',
21: 't2',
22: 't2',
23: 't2'},
'Mission': {0: 'mission1',
1: 'mission1',
2: 'mission1',
3: 'mission1',
4: 'mission1',
5: 'mission1',
6: 'mission1',
7: 'mission1',
8: 'mission1',
9: 'mission1',
10: 'mission1',
11: 'mission2',
12: 'mission2',
13: 'mission2',
14: 'mission2',
15: 'mission2',
16: 'mission2',
17: 'mission2',
18: 'mission2',
19: 'mission2',
20: 'mission2',
21: 'mission2',
22: 'mission2',
23: 'mission2'},
'Estimation Method': {0: 'MCARI2',
1: 'NDVI',
2: 'NDRE',
3: 'MCARI2',
4: 'NDVI',
5: 'NDRE',
6: 'MCARI2',
7: 'NDVI',
8: 'NDRE',
9: 'MCARI2',
10: 'NDVI',
11: 'NDRE',
12: 'MCARI2',
13: 'NDVI',
14: 'NDRE',
15: 'MCARI2',
16: 'NDVI',
17: 'NDRE',
18: 'MCARI2',
19: 'NDVI',
20: 'NDRE',
21: 'MCARI2',
22: 'NDVI',
23: 'NDRE'},
'MCARI2_sd': {0: 1.5,
1: np.nan,
2: np.nan,
3: 10.0,
4: np.nan,
5: np.nan,
6: 1.5,
7: np.nan,
8: np.nan,
9: 10.0,
10: np.nan,
11: np.nan,
12: 101.0,
13: np.nan,
14: np.nan,
15: 23.5,
16: np.nan,
17: np.nan,
18: 111.0,
19: np.nan,
20: np.nan,
21: 72.0,
22: np.nan,
23: np.nan},
'MACRI2_50': {0: 12.4,
1: np.nan,
2: np.nan,
3: 11.0,
4: np.nan,
5: np.nan,
6: 12.4,
7: np.nan,
8: np.nan,
9: 11.0,
10: np.nan,
11: np.nan,
12: 102.0,
13: np.nan,
14: np.nan,
15: 2.1,
16: np.nan,
17: np.nan,
18: 112.0,
19: np.nan,
20: np.nan,
21: 74.0,
22: np.nan,
23: np.nan},
'MACRI2_AVG': {0: 15.0,
1: np.nan,
2: np.nan,
3: 12.0,
4: np.nan,
5: np.nan,
6: 15.0,
7: np.nan,
8: np.nan,
9: 12.0,
10: np.nan,
11: np.nan,
12: 103.0,
13: np.nan,
14: np.nan,
15: 24.0,
16: np.nan,
17: np.nan,
18: 113.0,
19: np.nan,
20: np.nan,
21: 77.0,
22: np.nan,
23: np.nan},
'NDVI_sd': {0: np.nan,
1: 2.9,
2: np.nan,
3: np.nan,
4: 20.0,
5: np.nan,
6: np.nan,
7: 2.9,
8: np.nan,
9: np.nan,
10: 20.0,
11: np.nan,
12: np.nan,
13: 201.0,
14: np.nan,
15: np.nan,
16: 11.0,
17: np.nan,
18: np.nan,
19: 200.0,
20: np.nan,
21: np.nan,
22: 32.0,
23: np.nan},
'NDVI_50': {0: np.nan,
1: 21.0,
2: np.nan,
3: np.nan,
4: 21.0,
5: np.nan,
6: np.nan,
7: 21.0,
8: np.nan,
9: np.nan,
10: 21.0,
11: np.nan,
12: np.nan,
13: 201.0,
14: np.nan,
15: np.nan,
16: 12.0,
17: np.nan,
18: np.nan,
19: 300.0,
20: np.nan,
21: np.nan,
22: 39.0,
23: np.nan},
'NDVI_AVG': {0: np.nan,
1: 27.0,
2: np.nan,
3: np.nan,
4: 22.0,
5: np.nan,
6: np.nan,
7: 27.0,
8: np.nan,
9: np.nan,
10: 22.0,
11: np.nan,
12: np.nan,
13: 203.0,
14: np.nan,
15: np.nan,
16: 13.0,
17: np.nan,
18: np.nan,
19: 400.0,
20: np.nan,
21: np.nan,
22: 40.0,
23: np.nan},
'NDRE_sd': {0: np.nan,
1: np.nan,
2: 3.1,
3: np.nan,
4: np.nan,
5: 31.0,
6: np.nan,
7: np.nan,
8: 3.1,
9: np.nan,
10: np.nan,
11: 31.0,
12: np.nan,
13: np.nan,
14: 301.0,
15: np.nan,
16: np.nan,
17: 15.0,
18: np.nan,
19: np.nan,
20: 57.0,
21: np.nan,
22: np.nan,
23: 21.0},
'NDRE_50': {0: np.nan,
1: np.nan,
2: 33.0,
3: np.nan,
4: np.nan,
5: 32.0,
6: np.nan,
7: np.nan,
8: 33.0,
9: np.nan,
10: np.nan,
11: 32.0,
12: np.nan,
13: np.nan,
14: 302.0,
15: np.nan,
16: np.nan,
17: 16.0,
18: np.nan,
19: np.nan,
20: 58.0,
21: np.nan,
22: np.nan,
23: 22.0},
'NDRE_AVG': {0: np.nan,
1: np.nan,
2: 330.0,
3: np.nan,
4: np.nan,
5: 33.0,
6: np.nan,
7: np.nan,
8: 330.0,
9: np.nan,
10: np.nan,
11: 33.0,
12: np.nan,
13: np.nan,
14: 303.0,
15: np.nan,
16: np.nan,
17: 17.0,
18: np.nan,
19: np.nan,
20: 59.0,
21: np.nan,
22: np.nan,
23: 32.0}}
df_test = pd.DataFrame(data_dict)
def generate_data_per_EM(df):
    data_survey = []
    for (survey, mission, trial, em), data in df.groupby(['Survey', 'Mission', 'Trial', 'Estimation Method']):
        df_em = data.set_index('ID').dropna(axis=1)
        df_em.to_csv(f'tmp_data_{survey}_{mission}_{trial}_{em}.csv')  # This generates 74 files, but not sure how to join/merge them
        data_survey.append(df_em)
    # Merge the df_em column-wise
    df_final = functools.reduce(lambda left, right: pd.merge(left, right, on=['ID', 'Survey', 'Mission', 'Trial']), data_survey)
    df_final.to_csv(f'final_{survey}_{mission}_{em}.csv')  # Output is not what I expected

generate_data_per_EM(df_test)
You need a groupby:
(df_test
 .groupby(['ID', 'Plot', 'Survey', 'Trial', 'Mission'], as_index=False, sort=False)
 .first(numeric_only=True)
)
ID Plot Survey Trial Mission MCARI2_sd MACRI2_50 MACRI2_AVG NDVI_sd NDVI_50 NDVI_AVG NDRE_sd NDRE_50 NDRE_AVG
0 id1 p1 Sv1 t1 mission1 1.5 12.4 15.0 2.9 21.0 27.0 3.1 33.0 330.0
1 id1 p1 Sv1 t2 mission1 10.0 11.0 12.0 20.0 21.0 22.0 31.0 32.0 33.0
2 id1 p1 Sv2 t1 mission1 1.5 12.4 15.0 2.9 21.0 27.0 3.1 33.0 330.0
3 id1 p1 Sv2 t2 mission1 10.0 11.0 12.0 20.0 21.0 22.0 NaN NaN NaN
4 id1 p1 Sv2 t2 mission2 72.0 74.0 77.0 32.0 39.0 40.0 31.0 32.0 33.0
5 id1 p1 Sv1 t1 mission2 101.0 102.0 103.0 201.0 201.0 203.0 301.0 302.0 303.0
6 id1 p1 Sv1 t2 mission2 23.5 2.1 24.0 11.0 12.0 13.0 15.0 16.0 17.0
7 id1 p1 Sv2 t1 mission2 111.0 112.0 113.0 200.0 300.0 400.0 57.0 58.0 59.0
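For context: first returns the first non-null value of each column within every group, which is what collapses the three per-method rows into one. Since each group holds exactly one non-null value per scoring column, any NaN-skipping aggregation gives the same result; a minimal sketch using max after dropping the now-redundant Estimation Method column:
out = (df_test
       .drop(columns='Estimation Method')
       .groupby(['ID', 'Plot', 'Survey', 'Trial', 'Mission'],
                as_index=False, sort=False)
       .max())  # with one non-null value per group, max just picks it
print(out)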
I've been using RMarkdown to create graphs. Then I take the graphs and copy and paste them into PowerPoint presentations. That's been my workflow.
Here is the dataframe that I am using.
{'Unnamed: 0': {0: 'Mazda RX4', 1: 'Mazda RX4 Wag', 2: 'Datsun 710', 3: 'Hornet 4 Drive', 4: 'Hornet Sportabout', 5: 'Valiant', 6: 'Duster 360', 7: 'Merc 240D', 8: 'Merc 230', 9: 'Merc 280', 10: 'Merc 280C', 11: 'Merc 450SE', 12: 'Merc 450SL', 13: 'Merc 450SLC', 14: 'Cadillac Fleetwood', 15: 'Lincoln Continental', 16: 'Chrysler Imperial', 17: 'Fiat 128', 18: 'Honda Civic', 19: 'Toyota Corolla', 20: 'Toyota Corona', 21: 'Dodge Challenger', 22: 'AMC Javelin', 23: 'Camaro Z28', 24: 'Pontiac Firebird', 25: 'Fiat X1-9', 26: 'Porsche 914-2', 27: 'Lotus Europa', 28: 'Ford Pantera L', 29: 'Ferrari Dino', 30: 'Maserati Bora', 31: 'Volvo 142E'}, 'mpg': {0: 21.0, 1: 21.0, 2: 22.8, 3: 21.4, 4: 18.7, 5: 18.1, 6: 14.3, 7: 24.4, 8: 22.8, 9: 19.2, 10: 17.8, 11: 16.4, 12: 17.3, 13: 15.2, 14: 10.4, 15: 10.4, 16: 14.7, 17: 32.4, 18: 30.4, 19: 33.9, 20: 21.5, 21: 15.5, 22: 15.2, 23: 13.3, 24: 19.2, 25: 27.3, 26: 26.0, 27: 30.4, 28: 15.8, 29: 19.7, 30: 15.0, 31: 21.4}, 'cyl': {0: 6, 1: 6, 2: 4, 3: 6, 4: 8, 5: 6, 6: 8, 7: 4, 8: 4, 9: 6, 10: 6, 11: 8, 12: 8, 13: 8, 14: 8, 15: 8, 16: 8, 17: 4, 18: 4, 19: 4, 20: 4, 21: 8, 22: 8, 23: 8, 24: 8, 25: 4, 26: 4, 27: 4, 28: 8, 29: 6, 30: 8, 31: 4}, 'disp': {0: 160.0, 1: 160.0, 2: 108.0, 3: 258.0, 4: 360.0, 5: 225.0, 6: 360.0, 7: 146.7, 8: 140.8, 9: 167.6, 10: 167.6, 11: 275.8, 12: 275.8, 13: 275.8, 14: 472.0, 15: 460.0, 16: 440.0, 17: 78.7, 18: 75.7, 19: 71.1, 20: 120.1, 21: 318.0, 22: 304.0, 23: 350.0, 24: 400.0, 25: 79.0, 26: 120.3, 27: 95.1, 28: 351.0, 29: 145.0, 30: 301.0, 31: 121.0}, 'hp': {0: 110, 1: 110, 2: 93, 3: 110, 4: 175, 5: 105, 6: 245, 7: 62, 8: 95, 9: 123, 10: 123, 11: 180, 12: 180, 13: 180, 14: 205, 15: 215, 16: 230, 17: 66, 18: 52, 19: 65, 20: 97, 21: 150, 22: 150, 23: 245, 24: 175, 25: 66, 26: 91, 27: 113, 28: 264, 29: 175, 30: 335, 31: 109}, 'drat': {0: 3.9, 1: 3.9, 2: 3.85, 3: 3.08, 4: 3.15, 5: 2.76, 6: 3.21, 7: 3.69, 8: 3.92, 9: 3.92, 10: 3.92, 11: 3.07, 12: 3.07, 13: 3.07, 14: 2.93, 15: 3.0, 16: 3.23, 17: 4.08, 18: 4.93, 19: 4.22, 20: 3.7, 21: 2.76, 22: 3.15, 23: 3.73, 24: 3.08, 25: 4.08, 26: 4.43, 27: 3.77, 28: 4.22, 29: 3.62, 30: 3.54, 31: 4.11}, 'wt': {0: 2.62, 1: 2.875, 2: 2.32, 3: 3.215, 4: 3.44, 5: 3.46, 6: 3.57, 7: 3.19, 8: 3.15, 9: 3.44, 10: 3.44, 11: 4.07, 12: 3.73, 13: 3.78, 14: 5.25, 15: 5.424, 16: 5.345, 17: 2.2, 18: 1.615, 19: 1.835, 20: 2.465, 21: 3.52, 22: 3.435, 23: 3.84, 24: 3.845, 25: 1.935, 26: 2.14, 27: 1.513, 28: 3.17, 29: 2.77, 30: 3.57, 31: 2.78}, 'qsec': {0: 16.46, 1: 17.02, 2: 18.61, 3: 19.44, 4: 17.02, 5: 20.22, 6: 15.84, 7: 20.0, 8: 22.9, 9: 18.3, 10: 18.9, 11: 17.4, 12: 17.6, 13: 18.0, 14: 17.98, 15: 17.82, 16: 17.42, 17: 19.47, 18: 18.52, 19: 19.9, 20: 20.01, 21: 16.87, 22: 17.3, 23: 15.41, 24: 17.05, 25: 18.9, 26: 16.7, 27: 16.9, 28: 14.5, 29: 15.5, 30: 14.6, 31: 18.6}, 'vs': {0: 0, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1, 6: 0, 7: 1, 8: 1, 9: 1, 10: 1, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 1, 18: 1, 19: 1, 20: 1, 21: 0, 22: 0, 23: 0, 24: 0, 25: 1, 26: 0, 27: 1, 28: 0, 29: 0, 30: 0, 31: 1}, 'am': {0: 1, 1: 1, 2: 1, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 1, 18: 1, 19: 1, 20: 0, 21: 0, 22: 0, 23: 0, 24: 0, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1}, 'gear': {0: 4, 1: 4, 2: 4, 3: 3, 4: 3, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4, 10: 4, 11: 3, 12: 3, 13: 3, 14: 3, 15: 3, 16: 3, 17: 4, 18: 4, 19: 4, 20: 3, 21: 3, 22: 3, 23: 3, 24: 3, 25: 4, 26: 5, 27: 5, 28: 5, 29: 5, 30: 5, 31: 4}, 'carb': {0: 4, 1: 4, 2: 1, 3: 1, 4: 2, 5: 1, 6: 4, 7: 2, 8: 2, 9: 4, 10: 4, 11: 3, 12: 3, 13: 3, 
14: 4, 15: 4, 16: 4, 17: 1, 18: 2, 19: 1, 20: 1, 21: 2, 22: 2, 23: 4, 24: 2, 25: 1, 26: 2, 27: 2, 28: 4, 29: 6, 30: 8, 31: 2}}
The code looks like this.
```{r, warning = FALSE, message = FALSE}
library(ggplot2)  # needed for aes(), geom_histogram(), labs() and theme()
ggplot2::ggplot(data = mtcars, aes(x = wt, y = after_stat(count))) +
geom_histogram(bins = 32, color = 'black', fill = '#ffe6b7') +
labs(title = "Mtcars", subtitle = "Histogram") +
theme(plot.title = element_text(face = "bold"))
ggplot2::ggplot(data = mtcars, aes(x = mpg, y = after_stat(count))) +
geom_histogram(bins = 32, color = 'black', fill = '#ffe6b7') +
labs(title = "Mtcars", subtitle = "Histogram") +
theme(plot.title = element_text(face = "bold"))
ggplot2::ggplot(data = mtcars, aes(x = disp, y = after_stat(count))) +
geom_histogram(bins = 32, color = 'black', fill = '#ffe6b7') +
labs(title = "Mtcars", subtitle = "Histogram") +
theme(plot.title = element_text(face = "bold"))
```
And here is a screenshot of the output.
Now I'm trying to do the same using Python graphs. I'm finding that I can't do exactly the same thing, because the graphs start overlapping.
```{python}
import seaborn
import matplotlib.pyplot as plt

seaborn.histplot(data=mtcars, x="wt", bins = 30)
plt.title("wt histogram", loc = 'left')
plt.show()
seaborn.histplot(data=mtcars, x="mpg", bins = 30)
plt.title("mpg histogram", loc = 'left')
plt.show()
seaborn.histplot(data=mtcars, x="disp", bins = 30)
plt.title("disp histogram", loc = 'left')
plt.show()
```
So now what I'm doing is clearing out the figure state after I create every single graph. The output now looks fine: I get a distinct histogram for each variable I'm calling.
```{python}
plt.figure().clear()
plt.close()
plt.cla()
plt.clf()
seaborn.histplot(data=mtcars, x="wt", bins = 30)
plt.title("wt histogram", loc = 'left')
plt.show()
plt.figure().clear()
plt.close()
plt.cla()
plt.clf()
seaborn.histplot(data=mtcars, x="mpg", bins = 30)
plt.title("mpg histogram", loc = 'left')
plt.show()
plt.figure().clear()
plt.close()
plt.cla()
plt.clf()
seaborn.histplot(data=mtcars, x="disp", bins = 30)
plt.title("disp histogram", loc = 'left')
plt.show()
```
The output is definitely better.
But isn't this method really redundant? What do people who use Python more regularly do to manage their figures? Do you all clear out the state every time in this way?
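For what it's worth, the usual matplotlib idiom is to create a fresh figure per plot up front, most often through the object-oriented fig, ax = plt.subplots() interface, instead of clearing state after the fact. A minimal sketch, assuming mtcars is the same DataFrame as above:
```{python}
import matplotlib.pyplot as plt
import seaborn

# One figure per histogram: seaborn draws onto the Axes it is given,
# so each plot gets its own canvas and nothing needs clearing.
for col in ["wt", "mpg", "disp"]:
    fig, ax = plt.subplots()                     # fresh figure and axes
    seaborn.histplot(data=mtcars, x=col, bins=30, ax=ax)
    ax.set_title(f"{col} histogram", loc="left")
    plt.show()
```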
I have a df:
import pandas as pd

df_test = pd.DataFrame.from_dict({('group', ''): {0: 'A',
1: 'A',
2: 'A',
3: 'A',
4: 'A',
5: 'A',
6: 'A',
7: 'A',
8: 'A',
9: 'B',
10: 'B',
11: 'B',
12: 'B',
13: 'B',
14: 'B',
15: 'B',
16: 'B',
17: 'B',
18: 'all',
19: 'all'},
('category', ''): {0: 'Amazon',
1: 'Apple',
2: 'Facebook',
3: 'Google',
4: 'Netflix',
5: 'Tesla',
6: 'Total',
7: 'Uber',
8: 'total',
9: 'Amazon',
10: 'Apple',
11: 'Facebook',
12: 'Google',
13: 'Netflix',
14: 'Tesla',
15: 'Total',
16: 'Uber',
17: 'total',
18: 'Total',
19: 'total'},
(pd.Timestamp('2020-06-29'), 'last_sales'): {0: 195.0,
1: 61.0,
2: 106.0,
3: 61.0,
4: 37.0,
5: 13.0,
6: 954.0,
7: 4.0,
8: 477.0,
9: 50.0,
10: 50.0,
11: 75.0,
12: 43.0,
13: 17.0,
14: 14.0,
15: 504.0,
16: 3.0,
17: 252.0,
18: 2916.0,
19: 2916.0},
(pd.Timestamp('2020-06-29'), 'sales'): {0: 1268.85,
1: 18274.385000000002,
2: 19722.65,
3: 55547.255,
4: 15323.800000000001,
5: 1688.6749999999997,
6: 227463.23,
7: 1906.0,
8: 113731.615,
9: 3219.6499999999996,
10: 15852.060000000001,
11: 17743.7,
12: 37795.15,
13: 5918.5,
14: 1708.75,
15: 166349.64,
16: 937.01,
17: 83174.82,
18: 787625.7400000001,
19: 787625.7400000001},
(pd.Timestamp('2020-06-29'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2020-07-06'), 'last_sales'): {0: 26.0,
1: 39.0,
2: 79.0,
3: 49.0,
4: 10.0,
5: 10.0,
6: 436.0,
7: 5.0,
8: 218.0,
9: 89.0,
10: 34.0,
11: 133.0,
12: 66.0,
13: 21.0,
14: 20.0,
15: 732.0,
16: 3.0,
17: 366.0,
18: 2336.0,
19: 2336.0},
(pd.Timestamp('2020-07-06'), 'sales'): {0: 3978.15,
1: 12138.96,
2: 19084.175,
3: 40033.46000000001,
4: 4280.15,
5: 1495.1,
6: 165548.29,
7: 1764.15,
8: 82774.145,
9: 8314.92,
10: 12776.649999999996,
11: 28048.075,
12: 55104.21000000002,
13: 6962.844999999999,
14: 3053.2000000000003,
15: 231049.11000000002,
16: 1264.655,
17: 115524.55500000001,
18: 793194.8000000002,
19: 793194.8000000002},
(pd.Timestamp('2020-07-06'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2021-06-28'), 'last_sales'): {0: 96.0,
1: 56.0,
2: 106.0,
3: 44.0,
4: 34.0,
5: 13.0,
6: 716.0,
7: 9.0,
8: 358.0,
9: 101.0,
10: 22.0,
11: 120.0,
12: 40.0,
13: 13.0,
14: 8.0,
15: 610.0,
16: 1.0,
17: 305.0,
18: 2652.0,
19: 2652.0},
(pd.Timestamp('2021-06-28'), 'sales'): {0: 5194.95,
1: 19102.219999999994,
2: 22796.420000000002,
3: 30853.115,
4: 11461.25,
5: 992.6,
6: 188143.41,
7: 3671.15,
8: 94071.705,
9: 6022.299999999998,
10: 7373.6,
11: 33514.0,
12: 35943.45,
13: 4749.000000000001,
14: 902.01,
15: 177707.32,
16: 349.3,
17: 88853.66,
18: 731701.46,
19: 731701.46},
(pd.Timestamp('2021-06-28'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0},
(pd.Timestamp('2021-07-07'), 'last_sales'): {0: 45.0,
1: 47.0,
2: 87.0,
3: 45.0,
4: 13.0,
5: 8.0,
6: 494.0,
7: 2.0,
8: 247.0,
9: 81.0,
10: 36.0,
11: 143.0,
12: 56.0,
13: 9.0,
14: 9.0,
15: 670.0,
16: 1.0,
17: 335.0,
18: 2328.0,
19: 2328.0},
(pd.Timestamp('2021-07-07'), 'sales'): {0: 7556.414999999998,
1: 14985.05,
2: 16790.899999999998,
3: 36202.729999999996,
4: 4024.97,
5: 1034.45,
6: 163960.32999999996,
7: 1385.65,
8: 81980.16499999998,
9: 5600.544999999999,
10: 11209.92,
11: 32832.61,
12: 42137.44500000001,
13: 3885.1499999999996,
14: 1191.5,
15: 194912.34000000003,
16: 599.0,
17: 97456.17000000001,
18: 717745.3400000001,
19: 717745.3400000001},
(pd.Timestamp('2021-07-07'), 'difference'): {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 0.0,
5: 0.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0,
12: 0.0,
13: 0.0,
14: 0.0,
15: 0.0,
16: 0.0,
17: 0.0,
18: 0.0,
19: 0.0}}).set_index(['group','category'])
Which follows this structure:
2020-06-29 2020-07-06
group category last_sales sales difference
A Amazon 195 1268 0
Apple 61 18247 0
Facebook 106 19722 0
...
B Amazon 50 3219 0
Apple 50 15852 0
Facebook 75 17743 0
...
I am trying to move last_sales, sales and difference next to category. At the moment those are columns under each week; I am trying to make them some kind of index, where the df would look like this:
2020-06-29 2020-07-06
group category
A Amazon last_sales 195 ...
sales 1268 ...
difference 0 ...
Apple last_sales 61 ...
sales 18247 ...
difference 0 ...
Facebook last_sales 106 ...
sales 19722 ...
difference 0 ...
...
B Amazon last_sales 50 ...
sales 3219 ...
difference 0 ...
Apple last_sales 50 ...
sales 15852 ...
difference 0 ...
Facebook last_sales 75 ...
sales 17743 ...
difference 0 ...
...
I've been trying to achieve this using .unstack() and .reset_index(), but with no success; I am not sure how I should proceed with this problem.
The problem I am unable to solve comes from the fact that:
df_test.columns
Returns:
MultiIndex([(2020-06-29 00:00:00, 'last_sales'),
(2020-06-29 00:00:00, 'sales'),
(2020-06-29 00:00:00, 'difference'),
(2020-07-06 00:00:00, 'last_sales'),
(2020-07-06 00:00:00, 'sales'),
(2020-07-06 00:00:00, 'difference'),
(2021-06-28 00:00:00, 'last_sales'),
(2021-06-28 00:00:00, 'sales'),
(2021-06-28 00:00:00, 'difference'),
(2021-07-07 00:00:00, 'last_sales'),
(2021-07-07 00:00:00, 'sales'),
(2021-07-07 00:00:00, 'difference')],
)
So one part, the date, should stay as a column, and the other part, last_sales, sales and difference, should become one new column (or index, I am not sure, but I hope my desired output explains it).
If original ordering is not important, use DataFrame.stack:
df = df_test.stack()
print (df.head())
2020-06-29 00:00:00 2020-07-06 00:00:00 \
group category
A Amazon difference 0.00 0.00
last_sales 195.00 26.00
sales 1268.85 3978.15
Apple difference 0.00 0.00
last_sales 61.00 39.00
2021-06-28 00:00:00 2021-07-07 00:00:00
group category
A Amazon difference 0.00 0.000
last_sales 96.00 45.000
sales 5194.95 7556.415
Apple difference 0.00 0.000
last_sales 56.00 47.000
For original ordering use:
# not needed for this sample data; with real data, drop unused levels first:
# df_test.columns = df_test.columns.remove_unused_levels()
cats = pd.CategoricalIndex(df_test.columns.levels[1],
                           ordered=True,
                           categories=['last_sales', 'sales', 'difference'])
df_test.columns = df_test.columns.set_levels(cats, level=1)
df = df_test.stack()
print (df.head())
2020-06-29 00:00:00 2020-07-06 00:00:00 \
group category
A Amazon last_sales 195.000 26.00
sales 1268.850 3978.15
difference 0.000 0.00
Apple last_sales 61.000 39.00
sales 18274.385 12138.96
2021-06-28 00:00:00 2021-07-07 00:00:00
group category
A Amazon last_sales 96.00 45.000
sales 5194.95 7556.415
difference 0.00 0.000
Apple last_sales 56.00 47.000
sales 19102.22 14985.050
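A possible alternative, if you prefer not to touch the column dtype: stack first, then reorder the new inner index level explicitly with reindex. A sketch, assuming the original df_test:
df = (df_test
      .stack()  # rows become (group, category, metric)
      .reindex(['last_sales', 'sales', 'difference'], level=2))
print(df.head())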
I have the following dictionary:
total_working_years_dict = dict(df.TotalWorkingYears.value_counts())
total_working_years_dict
{10: 202,
6: 125,
8: 103,
9: 96,
5: 88,
1: 81,
7: 81,
4: 63,
12: 48,
3: 42,
15: 40,
16: 37,
13: 36,
11: 36,
21: 34,
17: 33,
14: 31,
2: 31,
20: 30,
18: 27,
19: 22,
23: 22,
22: 21,
24: 18,
25: 14,
28: 14,
26: 14,
0: 11,
29: 10,
31: 9,
32: 9,
27: 7,
30: 7,
33: 7,
36: 6,
34: 5,
37: 4,
35: 3,
40: 2,
38: 1}
The keys are working years and the values are the numbers of employees with that much experience. I would like to transform my dictionary so that total working years are given in ranges (0,6), (6,11) etc.
Do you have any idea how to do that?
Let's start from your dict as a Series:
s = pd.Series(total_working_years_dict)
You can use pandas.cut to form your groups:
s.index = pd.cut(s.index, bins=range(0,100,6))
output:
(6.0, 12.0] 202
(0.0, 6.0] 125
(6.0, 12.0] 103
(6.0, 12.0] 96
(0.0, 6.0] 88
...
(30.0, 36.0] 3
(36.0, 42.0] 2
(36.0, 42.0] 1
dtype: int64
NB: if you now want to aggregate the counts per group, it would be more efficient to apply the pandas.cut operation before your initial value_counts (see the sketch below). Also, I don't see the point of converting the Series to a dict if you need to process it further.
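To illustrate that note with code: a sketch that sums the counts per range from the Series above, and the more direct route that bins the raw column before counting (assuming df.TotalWorkingYears as in the question):
# Aggregate the existing counts per interval:
counts_per_range = s.groupby(level=0).sum()

# Or bin the raw column first and count once:
counts_per_range = pd.cut(df.TotalWorkingYears,
                          bins=range(0, 100, 6)).value_counts(sort=False)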
As requested, here is a minimal reproducible example that will generate the issue: .isin() does not drop the values that are not in the .isin() list from value_counts(); it just shows them with a count of zero:
import pandas as pd
df_example = pd.DataFrame({'Requesting as': {0: 'Employee', 1: 'Ex- Employee', 2: 'Employee', 3: 'Employee', 4: 'Ex-Employee', 5: 'Employee', 6: 'Employee', 7: 'Employee', 8: 'Ex-Employee', 9: 'Ex-Employee', 10: 'Employee', 11: 'Employee', 12: 'Ex-Employee', 13: 'Ex-Employee', 14: 'Employee', 15: 'Employee', 16: 'Employee', 17: 'Ex-Employee', 18: 'Employee', 19: 'Employee', 20: 'Ex-Employee', 21: 'Employee', 22: 'Employee', 23: 'Ex-Employee', 24: 'Employee', 25: 'Employee', 26: 'Ex-Employee', 27: 'Employee', 28: 'Employee', 29: 'Ex-Employee', 30: 'Employee', 31: 'Employee', 32: 'Ex-Employee', 33: 'Employee', 34: 'Employee', 35: 'Ex-Employee', 36: 'Employee', 37: 'Employee', 38: 'Ex-Employee', 39: 'Employee', 40: 'Employee'}, 'Years of service': {0: -0.4, 1: -0.3, 2: -0.2, 3: 1.0, 4: 1.0, 5: 1.0, 6: 2.0, 7: 2.0, 8: 2.0, 9: 2.0, 10: 3.0, 11: 3.0, 12: 3.0, 13: 4.0, 14: 4.0, 15: 4.0, 16: 5.0, 17: 5.0, 18: 5.0, 19: 5.0, 20: 6.0, 21: 6.0, 22: 6.0, 23: 11.0, 24: 11.0, 25: 11.0, 26: 16.0, 27: 17.0, 28: 18.0, 29: 21.0, 30: 22.0, 31: 23.0, 32: 26.0, 33: 27.0, 34: 28.0, 35: 31.0, 36: 32.0, 37: 33.0, 38: 35.0, 39: 36.0, 40: 37.0}, 'yos_bins': {0: 0, 1: 0, 2: 0, 3: '0-1', 4: '0-1', 5: '0-1', 6: '1-2', 7: '1-2', 8: '1-2', 9: '1-2', 10: '2-3', 11: '2-3', 12: '2-3', 13: '3-4', 14: '3-4', 15: '3-4', 16: '4-5', 17: '4-5', 18: '4-5', 19: '4-5', 20: '5-6', 21: '5-6', 22: '5-6', 23: '10-15', 24: '10-15', 25: '10-15', 26: '15-20', 27: '15-20', 28: '15-20', 29: '20-40', 30: '20-40', 31: '20-40', 32: '20-40', 33: '20-40', 34: '20-40', 35: '20-40', 36: '20-40', 37: '20-40', 38: '20-40', 39: '20-40', 40: '20-40'}})
cut_labels = ['0-1','1-2', '2-3', '3-4', '4-5', '5-6', '6-10', '10-15', '15-20', '20-40']
cut_bins = (0, 1, 2, 3, 4, 5, 6, 10, 15, 20, 40)
df_example['yos_bins'] = pd.cut(df_example['Years of service'], bins=cut_bins, labels=cut_labels)
print(df_example['yos_bins'].value_counts())
print(len(df_example['yos_bins']))
print(len(df_example))
test = df_example[df_example['yos_bins'].isin(['0-1', '1-2', '2-3'])]
print('test dataframe:\n',test)
print('\n')
print('test value counts of yos_bins:\n', test['yos_bins'].value_counts())
print('\n')
dic_test = test.to_dict()
print(dic_test)
print('\n')
print(test.value_counts())
I have created bins for a column with "years of service":
cut_labels = ['0-1','1-2', '2-3', '3-4', '4-5', '5-6', '6-10', '10-15', '15-20', '20-40']
cut_bins = (0, 1, 2, 3, 4, 5, 6, 10, 15, 20, 40)
df['yos_bins'] = pd.cut(df['Years of service'], bins=cut_bins, labels=cut_labels)
Then I applied .isin() to the dataframe column called 'yos_bins', intending to filter for a selection of column values. Excerpt from the column in df:
The column I use to slice is called 'yos_bins' (i.e. binned Years of Service). I want to select only 3 ranges (0-1, 1-2, 2-3 years), but apparently there are more ranges included in the column.
To my surprise, when I apply value_counts(), I still get all values of the yos_bins column from the df dataframe (but with 0 counts).
test.yos_bins.value_counts()
Looks like this:
This was not intended: all bins other than the 3 passed to isin() should have been dropped. The resulting issue is that the 0-count values show up in sns.countplot, so I end up with undesired columns with zero counts.
When I save the df to_excel(), all "10-15" value fields show a "Text Date with 2-Digit Year" error. I do not load that dataframe back into python, so not sure if this could cause the problem?
Does anybody know how I can create the test dataframe so that it merely consists of the 3 selected yos_bins values, instead of showing all yos_bins values, some with zeros?
An ugly solution, because numpy and pandas are misfeatured in terms of element-wise "is in": in my experience I end up doing the comparison manually with numpy arrays.
yos_bins = np.array(df["yos_bins"])
yos_bins_sel = np.array(["0-1", "1-2", "2-3"])
mask = (yos_bins[:, None] == yos_bins_sel[None, :]).any(1)
df[mask]
Requesting as Years of service yos_bins
3 Employee 1.0 0-1
4 Ex-Employee 1.0 0-1
5 Employee 1.0 0-1
6 Employee 2.0 1-2
7 Employee 2.0 1-2
8 Ex-Employee 2.0 1-2
9 Ex-Employee 2.0 1-2
10 Employee 3.0 2-3
11 Employee 3.0 2-3
12 Ex-Employee 3.0 2-3
Explanation
(using x as yos_bins and y as yos_bins_sel)
(x[:, None] == y[None, :]).any(1) is the main takeaway. x[:, None] converts x from shape (n,) to (n, 1); y[None, :] converts y from shape (m,) to (1, m). Comparing them with == forms a broadcast element-wise boolean array of shape (n, m). We want our array to be (n,)-shaped, so we apply .any(1), which compresses the second dimension to True if at least one of its booleans is True (that is, if the element is in the yos_bins_sel array). You end up with a boolean array which can be used to mask the original DataFrame. Replace x with the array containing the values to be compared, and y with the array that the values of x should be contained in, and you can do this for any data set.
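A note on the root cause, for completeness: pd.cut produces a Categorical column, and value_counts() on a categorical reports every category, including ones no row uses after filtering. Keeping the plain .isin() filter and dropping the unused categories is usually enough; a minimal sketch:
test = df_example[df_example['yos_bins'].isin(['0-1', '1-2', '2-3'])].copy()
# The rows are filtered, but the dtype still remembers all ten labels;
# dropping the unused categories makes value_counts() (and sns.countplot)
# show only the three selected bins.
test['yos_bins'] = test['yos_bins'].cat.remove_unused_categories()
print(test['yos_bins'].value_counts())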