I have the following dataset, which I have extracted from a pandas DataFrame:
{'Batch': {0: 'Nos705', 1: 'Nos706', 2: 'Nos707', 3: 'Nos708', 4: 'Nos709', 5: 'Nos710', 6: 'Nos711', 7: 'Nos713', 8: 'Nos714', 9: 'Nos715'},
'Message': {0: 'ACBB', 1: 'ACBL', 2: 'ACBL', 3: 'ACBC', 4: 'ACBC', 5: 'ACBC', 6: 'ACBL', 7: 'ACBL', 8: 'ACBL', 9: 'ACBL'},
'DCC': {0: 284, 1: 21, 2: 43, 3: 19, 4: 0, 5: 0, 6: 19, 7: 27, 8: 27, 9: 19},
'DCB': {0: 299, 1: 22, 2: 24, 3: 28, 4: 167, 5: 167, 6: 20, 7: 27, 8: 27, 9: 28},
'ISC': {0: 'Car010030', 1: 'Car010054', 2: 'Car010047', 3: 'Car010182', 4: 'Car010004', 5: 'Car010004', 6: 'Car010182', 7: 'Car010182', 8: 'Car010182', 9: 'Car010182'},
'ISB': {0: 'Car010010', 1: None, 2: None, 3: None, 4: None, 5: None, 6: None, 7: None, 8: None, 9: None},
'VSC': {0: 25, 1: 25, 2: 25, 3: 25, 4: 25, 5: 25, 6: 25, 7: 25, 8: 25, 9: 25},
'VSB': {0: 27.0, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan},
'PGC': {0: 2.78, 1: 2.79, 2: 2.08, 3: 2.08, 4: 2.08, 5: 2.08, 6: 2.71, 7: 1.73, 8: 1.73, 9: 1.73},
'PGB': {0: 2.95, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan},
'PHB': {0: 2.96, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan, 8: nan, 9: nan},
'PHC': {0: 2.94, 1: 2.94, 2: 1.63, 3: 1.63, 4: 1.63, 5: 1.63, 6: 2.06, 7: 1.75, 8: 1.75, 9: 1.75},
'BPC': {0: 3.17, 1: 3.17, 2: 3.17, 3: 3.17, 4: 3.17, 5: 3.17, 6: 3.17, 7: 3.17, 8: 3.17, 9: 3.17},
'BPB': {0: None, 1: None, 2: None, 3: None, 4: None, 5: None, 6: None, 7: None, 8: None, 9: None}}
I want to create a DataFrame in which related columns are stacked,
e.g. all values of DCC & DCB should appear in one column, one below another. Similarly for ISC & ISB, VSC & VSB, PGC & PGB, PHC & PHB, BPC & BPB.
Batch remains the primary key here. How do I do it in Python?
First move the non-repeating columns Batch and Message to the index:
df1 = df.set_index(['Batch','Message'])
Then build a MultiIndex on the columns with MultiIndex.from_arrays, splitting each column name into everything except the last character and the last character, reshape with DataFrame.stack, and add DataFrame.sort_values for the correct order:
df1.columns = pd.MultiIndex.from_arrays([df1.columns.str[:-1], df1.columns.str[-1]],
                                        names=[None, 'types'])
df1 = (df1.stack(dropna=False)
          .reset_index()
          .sort_values(['Batch', 'Message', 'types'],
                       ascending=[True, True, False],
                       ignore_index=True))
print (df1)
Batch Message types BP DC IS PG PH VS
0 Nos705 ACBB C 3.17 284 Car010030 2.78 2.94 25.0
1 Nos705 ACBB B None 299 Car010010 2.95 2.96 27.0
2 Nos706 ACBL C 3.17 21 Car010054 2.79 2.94 25.0
3 Nos706 ACBL B None 22 None NaN NaN NaN
4 Nos707 ACBL C 3.17 43 Car010047 2.08 1.63 25.0
5 Nos707 ACBL B None 24 None NaN NaN NaN
6 Nos708 ACBC C 3.17 19 Car010182 2.08 1.63 25.0
7 Nos708 ACBC B None 28 None NaN NaN NaN
8 Nos709 ACBC C 3.17 0 Car010004 2.08 1.63 25.0
9 Nos709 ACBC B None 167 None NaN NaN NaN
10 Nos710 ACBC C 3.17 0 Car010004 2.08 1.63 25.0
11 Nos710 ACBC B None 167 None NaN NaN NaN
12 Nos711 ACBL C 3.17 19 Car010182 2.71 2.06 25.0
13 Nos711 ACBL B None 20 None NaN NaN NaN
14 Nos713 ACBL C 3.17 27 Car010182 1.73 1.75 25.0
15 Nos713 ACBL B None 27 None NaN NaN NaN
16 Nos714 ACBL C 3.17 27 Car010182 1.73 1.75 25.0
17 Nos714 ACBL B None 27 None NaN NaN NaN
18 Nos715 ACBL C 3.17 19 Car010182 1.73 1.75 25.0
19 Nos715 ACBL B None 28 None NaN NaN NaN
Finally, drop the types column if it is not needed:
df1 = df1.drop('types', axis=1)
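Alternatively, a minimal sketch with pd.wide_to_long (assuming every value column is named as a stub plus a C or B suffix, as in the data above) does the same reshape in one call:
import pandas as pd

# Sketch only: relies on the <stub>C / <stub>B naming convention of the value columns.
long_df = pd.wide_to_long(df,
                          stubnames=['DC', 'IS', 'VS', 'PG', 'PH', 'BP'],
                          i=['Batch', 'Message'],
                          j='types',
                          suffix='[CB]').reset_index()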
I have two DataFrames. The column "video_path" is common to both. I need to extract details from df1 where it matches df2, and also set the yes/no value.
df1, df2, and the expected result were shown as tables in the original question; the dictionaries to recreate the DataFrames are below.
What I tried:
newdf = df1.merge(df2, left_on='video_path', right_on='Video_path', how='inner')
But I'm sure it's not correct.
Code to create the DataFrames:
df1 = {'video_path': {0: 'video/file_path/1.mp4', 1: 'video/file_path/1.mp4', 2: 'video/file_path/1.mp4', 3: 'video/file_path/2.mp4', 4: 'video/file_path/2.mp4', 5: 'video/file_path/2.mp4', 6: 'video/file_path/2.mp4', 7: 'video/file_path/2.mp4', 8: 'video/file_path/3.mp4', 9: 'video/file_path/3.mp4', 10: 'video/file_path/3.mp4', 11: 'video/file_path/4.mp4', 12: 'video/file_path/4.mp4', 13: 'video/file_path/4.mp4', 14: 'video/file_path/4.mp4', 15: 'video/file_path/5.mp4', 16: 'video/file_path/5.mp4', 17: 'video/file_path/5.mp4', 18: 'video/file_path/5.mp4', 19: 'video/file_path/6.mp4', 20: 'video/file_path/6.mp4', 21: 'video/file_path/6.mp4', 22: 'video/file_path/6.mp4', 23: 'video/file_path/6.mp4'}, 'frame_details': {0: 'frame_1.jpg', 1: 'frame_2.jpg', 2: 'frame_3.jpg', 3: 'frame_1.jpg', 4: 'frame_2.jpg', 5: 'frame_3.jpg', 6: 'frame_4.jpg', 7: 'frame_5.jpg', 8: 'frame_1.jpg', 9: 'frame_2.jpg', 10: 'frame_3.jpg', 11: 'frame_1.jpg', 12: 'frame_2.jpg', 13: 'frame_3.jpg', 14: 'frame_4.jpg', 15: 'frame_1.jpg', 16: 'frame_2.jpg', 17: 'frame_3.jpg', 18: 'frame_4.jpg', 19: 'frame_1.jpg', 20: 'frame_2.jpg', 21: 'frame_3.jpg', 22: 'frame_4.jpg', 23: 'frame_5.jpg'}, 'width': {0: 520, 1: 520, 2: 520, 3: 120, 4: 120, 5: 120, 6: 120, 7: 120, 8: 720, 9: 720, 10: 720, 11: 1080, 12: 1080, 13: 1080, 14: 1080, 15: 480, 16: 480, 17: 480, 18: 480, 19: 640, 20: 640, 21: 640, 22: 640, 23: 640}, 'height': {0: 225, 1: 225, 2: 225, 3: 120, 4: 120, 5: 120, 6: 120, 7: 120, 8: 480, 9: 480, 10: 480, 11: 1920, 12: 1920, 13: 1920, 14: 1920, 15: 640, 16: 640, 17: 640, 18: 640, 19: 480, 20: 480, 21: 480, 22: 480, 23: 480}, 'hasAudio': {0: 'yes', 1: 'yes', 2: 'yes', 3: 'yes', 4: 'yes', 5: 'yes', 6: 'yes', 7: 'yes', 8: 'yes', 9: 'yes', 10: 'yes', 11: 'no', 12: 'no', 13: 'no', 14: 'no', 15: 'no', 16: 'no', 17: 'no', 18: 'no', 19: 'yes', 20: 'yes', 21: 'yes', 22: 'yes', 23: 'yes'}}
df2 = {'Video_path': {0: 'video/file_path/1.mp4',
1: 'video/file_path/2.mp4',
2: 'video/file_path/4.mp4',
3: 'video/file_path/6.mp4',
4: 'video/file_path/7.mp4',
5: 'video/file_path/8.mp4',
6: 'video/file_path/9.mp4'},
'isPresent': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan}}
Swap df1 and df2 in a left join with the indicator parameter, then set the isPresent column with Series.map:
newdf = df2.merge(df1.rename(columns={'video_path': 'Video_path'}),
                  on='Video_path',
                  how='left',
                  indicator=True)
newdf['isPresent'] = newdf.pop('_merge').map({'both': 'yes', 'left_only': 'no'})
print (newdf)
Video_path isPresent frame_details width height hasAudio
0 video/file_path/1.mp4 yes frame_1.jpg 520.0 225.0 yes
1 video/file_path/1.mp4 yes frame_2.jpg 520.0 225.0 yes
2 video/file_path/1.mp4 yes frame_3.jpg 520.0 225.0 yes
3 video/file_path/2.mp4 yes frame_1.jpg 120.0 120.0 yes
4 video/file_path/2.mp4 yes frame_2.jpg 120.0 120.0 yes
5 video/file_path/2.mp4 yes frame_3.jpg 120.0 120.0 yes
6 video/file_path/2.mp4 yes frame_4.jpg 120.0 120.0 yes
7 video/file_path/2.mp4 yes frame_5.jpg 120.0 120.0 yes
8 video/file_path/4.mp4 yes frame_1.jpg 1080.0 1920.0 no
9 video/file_path/4.mp4 yes frame_2.jpg 1080.0 1920.0 no
10 video/file_path/4.mp4 yes frame_3.jpg 1080.0 1920.0 no
11 video/file_path/4.mp4 yes frame_4.jpg 1080.0 1920.0 no
12 video/file_path/6.mp4 yes frame_1.jpg 640.0 480.0 yes
13 video/file_path/6.mp4 yes frame_2.jpg 640.0 480.0 yes
14 video/file_path/6.mp4 yes frame_3.jpg 640.0 480.0 yes
15 video/file_path/6.mp4 yes frame_4.jpg 640.0 480.0 yes
16 video/file_path/6.mp4 yes frame_5.jpg 640.0 480.0 yes
17 video/file_path/7.mp4 no NaN NaN NaN NaN
18 video/file_path/8.mp4 no NaN NaN NaN NaN
19 video/file_path/9.mp4 no NaN NaN NaN NaN
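If only the per-video yes/no flag is needed (and not the per-frame details), a shorter sketch with Series.isin would also work, assuming df1 and df2 are the DataFrames built from the dictionaries above:
# Sketch: flag each video in df2 by whether its path appears anywhere in df1.
df2['isPresent'] = (df2['Video_path']
                    .isin(df1['video_path'])
                    .map({True: 'yes', False: 'no'}))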
score is the score received on each delivery, and runs is the cumulative sum of score. sequence is the six-delivery sequence of length/type for each over. I am trying to find the average score of each delivery within a sequence across the whole dataset, and the average runs for a sequence.
Using the code below I get something like this, but the problem is that each length/type does not repeat when it is grouped, so the cumulative of the averages (runs) is not the correct six-ball total:
df_seq=df_seq.reset_index()
df_sq = df_seq.groupby(['sequence', 'length/type']).agg({'score':'mean'})
df_sq['runs']=df_sq.groupby(['sequence'])['score'].cumsum()
df_sq
Here is the original DataFrame's to_dict() with the index reset:
{'Event_name': {0: 'fulham',
1: 'fulham',
2: 'fulham',
3: 'fulham',
4: 'fulham',
5: 'fulham',
6: 'fulham',
7: 'fulham',
8: 'fulham',
9: 'fulham',
10: 'fulham',
11: 'fulham'},
'Batfast_id': {0: 'bfs00200002',
1: 'bfs00200002',
2: 'bfs00200002',
3: 'bfs00200002',
4: 'bfs00200002',
5: 'bfs00200002',
6: 'bfs00200002',
7: 'bfs00200002',
8: 'bfs00200002',
9: 'bfs00200002',
10: 'bfs00200002',
11: 'bfs00200002'},
'Session_no': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 1,
7: 1,
8: 1,
9: 1,
10: 1,
11: 1},
'Overs': {0: 0,
1: 0,
2: 0,
3: 0,
4: 0,
5: 0,
6: 1,
7: 1,
8: 1,
9: 1,
10: 1,
11: 1},
'Deliveries_faced': {0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 7,
8: 8,
9: 9,
10: 10,
11: 11},
'score': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 6.0,
5: 4.0,
6: 0.0,
7: 0.0,
8: 0.0,
9: 0.0,
10: 0.0,
11: 0.0},
'runs': {0: 0.0,
1: 0.0,
2: 0.0,
3: 0.0,
4: 6.0,
5: 10.0,
6: 10.0,
7: 10.0,
8: 10.0,
9: 10.0,
10: 10.0,
11: 10.0},
'delivery_type': {0: 'Extra Slow Leg Spin',
1: 'Extra Slow Leg Spin',
2: 'Slow Straight',
3: 'Extra Slow Off Spin',
4: 'Extra Slow Leg Spin',
5: 'Extra Slow Leg Spin',
6: 'Extra Slow Off Spin',
7: 'Extra Slow Off Spin',
8: 'Slow Straight',
9: 'Extra Slow Leg Spin',
10: 'Extra Slow Off Spin',
11: 'Extra Slow Off Spin'},
'length': {0: 'Yorker',
1: 'Yorker',
2: 'Yorker',
3: 'Yorker',
4: 'Yorker',
5: 'Yorker',
6: 'Yorker',
7: 'Yorker',
8: 'Yorker',
9: 'Yorker',
10: 'Yorker',
11: 'Yorker'},
'length/type': {0: 'ES_LS_Y',
1: 'ES_LS_Y',
2: 'S_S_Y',
3: 'ES_OS_Y',
4: 'ES_LS_Y',
5: 'ES_LS_Y',
6: 'ES_OS_Y',
7: 'ES_OS_Y',
8: 'S_S_Y',
9: 'ES_LS_Y',
10: 'ES_OS_Y',
11: 'ES_OS_Y'},
'sequence': {0: 'ES_LS_Y,ES_LS_Y,S_S_Y,ES_OS_Y,ES_LS_Y,ES_LS_Y',
1: 'ES_LS_Y,ES_LS_Y,S_S_Y,ES_OS_Y,ES_LS_Y,ES_LS_Y',
2: 'ES_LS_Y,ES_LS_Y,S_S_Y,ES_OS_Y,ES_LS_Y,ES_LS_Y',
3: 'ES_LS_Y,ES_LS_Y,S_S_Y,ES_OS_Y,ES_LS_Y,ES_LS_Y',
4: 'ES_LS_Y,ES_LS_Y,S_S_Y,ES_OS_Y,ES_LS_Y,ES_LS_Y',
5: 'ES_LS_Y,ES_LS_Y,S_S_Y,ES_OS_Y,ES_LS_Y,ES_LS_Y',
6: 'ES_OS_Y,ES_OS_Y,S_S_Y,ES_LS_Y,ES_OS_Y,ES_OS_Y',
7: 'ES_OS_Y,ES_OS_Y,S_S_Y,ES_LS_Y,ES_OS_Y,ES_OS_Y',
8: 'ES_OS_Y,ES_OS_Y,S_S_Y,ES_LS_Y,ES_OS_Y,ES_OS_Y',
9: 'ES_OS_Y,ES_OS_Y,S_S_Y,ES_LS_Y,ES_OS_Y,ES_OS_Y',
10: 'ES_OS_Y,ES_OS_Y,S_S_Y,ES_LS_Y,ES_OS_Y,ES_OS_Y',
11: 'ES_OS_Y,ES_OS_Y,S_S_Y,ES_LS_Y,ES_OS_Y,ES_OS_Y'}}
Below is a perfect example because there are only two overs of this sequence in the dataset.
The result I'm trying to get from this sequence is the average score for each delivery and the cumulative average runs, as follows:
score runs
sequence length/type
ES_LS_F,ES_LS_F,ES_LS_F,ES_LS_F,ES_LS_F,ES_LS_F ES_LS_F 0.0 0.0
ES_LS_F 2.0 2.0
ES_LS_F 0.0 2.0
ES_LS_F 0.0 2.0
ES_LS_F 2.0 4.0
ES_LS_F 0.0 4.0
i.e. the score for the first delivery would be (0+0)/2 = 0, the second would be (0+4)/2 = 2, and so on; runs is the cumulative of this. The current solution gives (4+4)/12 = 0.67 as the average score for every delivery, which is not correct.
df_reg['sequence'] = (df_reg.groupby(["Event_name", "Batfast_id", "Session_no", "Overs"])["length/type"]
                            .apply(lambda x: ",".join(x))
                            .loc[lambda x: x.str.count(",") == 5])
If I were able to uniquely number each delivery within the sequence, I would be able to do it.
The question here is not really clear, but I'll give it a try.
First, establish the averages for each (sequence, length/type) pair:
df_grp = df.groupby(['sequence', 'length/type'])[['score']].mean()
Then add those averages back onto your original structure:
df2 = df.set_index(['sequence', 'length/type'])
df2 = df2[[]].merge(df_grp, left_index=True, right_index=True)
This gets us:
score
sequence length/type
ES_LS_Y,ES_LS_Y,S_S_Y,ES_OS_Y,ES_LS_Y,ES_LS_Y ES_LS_Y 2.5
ES_LS_Y 2.5
ES_LS_Y 2.5
ES_LS_Y 2.5
ES_OS_Y 0.0
S_S_Y 0.0
ES_OS_Y,ES_OS_Y,S_S_Y,ES_LS_Y,ES_OS_Y,ES_OS_Y ES_LS_Y 0.0
ES_OS_Y 0.0
ES_OS_Y 0.0
ES_OS_Y 0.0
ES_OS_Y 0.0
S_S_Y 0.0
Now you just need to calculate the cumulative sum for each sequence.
df2_runs = df2.groupby(df2.index)[['score']].cumsum().rename(columns={"score" : "runs"})
df2['runs'] = df2_runs.runs
The final result is:
score runs
sequence length/type
ES_LS_Y,ES_LS_Y,S_S_Y,ES_OS_Y,ES_LS_Y,ES_LS_Y ES_LS_Y 2.5 2.5
ES_LS_Y 2.5 5.0
ES_LS_Y 2.5 7.5
ES_LS_Y 2.5 10.0
ES_OS_Y 0.0 0.0
S_S_Y 0.0 0.0
ES_OS_Y,ES_OS_Y,S_S_Y,ES_LS_Y,ES_OS_Y,ES_OS_Y ES_LS_Y 0.0 0.0
ES_OS_Y 0.0 0.0
ES_OS_Y 0.0 0.0
ES_OS_Y 0.0 0.0
ES_OS_Y 0.0 0.0
S_S_Y 0.0 0.0
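If the goal is the per-delivery average described in the question (so that repeated length/type values within an over are kept separate), a sketch of one way to do it, assuming deliveries can be numbered within each over with cumcount:
# Sketch: number each delivery within its over, then average by position
# within the sequence instead of by length/type.
df['ball_no'] = df.groupby(['Event_name', 'Batfast_id', 'Session_no', 'Overs']).cumcount()
avg = df.groupby(['sequence', 'ball_no'], as_index=False)['score'].mean()
avg['runs'] = avg.groupby('sequence')['score'].cumsum()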
I have a DataFrame df which looks as follows:
Unnamed: 0 Characters Split A B C D Set Names
0 FROKDUWJU [FRO, KDU, WJU] FRO KDU WJU NaN {WJU, KDU, FRO}
1 IDJWPZSUR [IDJ, WPZ, SUR] IDJ WPZ SUR NaN {SUR, WPZ, IDJ}
2 UCFURKIRODCQ [UCF, URK, IRO, DCQ] UCF URK IRO DCQ {UCF, URK, DCQ, IRO}
3 ORI [ORI] ORI NaN NaN NaN {ORI}
4 PROIRKIQARTIBPO [PRO, IRK, IQA, RTI, BPO] PRO IRK IQA RTI {IQA, BPO, PRO, IRK, RTI}
5 QAZWREDCQIBR [QAZ, WRE, DCQ, IBR] QAZ WRE DCQ IBR {DCQ, QAZ, IBR, WRE}
6 PLPRUFSWURKI [PLP, RUF, SWU, RKI] PLP RUF SWU RKI {PLP, SWU, RKI, RUF}
7 FROIEUSKIKIR [FRO, IEU, SKI, KIR] FRO IEU SKI KIR {SKI, IEU, KIR, FRO}
8 ORIUWJZSRFRO [ORI, UWJ, ZSR, FRO] ORI UWJ ZSR FRO {UWJ, ORI, ZSR, FRO}
9 URKIFJVUR [URK, IFJ, VUR] URK IFJ VUR NaN {URK, VUR, IFJ}
10 RUFOFR [RUF, OFR] RUF OFR NaN NaN {OFR, RUF}
11 IEU [IEU] IEU NaN NaN NaN {IEU}
12 PIMIEU [PIM, IEU] PIM IEU NaN NaN {PIM, IEU}
The first column contains certain names. The Characters Split column contains the name split into groups of three letters, in the form of a list. Columns A, B, C, and D contain the breakdown of those three-letter groups. The Set Names column has the same three-letter groups, but in the form of a set.
Some of the three-letter groups are common to different names. For example, "FRO" is present in the names at index 0, 7, and 8. For names which have a three-letter group in common, I'd like to put them into one category, preferably in the form of a list. Is it possible to have these categories for each unique three-letter group? What would be a suitable way to do it?
df.to_dict() is as shown:
{'Unnamed: 0': {0: 'FROKDUWJU',
1: 'IDJWPZSUR',
2: 'UCFURKIRODCQ',
3: 'ORI',
4: 'PROIRKIQARTIBPO',
5: 'QAZWREDCQIBR',
6: 'PLPRUFSWURKI',
7: 'FROIEUSKIKIR',
8: 'ORIUWJZSRFRO',
9: 'URKIFJVUR',
10: 'RUFOFR',
11: 'IEU',
12: 'PIMIEU'},
'Characters Split': {0: ['FRO', 'KDU', 'WJU'],
1: ['IDJ', 'WPZ', 'SUR'],
2: ['UCF', 'URK', 'IRO', 'DCQ'],
3: ['ORI'],
4: ['PRO', 'IRK', 'IQA', 'RTI', 'BPO'],
5: ['QAZ', 'WRE', 'DCQ', 'IBR'],
6: ['PLP', 'RUF', 'SWU', 'RKI'],
7: ['FRO', 'IEU', 'SKI', 'KIR'],
8: ['ORI', 'UWJ', 'ZSR', 'FRO'],
9: ['URK', 'IFJ', 'VUR'],
10: ['RUF', 'OFR'],
11: ['IEU'],
12: ['PIM', 'IEU']},
'A': {0: 'FRO',
1: 'IDJ',
2: 'UCF',
3: 'ORI',
4: 'PRO',
5: 'QAZ',
6: 'PLP',
7: 'FRO',
8: 'ORI',
9: 'URK',
10: 'RUF',
11: 'IEU',
12: 'PIM'},
'B': {0: 'KDU',
1: 'WPZ',
2: 'URK',
3: nan,
4: 'IRK',
5: 'WRE',
6: 'RUF',
7: 'IEU',
8: 'UWJ',
9: 'IFJ',
10: 'OFR',
11: nan,
12: 'IEU'},
'C': {0: 'WJU',
1: 'SUR',
2: 'IRO',
3: nan,
4: 'IQA',
5: 'DCQ',
6: 'SWU',
7: 'SKI',
8: 'ZSR',
9: 'VUR',
10: nan,
11: nan,
12: nan},
'D': {0: nan,
1: nan,
2: 'DCQ',
3: nan,
4: 'RTI',
5: 'IBR',
6: 'RKI',
7: 'KIR',
8: 'FRO',
9: nan,
10: nan,
11: nan,
12: nan},
'Set Names': {0: {'FRO', 'KDU', 'WJU'},
1: {'IDJ', 'SUR', 'WPZ'},
2: {'DCQ', 'IRO', 'UCF', 'URK'},
3: {'ORI'},
4: {'BPO', 'IQA', 'IRK', 'PRO', 'RTI'},
5: {'DCQ', 'IBR', 'QAZ', 'WRE'},
6: {'PLP', 'RKI', 'RUF', 'SWU'},
7: {'FRO', 'IEU', 'KIR', 'SKI'},
8: {'FRO', 'ORI', 'UWJ', 'ZSR'},
9: {'IFJ', 'URK', 'VUR'},
10: {'OFR', 'RUF'},
11: {'IEU'},
12: {'IEU', 'PIM'}}}
You can explode 'Set Names', then group by the exploded column and collect 'Unnamed: 0' into a list per group:
(df.explode('Set Names')
.groupby('Set Names')
['Unnamed: 0'].apply(list)
)
output:
Set Names
BPO [PROIRKIQARTIBPO]
DCQ [UCFURKIRODCQ, QAZWREDCQIBR]
FRO [FROKDUWJU, FROIEUSKIKIR, ORIUWJZSRFRO]
IBR [QAZWREDCQIBR]
IDJ [IDJWPZSUR]
... ...
WJU [FROKDUWJU]
WPZ [IDJWPZSUR]
WRE [QAZWREDCQIBR]
ZSR [ORIUWJZSRFRO]
If you want to filter the output to a minimum number of items per group (here > 1):
(df.explode('Set Names')
.groupby('Set Names')
['Unnamed: 0'].apply(lambda g: list(g) if len(g) > 1 else None)
.dropna()
)
output:
Set Names
DCQ [UCFURKIRODCQ, QAZWREDCQIBR]
FRO [FROKDUWJU, FROIEUSKIKIR, ORIUWJZSRFRO]
IEU [FROIEUSKIKIR, IEU, PIMIEU]
ORI [ORI, ORIUWJZSRFRO]
RUF [PLPRUFSWURKI, RUFOFR]
URK [UCFURKIRODCQ, URKIFJVUR]
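Equivalently, a sketch under the same assumptions that filters by group size after collecting the lists, instead of returning None and dropping it:
# Sketch: collect the lists first, then keep only groups with more than one name.
groups = (df.explode('Set Names')
            .groupby('Set Names')['Unnamed: 0']
            .apply(list))
groups = groups[groups.map(len) > 1]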
I have the following code.
xactions = pd.DataFrame({'user_wid': {0: 3305613, 1: 57, 2: 80, 3: 31, 4: 38, 5: 12, 6: 35, 7: 25, 8: 42, 9: 16}, 'user_name': {0: 'Ter', 1: 'Am', 2: 'Wi', 3: 'Ma', 4: 'St', 5: 'Ju', 6: 'De', 7: 'Ri', 8: 'Ab', 9: 'Ti'}, 'user_age': {0: 41, 1: 34, 2: 45, 3: 47, 4: 70, 5: 64, 6: 64, 7: 63, 8: 32, 9: 24}, 'user_gender': {0: 'Male', 1: 'Female', 2: 'Male', 3: 'Male', 4: 'Male', 5: 'Female', 6: 'Female', 7: 'Female', 8: 'Female', 9: 'Female'}, 'sale_date': {0: '2018-05-15', 1: '2020-02-28', 2: '2020-04-02', 3: '2020-05-09', 4: '2020-11-29', 5: '2020-12-14', 6: '2020-04-21', 7: '2020-06-15', 8: '2020-07-03', 9: '2020-08-10'}, 'days_since_first_visit': {0: 426, 1: 0, 2: 0, 3: 8, 4: 126, 5: 283, 6: 0, 7: 189, 8: 158, 9: 270}, 'visit': {0: 4, 1: 1, 2: 1, 3: 2, 4: 4, 5: 3, 6: 1, 7: 2, 8: 4, 9: 2}, 'num_user_visits': {0: 4, 1: 2, 2: 1, 3: 2, 4: 10, 5: 7, 6: 1, 7: 4, 8: 4, 9: 2}, 'product': {0: 13, 1: 2, 2: 2, 3: 2, 4: 5, 5: 5, 6: 1, 7: 8, 8: 5, 9: 4}, 'sale_price': {0: 10.0, 1: 0.0, 2: 41.3, 3: 41.3, 4: 49.95, 5: 74.95, 6: 49.95, 7: 5.0, 8: 0.0, 9: 0.0}, 'whether_member': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}})
def f(x):
    d = {}
    d['user_name'] = x['user_name'].max()
    d['user_age'] = x['user_age'].max()
    d['user_gender'] = x['user_gender'].max()
    d['last_visit_date'] = x['sale_date'].max()
    d['days_since_first_visit'] = x['days_since_first_visit'].max()
    d['num_visits_window'] = x['visit'].max()
    d['num_visits_total'] = x['num_user_visits'].max()
    d['products_used'] = x['product'].max()
    d['user_total_sales'] = (x['sale_price'].sum()).round(2)
    d['avg_spend_visit'] = (x['sale_price'].sum() / x['visit'].max()).round(2)
    d['membership'] = x['whether_member'].max()
    return pd.Series(d)
users = xactions.groupby('user_wid').apply(f).reset_index()
It is taking too much time to execute, and I want to optimize this function.
Any suggestions would be appreciated.
Thanks in advance.
Try:
users2 = xactions.groupby("user_wid", as_index=False).agg(
    user_name=("user_name", "max"),
    user_age=("user_age", "max"),
    user_gender=("user_gender", "max"),
    last_visit_date=("sale_date", "max"),
    days_since_first_visit=("days_since_first_visit", "max"),
    num_visits_window=("visit", "max"),
    num_visits_total=("num_user_visits", "max"),
    products_used=("product", "max"),
    user_total_sales=("sale_price", "sum"),
    membership=("whether_member", "max"),
)
users2["avg_spend_visit"] = (
    users2["user_total_sales"] / users2["num_visits_window"]
).round(2)
print(users2)
Prints:
user_wid user_name user_age user_gender last_visit_date days_since_first_visit num_visits_window num_visits_total products_used user_total_sales membership avg_spend_visit
0 12 Ju 64 Female 2020-12-14 283 3 7 5 74.95 0 24.98
1 16 Ti 24 Female 2020-08-10 270 2 2 4 0.00 0 0.00
2 25 Ri 63 Female 2020-06-15 189 2 4 8 5.00 0 2.50
3 31 Ma 47 Male 2020-05-09 8 2 2 2 41.30 0 20.65
4 35 De 64 Female 2020-04-21 0 1 1 1 49.95 0 49.95
5 38 St 70 Male 2020-11-29 126 4 10 5 49.95 0 12.49
6 42 Ab 32 Female 2020-07-03 158 4 4 5 0.00 0 0.00
7 57 Am 34 Female 2020-02-28 0 1 2 2 0.00 0 0.00
8 80 Wi 45 Male 2020-04-02 0 1 1 2 41.30 0 41.30
9 3305613 Ter 41 Male 2018-05-15 426 4 4 13 10.00 0 2.50
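If user_total_sales should also be rounded to two decimals, as in the original f, one extra line (a sketch using the same column names) does it:
# Round the summed sales to two decimals to match the original function's output.
users2["user_total_sales"] = users2["user_total_sales"].round(2)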
I have df:
df = pd.DataFrame({'period': {0: pd.Timestamp('2016-05-01 00:00:00'),
1: pd.Timestamp('2017-05-01 00:00:00'),
2: pd.Timestamp('2018-03-01 00:00:00'),
3: pd.Timestamp('2018-04-01 00:00:00'),
4: pd.Timestamp('2016-05-01 00:00:00'),
5: pd.Timestamp('2017-05-01 00:00:00'),
6: pd.Timestamp('2016-03-01 00:00:00'),
7: pd.Timestamp('2016-04-01 00:00:00')},
'cost2': {0: 15,
1: 144,
2: 44,
3: 34,
4: 13,
5: 11,
6: 12,
7: 13},
'rev2': {0: 154,
1: 13,
2: 33,
3: 37,
4: 15,
5: 11,
6: 12,
7: 13},
'cost1': {0: 19,
1: 39,
2: 53,
3: 16,
4: 19,
5: 11,
6: 12,
7: 13},
'rev1': {0: 34,
1: 34,
2: 74,
3: 22,
4: 34,
5: 11,
6: 12,
7: 13},
'destination': {0: 'YYZ',
1: 'YYZ',
2: 'YYZ',
3: 'YYZ',
4: 'DFW',
5: 'DFW',
6: 'DFW',
7: 'DFW'},
'source': {0: 'SFO',
1: 'SFO',
2: 'SFO',
3: 'SFO',
4: 'MIA',
5: 'MIA',
6: 'MIA',
7: 'MIA'}})
df = df[['source','destination','period','rev1','rev2','cost1','cost2']]
which looks like the data above with the columns reordered.
I want the final df to have the following columns:
2017-05-01 2016-05-01
source, destination, rev1, rev2, cost1, cost2, rev1, rev2, cost1, cost2...
So essentially, for every source/destination pair, I want revenue and cost numbers for each date in a single row.
I've been tinkering with stack and unstack but haven't been able to achieve my objective.
You can use set_index + unstack to reshape from long to wide, then swaplevel to get the column index format you need:
(df.set_index(['destination', 'source', 'period'])
   .unstack()
   .swaplevel(0, 1, axis=1)
   .sort_index(level=0, axis=1))
An alternative to .set_index + .unstack is .pivot_table:
df.pivot_table(
    index=['source', 'destination'],
    columns=['period'],
    values=['rev1', 'rev2', 'cost1', 'cost2']
).swaplevel(axis=1).sort_index(axis=1, level=0)
# period 2016-03-01 2016-04-01 ...
# cost1 cost2 rev1 rev2 cost1 cost2 rev1 rev2
# source destination
# MIA DFW 12.0 12.0 12.0 12.0 13.0 13.0 13.0 13.0
# SFO YYZ NaN NaN NaN NaN NaN NaN NaN NaN
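If single-level column names (a date plus a measure per column) are preferred for the final frame, a sketch of one way to flatten the MultiIndex after the pivot, under the same column-name assumptions:
wide = (df.pivot_table(index=['source', 'destination'],
                       columns='period',
                       values=['rev1', 'rev2', 'cost1', 'cost2'])
          .swaplevel(axis=1)
          .sort_index(axis=1, level=0))
# Flatten the (period, measure) MultiIndex into "<date> <measure>" strings.
wide.columns = [f"{period:%Y-%m-%d} {measure}" for period, measure in wide.columns]
wide = wide.reset_index()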