Random selection of values in a tuple - python

Is there any way to use the values from column 'records_to_select' as 'k', in order to select that many random values from the tuple in column 'pnrTuple' (which looks like this: (35784905, 40666303, 47603805, 68229102))? I need to do this for the whole dataframe. Basically, if (35784905, 40666303, 47603805, 68229102) is the value of 'pnrTuple' and 3 is the value of 'records_to_select', I need to randomly select 3 of the ids. I'm open to any other ways to do this.
The code below obviously doesn't work; I just want to show what I'm trying to do.
mass_grouped3['pnrTuple'] = mass_grouped3['pnrTuple'].map(lambda x: random.choices(x, k=mass_grouped3['records_to_select']))
bula gender age hhgr pnr freq records_to_select pnrTuple
1 1 1 3 ['35784905', '40666303', '47603805', '68229102'] 4 4 ('35784905', '40666303', '47603805', '68229102')
1 1 2 1 ['06299501', '07694901', '35070201', '36765601', '97818801'] 5 5 ('06299501', '07694901', '35070201', '36765601', '97818801')
1 1 2 2 ['17182402'] 1 1 ('17182402',)
1 1 2 3 ['07992601', '20164401', '26817203', '50584001'] 4 4 ('07992601', '20164401', '26817203', '50584001')
1 1 3 1 ['07935501', '08720401', '19604501', '26873301', '46069001', '65829601'] 6 6 ('07935501', '08720401', '19604501', '26873301', '46069001', '65829601')
1 1 3 2 ['06529901', '21623801', '21624202', '31730001', '35448801', '36460001', '79142201', '98476701'] 8 5 ('06529901', '21623801', '21624202', '31730001', '35448801', '36460001', '79142201', '98476701')
1 1 3 3 ['08786301', '17729602', '34827202', '35191802', '36106801', '41139001', '60815801', '65889401', '82642901', '89476501', '97523201', '98668501'] 12 8 ('08786301', '17729602', '34827202', '35191802', '36106801', '41139001', '60815801', '65889401', '82642901', '89476501', '97523201', '98668501')
1 1 4 1 ['04282501', '07389801', '08988001', '13514901', '33755101', '36010101', '40009501', '46641001', '49795401', '51045401', '78502101', '84993601', '85047501'] 13 9 ('04282501', '07389801', '08988001', '13514901', '33755101', '36010101', '40009501', '46641001', '49795401', '51045401', '78502101', '84993601', '85047501')
1 1 4 2 ['05250501', '17896401', '27035401', '32701701', '34741602', '42196402', '42891001', '67090301', '69240301', '77546701', '87855401', '96712602'] 12 8 ('05250501', '17896401', '27035401', '32701701', '34741602', '42196402', '42891001', '67090301', '69240301', '77546701', '87855401', '96712602')
1 1 4 3 ['08047701', '08735402', '15113502', '16648302', '21618901', '26166801', '36508001', '40297801', '42864202', '47068001', '54051002', '68229104', '68555401', '76081901', '80639302', '86100502', '88471102', '98655102', '98672301'] 19 13 ('08047701', '08735402', '15113502', '16648302', '21618901', '26166801', '36508001', '40297801', '42864202', '47068001', '54051002', '68229104', '68555401', '76081901', '80639302', '86100502', '88471102', '98655102', '98672301')
1 1 5 1 ['06027001', '14817601', '17035701', '26482001', '40580701', '41411301', '43383101', '50290201', '66963901', '98378101'] 10 7 ('06027001', '14817601', '17035701', '26482001', '40580701', '41411301', '43383101', '50290201', '66963901', '98378101')
1 1 5 2 ['04215802', '04986702', '06021301', '07696001', '08310701', '09248301', '10429402', '13377101', '14652801', '14742402', '16179901', '19003801', '26296401', '30262201', '32109302', '42196401', '43343005', '69230101', '79169901', '81551801', '85026001', '88785201'] 22 15 ('04215802', '04986702', '06021301', '07696001', '08310701', '09248301', '10429402', '13377101', '14652801', '14742402', '16179901', '19003801', '26296401', '30262201', '32109302', '42196401', '43343005', '69230101', '79169901', '81551801', '85026001', '88785201')
1 1 5 3 ['06208701', '10235601', '11200904', '26165901', '28133401', '30318101', '42304401', '48289402', '68324402', '79444601', '86214301', '89292601', '89644901', '95844702', '98833201'] 15 10 ('06208701', '10235601', '11200904', '26165901', '28133401', '30318101', '42304401', '48289402', '68324402', '79444601', '86214301', '89292601', '89644901', '95844702', '98833201')
1 1 6 1 ['04076601', '04299501', '05992601', '06070001', '06749701', '10940601', '11880801', '13789901', '15641601', '15652201', '16359701', '17115201', '17944501', '27168601', '30034901', '40494901', '41876001', '43269501', '43443801', '65935901', '72038401', '76173101', '85624501', '85865301', '86858901', '88302301', '97266501'] 27 19 ('04076601', '04299501', '05992601', '06070001', '06749701', '10940601', '11880801', '13789901', '15641601', '15652201', '16359701', '17115201', '17944501', '27168601', '30034901', '40494901', '41876001', '43269501', '43443801', '65935901', '72038401', '76173101', '85624501', '85865301', '86858901', '88302301', '97266501')
1 1 6 2 ['00305501', '00364401', '00467701', '06004101', '06760101', '13484301', '14101401', '14604101', '15296601', '16701801', '17295801', '19292501', '21692601', '22043401', '26117302', '30296102', '31566301', '32082501', '32975801', '33007502', '33901301', '36627901', '40933601', '40950801', '40953901', '41599201', '41647601', '42030702', '43249601', '43253601', '46177002', '46425001', '60285901', '62801802', '63203001', '63641601', '71358803', '72198201', '78789501', '79287901', '82297701', '85000802', '85458401', '86637402', '86755601', '87113101', '87312501', '87457701', '87617901', '96706301', '97494201', '97549601'] 52 36 ('00305501', '00364401', '00467701', '06004101', '06760101', '13484301', '14101401', '14604101', '15296601', '16701801', '17295801', '19292501', '21692601', '22043401', '26117302', '30296102', '31566301', '32082501', '32975801', '33007502', '33901301', '36627901', '40933601', '40950801', '40953901', '41599201', '41647601', '42030702', '43249601', '43253601', '46177002', '46425001', '60285901', '62801802', '63203001', '63641601', '71358803', '72198201', '78789501', '79287901', '82297701', '85000802', '85458401', '86637402', '86755601', '87113101', '87312501', '87457701', '87617901', '96706301', '97494201', '97549601')
1 1 6 3 ['10368305', '17205801', '20164403', '26295901', '26817201', '40666302', '60751201', '89908101'] 8 5 ('10368305', '17205801', '20164403', '26295901', '26817201', '40666302', '60751201', '89908101')
1 2 1 1 ['00854101'] 1 1 ('00854101',)

Let's create a dummy dataframe first:
import random
import pandas as pd

df = pd.DataFrame({'id': range(1, 101),
                   'tups': [(random.randint(1, 1000000), random.randint(1, 1000000), random.randint(1, 1000000),
                             random.randint(1, 1000000), random.randint(1, 1000000), random.randint(1, 1000000)) for _ in range(100)],
                   'records_to_select': [random.randint(1, 5) for _ in range(100)]})
Let's take a look at that dataframe:
df.head()
id tups records_to_select
0 1 (59216, 532002, 799100, 829539, 968212, 62046) 5
1 2 (217750, 448108, 333314, 417604, 330570, 991236) 2
2 3 (352810, 242235, 466270, 169478, 155754, 29238) 3
3 4 (309312, 867221, 304830, 278511, 547559, 72195) 1
4 5 (872128, 556190, 112937, 33984, 759746, 549025) 2
Here we have a tups column that we want to randomly select from, and a records_to_select column holding the number of values to sample.
The way I generally solve a problem like this is to do it once first. Here we just create a single tuple and figure out how to randomly sample from it. One way is random.sample(population, k). It returns a list by default, so I had it return a tuple as well:
x = (3, 7, 5, 0, 2, 8, 6, 1)
tuple(random.sample(x, 4))
(3, 0, 5, 8)
Now we can use apply along with a lambda to iterate over the dataframe and apply the sampling to every row.
df['samples_from_tuple'] = df.apply(lambda x: tuple(random.sample(x['tups'], x['records_to_select'])), axis=1)
df.head()
id tups records_to_select samples_from_tuple
0 1 (476833, 384041, 789847, 233342, 527508, 892565) 4 (384041, 233342, 527508, 476833)
1 2 (759298, 654362, 244128, 851410, 233892, 612689) 2 (759298, 851410)
2 3 (640435, 391573, 290131, 277103, 250173, 756359) 2 (391573, 277103)
3 4 (788502, 128537, 560571, 42867, 47120, 71505) 1 (47120,)
4 5 (356955, 813874, 731805, 943841, 972449, 247512) 5 (356955, 972449, 813874, 731805, 247512)
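If apply is slow on a large frame, the same sampling can be done with a plain list comprehension over the two columns, which avoids the per-row apply overhead (a sketch, assuming the same df as above):
import random

# Same result as the apply version: sample k values (without replacement)
# from each tuple, where k comes from 'records_to_select'.
df['samples_from_tuple'] = [
    tuple(random.sample(tup, k))
    for tup, k in zip(df['tups'], df['records_to_select'])
]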

As per my understanding, you have a table like the following:
>>> import pandas as pd
>>> df = pd.DataFrame(data={"pnrTuple": [(1,2,3,4,5), (5,3,325,3463,7,23,46,4)], "records_to_select": [3, 5]})
>>> df
                          pnrTuple  records_to_select
0                  (1, 2, 3, 4, 5)                  3
1  (5, 3, 325, 3463, 7, 23, 46, 4)                  5
I believe the function apply would help:
>>> import random
>>> df.apply(lambda row: random.sample(row['pnrTuple'], row['records_to_select']), axis=1)
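Note that random.sample draws without replacement, while the random.choices call in the question draws with replacement. Applied to the question's own columns, the corrected row-wise version might look like this (a sketch, assuming mass_grouped3 is the dataframe shown in the question):
import random

# For each row, draw 'records_to_select' distinct ids from 'pnrTuple'.
mass_grouped3['pnrTuple'] = mass_grouped3.apply(
    lambda row: tuple(random.sample(row['pnrTuple'], row['records_to_select'])),
    axis=1)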

Related

How to Group by the mean of specific columns in Python

In the dataframe below:
import pandas as pd
import numpy as np
data = {
'Gen':['M','M','M','M','F','F','F','F','M','M','M','M','F','F','F','F'],
'Site':['FRX','FX','FRX','FRX','FRX','FX','FRX','FX','FX','FX','FX','FRX','FRX','FRX','FRX','FRX'],
'Type':['L','L','L','L','L','L','L','L','R','R','R','R','R','R','R','R'],
'AIC':['<1','<1','<1','<1',1,1,1,1,2,2,2,2,'>2','>2','>2','>2'],
'AIC_TRX':[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
'diff':[-1,-1,-1,-1,0,0,0,0,1,1,1,1,3,3,3,3],
'series':[1,2,4,8,1,2,4,8,1,2,4,8,1,2,4,8],
'Grwth_Time1':[150.78,162.34,188.53,197.69,208.07,217.76,229.48,139.51,146.87,182.54,189.57,199.97,229.28,244.73,269.91,249.19],
'Grwth_Time2':[250.78,262.34,288.53,297.69,308.07,317.7,329.81,339.15,346.87,382.54,369.59,399.97,329.28,347.73,369.91,349.12],
'Grwth_Time3':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
'Grwth_Time4':[270.84,282.14,298.53,306.69,318.73,327.47,369.63,389.59,398.75,432.18,449.78,473.55,494.85,509.39,515.52,539.23],
'Grwth_Time5':[25.78,22.34,28.53,27.69,30.07,17.7,29.81,33.15,34.87,32.54,36.59,39.97,29.28,34.73,36.91,34.12],
'Grwth_Time6':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
'Grwth_Time7':[27.84,28.14,29.53,30.69,18.73,27.47,36.63,38.59,38.75,24.18,24.78,21.55,13.85,9.39,15.52,39.23],
}
df = pd.DataFrame(data, columns=['Gen', 'Site', 'Type', 'AIC', 'AIC_TRX', 'diff', 'series', 'Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3', 'Grwth_Time4', 'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7'])
df.info()
I want to do the following:
Find the average of each unique series per AIC_TRX for each Grwth_Time (Grwth_Time1, Grwth_Time2, ..., Grwth_Time7)
Export all the outputs as one xlsx file (refer to the figure below)
The desired outputs look like the figure below (note: the numbers in this output are not the actual average values; they were randomly generated)
My attempt:
# Select the columns: AIC_TRX, diff, series, Grwth_Time1, ..., Grwth_Time7
df1 = df[['AIC_TRX', 'diff', 'series',
          'Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3', 'Grwth_Time4',
          'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7']]
# Below is where I need help: I want to group by 'series' and 'AIC_TRX' for all of Grwth_Time1 to Grwth_Time7
df1.groupby('series').Grwth_Time1.agg(['mean'])
Thanks in advance
You have to group by two columns, ['series', 'AIC_TRX'], and find the mean of each Grwth_Time column.
df.groupby(['series', 'AIC_TRX'])[['Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3',
                                   'Grwth_Time4', 'Grwth_Time5', 'Grwth_Time6',
                                   'Grwth_Time7']].mean().unstack().to_excel("output.xlsx")
Output:
AIC_TRX 1 2 3 4
series
1 150.78 208.07 146.87 229.28
2 162.34 217.76 182.54 244.73
4 188.53 229.48 189.57 269.91
8 197.69 139.51 199.97 249.19
AIC_TRX 1 2 3 4
series
1 250.78 308.07 346.87 329.28
2 262.34 317.70 382.54 347.73
4 288.53 329.81 369.59 369.91
8 297.69 339.15 399.97 349.12
AIC_TRX 1 2 3 4
series
1 240.18 338.07 365.87 429.08
2 232.14 307.74 392.48 448.39
4 258.53 359.16 399.97 465.15
8 276.69 339.25 410.75 469.33
AIC_TRX 1 2 3 4
series
1 270.84 318.73 398.75 494.85
2 282.14 327.47 432.18 509.39
4 298.53 369.63 449.78 515.52
8 306.69 389.59 473.55 539.23
AIC_TRX 1 2 3 4
series
1 25.78 30.07 34.87 29.28
2 22.34 17.70 32.54 34.73
4 28.53 29.81 36.59 36.91
8 27.69 33.15 39.97 34.12
AIC_TRX 1 2 3 4
series
1 240.18 338.07 365.87 429.08
2 232.14 307.74 392.48 448.39
4 258.53 359.16 399.97 465.15
8 276.69 339.25 410.75 469.33
AIC_TRX 1 2 3 4
series
1 27.84 18.73 38.75 13.85
2 28.14 27.47 24.18 9.39
4 29.53 36.63 24.78 15.52
8 30.69 38.59 21.55 39.23
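If each Grwth_Time table should land on its own sheet instead of side by side, pd.ExcelWriter can write them one by one (a sketch; the sheet names are my own choice, not from the question):
# One pivot table per Grwth_Time column, each on its own sheet.
grouped = df.groupby(['series', 'AIC_TRX']).mean(numeric_only=True)
with pd.ExcelWriter("output.xlsx") as writer:
    for col in ['Grwth_Time%d' % i for i in range(1, 8)]:
        grouped[col].unstack().to_excel(writer, sheet_name=col)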
Just use the df.apply method to average each row across the columns, within each series and AIC_TRX group.
result = df1.groupby(['series', 'AIC_TRX']).apply(np.mean, axis=1)
Result:
series AIC_TRX
1 1 0 120.738
2 4 156.281
3 8 170.285
4 12 196.270
2 1 1 122.358
2 5 152.758
3 9 184.494
4 13 205.175
4 1 2 135.471
2 6 171.968
3 10 187.825
4 14 214.907
8 1 3 142.183
2 7 162.849
3 11 196.851
4 15 216.455
dtype: float64

Filter DataFrame to rows with 2+ True elements

For example, using
df[(df>1).any(1)]
I can get the rows with any element larger than 1. But if I want the rows with at least 2 elements larger than 1, how can I do that? Thanks!
Try this:
df[(df>1).sum(1).gt(1)]
Demo:
import string
import numpy as np
import pandas as pd

In [118]: df = pd.DataFrame(np.random.rand(10,10)*1.2, columns=list(string.ascii_letters[:10]))
In [119]: df
Out[119]:
a b c d e f g h i j
0 0.934290 0.426050 0.165846 1.114521 1.101023 0.924071 0.241893 0.890354 1.168406 0.506547
1 0.576869 1.091996 0.272124 0.834070 0.229545 0.585501 1.114688 0.957817 1.151957 0.761277
2 0.016659 1.138262 0.481773 0.186753 0.176585 0.497437 0.321805 0.664140 0.738851 0.177179
3 0.192605 0.395377 0.950169 0.678960 0.525349 0.050877 0.181615 0.105080 0.385672 0.401810
4 1.184054 1.097378 0.197706 0.453395 0.258631 1.088337 0.139201 0.217262 0.369734 1.054716
5 0.246081 0.234748 0.879371 0.198397 0.288288 0.534848 0.561080 0.732490 0.156947 0.662194
6 0.660215 0.221513 0.224576 0.049425 0.339101 0.441393 1.122385 0.057968 1.094025 1.130691
7 0.022977 0.681718 0.314200 0.622263 0.692124 0.803743 0.783381 0.715494 0.434911 0.247724
8 0.815742 0.419933 0.019704 0.764557 0.074530 0.990639 0.801125 0.403838 0.680618 1.043551
9 1.061915 0.229453 0.446562 0.324415 0.121421 0.270542 0.884124 0.926168 0.282650 0.267467
In [120]: df[(df>1).sum(1).gt(1)]
Out[120]:
a b c d e f g h i j
0 0.934290 0.426050 0.165846 1.114521 1.101023 0.924071 0.241893 0.890354 1.168406 0.506547
1 0.576869 1.091996 0.272124 0.834070 0.229545 0.585501 1.114688 0.957817 1.151957 0.761277
4 1.184054 1.097378 0.197706 0.453395 0.258631 1.088337 0.139201 0.217262 0.369734 1.054716
6 0.660215 0.221513 0.224576 0.049425 0.339101 0.441393 1.122385 0.057968 1.094025 1.130691
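The same pattern generalizes: the boolean frame's row sum counts how many elements exceed the threshold, so "at least k elements larger than 1" is just .ge(k) on that sum (a sketch using the df above):
# Keep rows where at least k elements are larger than 1.
k = 3
df[(df > 1).sum(axis=1).ge(k)]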

Apply formula across pandas rows / regression line

I'm trying to apply a formula across the rows of a data frame to get the trend of the numbers in each row.
The example below works until the part where .apply is used.
import numpy as np
import pandas as pd
import scipy.stats

df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
axisvalues = list(range(1, len(df.columns) + 1))

def calc_slope(row):
    return scipy.stats.linregress(df.iloc[row, :], y=axisvalues)

calc_slope(1)  # this works
df["New"] = df.apply(calc_slope, axis=1)  # this fails: "too many values to unpack"
Thank you for any help.
I think you need to return only one attribute:
def calc_slope(row):
    a = scipy.stats.linregress(row, y=axisvalues)
    return a.slope

df["slope"] = df.apply(calc_slope, axis=1)
print(df)
A B C D slope
0 0.444640 0.024624 -0.016216 0.228935 -2.553465
1 1.226611 1.962481 1.103834 0.645562 -1.455239
2 -0.259415 0.971097 0.124538 -0.704115 -0.718621
3 1.938422 1.787310 -0.619745 -2.560187 -0.575519
4 -0.986231 -1.942930 2.677379 -1.813071 0.075679
5 0.611214 -0.258453 0.053452 1.223544 0.841865
6 0.685435 0.962880 -1.517077 -0.101108 -0.652503
7 0.368278 1.314202 0.748189 2.116189 1.350132
8 -0.322053 -1.135443 -0.161071 -1.836761 -0.987341
9 0.798461 0.461736 -0.665127 -0.247887 -1.610447
And for all attributes, convert the named tuple to a dict and then to a Series. The output is a new DataFrame, so if necessary, join it to the original:
np.random.seed(1997)
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
axisvalues = list(range(1, len(df.columns) + 1))

def calc_slope(row):
    a = scipy.stats.linregress(row, y=axisvalues)
    return pd.Series(a._asdict())

print(df.apply(calc_slope, axis=1))
slope intercept rvalue pvalue stderr
0 -2.553465 2.935355 -0.419126 0.580874 3.911302
1 -1.455239 4.296670 -0.615324 0.384676 1.318236
2 -0.718621 2.523733 -0.395862 0.604138 1.178774
3 -0.575519 2.578530 -0.956682 0.043318 0.123843
4 0.075679 2.539066 0.127254 0.872746 0.417101
5 0.841865 2.156991 0.425333 0.574667 1.266674
6 -0.652503 2.504915 -0.561947 0.438053 0.679154
7 1.350132 0.965285 0.794704 0.205296 0.729193
8 -0.987341 1.647104 -0.593680 0.406320 0.946311
9 -1.610447 2.639780 -0.828856 0.171144 0.768641
df = df.join(df.apply(calc_slope,axis=1))
print (df)
A B C D slope intercept rvalue \
0 0.444640 0.024624 -0.016216 0.228935 -2.553465 2.935355 -0.419126
1 1.226611 1.962481 1.103834 0.645562 -1.455239 4.296670 -0.615324
2 -0.259415 0.971097 0.124538 -0.704115 -0.718621 2.523733 -0.395862
3 1.938422 1.787310 -0.619745 -2.560187 -0.575519 2.578530 -0.956682
4 -0.986231 -1.942930 2.677379 -1.813071 0.075679 2.539066 0.127254
5 0.611214 -0.258453 0.053452 1.223544 0.841865 2.156991 0.425333
6 0.685435 0.962880 -1.517077 -0.101108 -0.652503 2.504915 -0.561947
7 0.368278 1.314202 0.748189 2.116189 1.350132 0.965285 0.794704
8 -0.322053 -1.135443 -0.161071 -1.836761 -0.987341 1.647104 -0.593680
9 0.798461 0.461736 -0.665127 -0.247887 -1.610447 2.639780 -0.828856
pvalue stderr
0 0.580874 3.911302
1 0.384676 1.318236
2 0.604138 1.178774
3 0.043318 0.123843
4 0.872746 0.417101
5 0.574667 1.266674
6 0.438053 0.679154
7 0.205296 0.729193
8 0.406320 0.946311
9 0.171144 0.768641
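If only the slope is needed, it can also be computed for all rows at once without a per-row Python call, using the closed-form least-squares formula with the same x/y orientation as linregress(row, y=axisvalues) above (a vectorized sketch; it selects the A-D columns explicitly, so it works even after the join):
import numpy as np

# slope_i = cov(row_i, axisvalues) / var(row_i), computed row-wise.
x = df[list('ABCD')].to_numpy()
y = np.asarray(axisvalues, dtype=float)
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean()
slopes = (xc * yc).sum(axis=1) / (xc ** 2).sum(axis=1)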

Calculate values without looping

I am attempting to do a Monte Carlo-esque projection using pandas on some stock prices. I used numpy to create some random correlated values for the percentage price change; however, I am struggling with how to use those values to create a 'running tally' of the actual asset price. So I have a DataFrame that looks like this:
abc xyz def
0 0.093889 0.113750 0.082923
1 -0.130293 -0.148742 -0.061890
2 0.062175 -0.005463 0.022963
3 -0.029041 -0.015918 0.006735
4 -0.048950 -0.010945 -0.034421
5 0.082868 0.080570 0.074637
6 0.048782 -0.030702 -0.003748
7 -0.027402 -0.065221 -0.054764
8 0.095154 0.063978 0.039480
9 0.059001 0.114566 0.056582
How can I create something like this, where abc_px = previous price * (1 + abc)? I know I could iterate over the rows, but I would rather not for performance reasons.
Something like, assuming the initial price on all of these was 100:
abc xyz def abc_px xyz_px def_px
0 0.093889 0.11375 0.082923 109.39 111.38 108.29
1 -0.130293 -0.148742 -0.06189 95.14 94.81 101.59
2 0.062175 -0.005463 0.022963 101.05 94.29 103.92
3 -0.029041 -0.015918 0.006735 98.12 92.79 104.62
4 -0.04895 -0.010945 -0.034421 93.31 91.77 101.02
5 0.082868 0.08057 0.074637 101.05 99.17 108.56
6 0.048782 -0.030702 -0.003748 105.98 96.12 108.15
7 -0.027402 -0.065221 -0.054764 103.07 89.85 102.23
8 0.095154 0.063978 0.03948 112.88 95.60 106.27
9 0.059001 0.114566 0.056582 119.54 106.56 112.28
Is that what you want?
In [131]: new = df.add_suffix('_px') + 1
In [132]: new
Out[132]:
abc_px xyz_px def_px
0 1.093889 1.113750 1.082923
1 0.869707 0.851258 0.938110
2 1.062175 0.994537 1.022963
3 0.970959 0.984082 1.006735
4 0.951050 0.989055 0.965579
5 1.082868 1.080570 1.074637
6 1.048782 0.969298 0.996252
7 0.972598 0.934779 0.945236
8 1.095154 1.063978 1.039480
9 1.059001 1.114566 1.056582
In [133]: df.join(new.cumprod() * 100)
Out[133]:
abc xyz def abc_px xyz_px def_px
0 0.093889 0.113750 0.082923 109.388900 111.375000 108.292300
1 -0.130293 -0.148742 -0.061890 95.136292 94.808860 101.590090
2 0.062175 -0.005463 0.022963 101.051391 94.290919 103.922903
3 -0.029041 -0.015918 0.006735 98.116758 92.789996 104.622824
4 -0.048950 -0.010945 -0.034421 93.313942 91.774410 101.021601
5 0.082868 0.080570 0.074637 101.046682 99.168674 108.561551
6 0.048782 -0.030702 -0.003748 105.975941 96.123997 108.154662
7 -0.027402 -0.065221 -0.054764 103.071989 89.854694 102.231680
8 0.095154 0.063978 0.039480 112.879701 95.603418 106.267787
9 0.059001 0.114566 0.056582 119.539716 106.556319 112.280631
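Equivalently, the suffix can be added after the compounding, which reads a little more directly (a sketch, same df and a starting price of 100):
# price_t = 100 * prod(1 + r) over the first t returns, per column.
prices = (1 + df).cumprod().mul(100).add_suffix('_px')
result = df.join(prices)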

After groupby and sum, how to get the max value rows in `pandas.DataFrame`?

Here is the df (I updated it with real data):
>TIMESTAMP OLTPSOURCE RNR RQDRECORD
>20150425232836 0PU_IS_PS_44 REQU_51NHAJUV06IMMP16BVE572JM2 17020
>20150128165726 ZFI_DS41 REQU_50P1AABLYXE86KYE3O6EY390M 6925
>20150701144253 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM 2
>20150107201358 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6 14205
>20150623215202 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q 0
>20150715144139 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY 25381
>20150625175157 0HR_PA_0 REQU_528ZS1RFN0N3Y3AEB48UDCUKQ 100020
>20150309153828 0HR_PA_0 REQU_51385K5F3AGGFVCGHU997QF9M 0
>20150626185531 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A 0
>20150307222336 0FUNCT_LOC_ATTR REQU_513JJ6I6ER5ZVW5CAJMVSKAJQ 13889
>20150630163419 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2 0
>20150424162226 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I 0
>20150617143720 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM 6
>20150405162213 0HR_PA_0 REQU_51FFR7T4YQ2F766PFY0W9WUDM 0
>20150202165933 ZFI_DS41 REQU_50QPTCF0VPGLBYM9MGFXMWHGM 6925
>20150102162140 0HR_PA_0 REQU_50CNUT7I9OXH2WSNLC4WTUZ7U 0
>20150417184916 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU 0
>20150416220451 0FUNCT_LOC_ATTR REQU_51JP3BDCD6TUOBL2GK9ZE35UU 13889
>20150205150633 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM 6667
>20150419230724 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I 22528
>and the relationships between the columns are:
>OLTPSOURCE -- RNR: 1:n
>RNR -- RQDRECORD: 1:n
and my requirement is:
sum RQDRECORD by RNR;
get the max sum result for every OLTPSOURCE;
finally, draw a graph showing the largest summed result of each OLTPSOURCE over time.
Thanks everyone. To further explain my problem:
if OLTPSOURCE:RNR:RQDRECORD = 1:1:1,
just sum RQDRECORD, and return OLTPSOURCE and the sum result;
if OLTPSOURCE:RNR:RQDRECORD = 1:1:N,
just sum RQDRECORD, and return OLTPSOURCE and the sum result;
if OLTPSOURCE:RNR:RQDRECORD = 1:N:(N or 1),
sum RQDRECORD by RNR group first, then find the max result per OLTPSOURCE, and return every OLTPSOURCE with its max RQDRECORD.
So for the above sample data, I eventually want the result as follows
>TIMESTAMP OLTPSOURCE RNR RQDRECORD
>20150623215202 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q 0
>20150107201358 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6 14205
>20150626185531 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A 0
>20150417184916 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU 0
>20150416220451 0FUNCT_LOC_ATTR REQU_51JP3BDCD6TUOBL2GK9ZE35UU 13889
>20150625175157 0HR_PA_0 REQU_528ZS1RFN0N3Y3AEB48UDCUKQ 100020
>20150715144139 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY 25381
>20150419230724 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I 22528
>20150630163419 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2 0
>20150424162226 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I 0
>20150202165933 ZFI_DS41 REQU_50QPTCF0VPGLBYM9MGFXMWHGM 6925
>20150205150633 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM 6667
>20150617143720 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM 6
>20150701144253 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM 2
Referring to EdChum's approach, I made some adjustments; the results were as follows. Because the amount of data is too big, I applied a filter 'RQDRECORD > 100000'; in fact I would like to sort and then take the top 100, but without success.
(result image: http://i.imgur.com/FgfZaDY.jpg)
You can take the groupby result and call max on it, passing param level=0 (or level='clsa' if you prefer); this will return the max count for that level. However, this loses the 'clsb' column, so you can merge the result back to your grouped result after calling reset_index on the grouped object. You can then reorder the resulting df columns using fancy indexing:
In [149]:
gp = df.groupby(['clsa', 'clsb']).sum()
result = gp.groupby(level=0).max().reset_index().merge(gp.reset_index())
result = result.loc[:, ['clsa', 'clsb', 'count']]
result
Out[149]:
clsa clsb count
0 a a1 9
1 b b2 8
2 c c2 10
import matplotlib.pyplot as plt

df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'], format='%Y%m%d%H%M%S')
df_gb = df.groupby(['OLTPSOURCE', 'RNR'], as_index=False).aggregate('sum')
final = pd.merge(df[['TIMESTAMP', 'OLTPSOURCE', 'RNR']],
                 df_gb.groupby(['OLTPSOURCE'], as_index=False).first(),
                 on=['OLTPSOURCE', 'RNR'], how='right').sort_values('OLTPSOURCE')
final.plot(kind='bar')
plt.show()
print(final)
TIMESTAMP OLTPSOURCE RNR \
3 2015-06-23 21:52:02 0CO_OM_CCA_1 REQU_528XSXYWTK6FSJXDQY2ROQQ4Q
2 2015-01-07 20:13:58 0EQUIPMENT_ATTR REQU_50EVHXSDOITYUQLP4L8UXOBT6
5 2015-06-26 18:55:31 0FI_AA_001 REQU_52BO3RJCOG4JGHEIIZMJP9V4A
11 2015-04-17 18:49:16 0FI_AA_004 REQU_51KFWWT6PPTI5X44D3MWD7CYU
6 2015-03-07 22:23:36 0FUNCT_LOC_ATTR REQU_513JJ6I6ER5ZVW5CAJMVSKAJQ
4 2015-07-15 14:41:39 0HRPOSITION_TEXT REQU_52I9KQ1LN4ZWTNIP0N1R68NDY
10 2015-01-02 16:21:40 0HR_PA_0 REQU_50CNUT7I9OXH2WSNLC4WTUZ7U
13 2015-04-19 23:07:24 0PU_IS_PS_44 REQU_51LC5XX6VWEERAVHEFJ9K5A6I
7 2015-06-30 16:34:19 0WBS_ELEMT_ATTR REQU_52CUPVUFCY2DDOG6SPQ1XOYQ2
8 2015-04-24 16:22:26 6DB_V_DGP_EXPORTDATA REQU_51N1F5ZC8G3LW68E4TFXRGH9I
0 2015-01-28 16:57:26 ZFI_DS41 REQU_50P1AABLYXE86KYE3O6EY390M
12 2015-02-05 15:06:33 ZHR_DS09 REQU_50RFRYRADMA9QXB1PW4PRF5XM
9 2015-06-17 14:37:20 ZRZMS_TEXT REQU_5268R1YE6G1U7HUK971LX1FPM
1 2015-07-01 14:42:53 ZZZJB_TEXT REQU_52DV5FB812JCDXDVIV9P35DGM
RQDRECORD
3 0
2 14205
5 0
11 0
6 13889
4 25381
10 0
13 22528
7 0
8 0
0 6925
12 6667
9 6
1 2
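A more direct idiom for "max row per group" on the summed frame is idxmax, which picks each group's row label at the maximum and avoids the merge (a sketch, reusing the df from the question and assuming RQDRECORD is numeric):
# Sum RQDRECORD per (OLTPSOURCE, RNR), then keep each source's max row.
sums = df.groupby(['OLTPSOURCE', 'RNR'], as_index=False)['RQDRECORD'].sum()
top = sums.loc[sums.groupby('OLTPSOURCE')['RQDRECORD'].idxmax()]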
