I was working on a machine learning project, and while extracting features I found that some consumers' LCLid values disappear from the dataset when I group by LCLid.
Dataset: SmartMeter Energy Consumption Data in London Households
Here is the original dataset,
and here is the code that I used to extract some features:
import pandas as pd

# build the list of consumer IDs, assuming they increase by 1 from MAC000228
LCLid = []
for i in range(68):
    LCLid.append('MAC0000' + str(228 + i))

# per-consumer aggregates
consommation = data.groupby('LCLid')['KWH/hh'].sum()
consommation_min = data.groupby('LCLid')['KWH/hh'].min()
consommation_max = data.groupby('LCLid')['KWH/hh'].max()
consommation_mean = data.groupby('LCLid')['KWH/hh'].mean()
consommation_evening = data.groupby(['LCLid', 'period'])['KWH/hh'].mean()

# build the feature dataframe
list_of_tuples = list(zip(LCLid, consommation, consommation_min, consommation_max, consommation_mean))
data2 = pd.DataFrame(list_of_tuples, columns=['LCLid', 'Consumption', 'Consumption_min', 'Consumption_max', 'Consumption_mean'])
As you can see, after running the code the resulting dataset stops at LCLid MAC000282, while the original dataset also contains the LCLids from MAC000283 to MAC000295.
Using low-carbon-london-data from SmartMeter Energy Consumption Data in London Households
The issue is that LCLid does not uniformly increment by 1 from MAC000228 to MAC000295.
print(data.LCLid.unique())
array(['MAC000228', 'MAC000229', 'MAC000230', 'MAC000231', 'MAC000232',
'MAC000233', 'MAC000234', 'MAC000235', 'MAC000237', 'MAC000238',
'MAC000239', 'MAC000240', 'MAC000241', 'MAC000242', 'MAC000243',
'MAC000244', 'MAC000245', 'MAC000246', 'MAC000248', 'MAC000249',
'MAC000250', 'MAC000251', 'MAC000252', 'MAC000253', 'MAC000254',
'MAC000255', 'MAC000256', 'MAC000258', 'MAC000260', 'MAC000262',
'MAC000263', 'MAC000264', 'MAC000267', 'MAC000268', 'MAC000269',
'MAC000270', 'MAC000271', 'MAC000272', 'MAC000273', 'MAC000274',
'MAC000275', 'MAC000276', 'MAC000277', 'MAC000279', 'MAC000280',
'MAC000281', 'MAC000282', 'MAC000283', 'MAC000284', 'MAC000285',
'MAC000287', 'MAC000289', 'MAC000291', 'MAC000294', 'MAC000295'],
dtype=object)
print(len(data.LCLid.unique()))
>>> 55
To resolve the issue
import pandas as pd
import numpy as np
df = pd.read_csv('Power-Networks-LCL-June2015(withAcornGps)v2.csv')
# determine the rows needed for the MAC000228 - MAC000295
df[df.LCLid == 'MAC000228'].iloc[0, :] # first row of 228
df[df.LCLid == 'MAC000295'].iloc[-1, :] # last row of 295
# create a dataframe with the desired data
data = df[['LCLid', 'DateTime', 'KWH/hh (per half hour) ']].iloc[6989700:9032044, :].copy()
# fix the data
data.DateTime = pd.to_datetime(data.DateTime)
data.rename(columns={'KWH/hh (per half hour) ': 'KWH/hh'}, inplace=True)
data['KWH/hh'] = data['KWH/hh'].str.replace('Null', 'NaN')
data['KWH/hh'].fillna(np.nan, inplace=True)
data['KWH/hh'] = data['KWH/hh'].astype('float')
data.reset_index(drop=True, inplace=True)
# aggregate your functions
agg_data = data.groupby('LCLid')['KWH/hh'].agg(['sum', 'min', 'max', 'mean']).reset_index()
print(agg_data)
LCLid sum min max mean
0 MAC000228 5761.288000 0.021 1.616 0.146356
1 MAC000229 6584.866999 0.008 3.294 0.167456
2 MAC000230 8911.154000 0.029 2.750 0.226384
3 MAC000231 3174.314000 0.000 1.437 0.080663
4 MAC000232 2083.042000 0.005 0.736 0.052946
5 MAC000233 2241.591000 0.000 3.137 0.056993
6 MAC000234 9700.328001 0.029 2.793 0.246646
7 MAC000235 8473.999003 0.011 3.632 0.223194
8 MAC000237 22263.294998 0.036 4.450 0.598299
9 MAC000238 7814.889998 0.016 2.835 0.198781
10 MAC000239 6113.029000 0.015 1.346 0.155481
11 MAC000240 7280.662000 0.000 3.146 0.222399
12 MAC000241 4181.169999 0.024 1.733 0.194963
13 MAC000242 1654.336000 0.000 1.481 0.042088
14 MAC000243 11057.366999 0.009 3.588 0.281989
15 MAC000244 5894.271000 0.005 1.884 0.149939
16 MAC000245 22788.699005 0.037 4.743 0.580087
17 MAC000246 13787.060005 0.014 3.516 0.351075
18 MAC000248 10192.239001 0.000 4.351 0.259536
19 MAC000249 24401.468995 0.148 5.242 0.893042
20 MAC000250 5850.003000 0.000 2.185 0.148999
21 MAC000251 8400.234000 0.035 3.505 0.213931
22 MAC000252 21748.489004 0.135 4.171 0.554978
23 MAC000253 9739.408999 0.009 1.714 0.248201
24 MAC000254 9351.614001 0.009 2.484 0.238209
25 MAC000255 14142.974002 0.097 3.305 0.360220
26 MAC000256 20398.665001 0.049 3.019 0.520680
27 MAC000258 6646.485998 0.017 2.319 0.169666
28 MAC000260 5952.563001 0.006 2.192 0.151952
29 MAC000262 13909.603999 0.000 2.878 0.355181
30 MAC000263 3753.997000 0.015 1.060 0.095863
31 MAC000264 7022.967000 0.020 0.910 0.179432
32 MAC000267 8797.094000 0.029 2.198 0.224898
33 MAC000268 3734.252001 0.000 1.599 0.095359
34 MAC000269 2395.232000 0.000 1.029 0.061167
35 MAC000270 15569.711002 0.131 2.249 0.397501
36 MAC000271 7244.860000 0.028 1.794 0.184974
37 MAC000272 8703.658998 0.034 3.295 0.222446
38 MAC000273 3622.199002 0.005 5.832 0.092587
39 MAC000274 28724.718997 0.032 3.927 0.734422
40 MAC000275 5564.004999 0.012 1.840 0.161290
41 MAC000276 11060.774001 0.000 1.709 0.315724
42 MAC000277 8446.528999 0.027 1.938 0.241075
43 MAC000279 3444.160999 0.016 1.846 0.098354
44 MAC000280 12595.780001 0.125 1.988 0.360436
45 MAC000281 6282.568000 0.024 1.433 0.179538
46 MAC000282 4457.989001 0.030 1.830 0.127444
47 MAC000283 5024.917000 0.011 2.671 0.143627
48 MAC000284 1293.503000 0.000 0.752 0.047975
49 MAC000285 2399.018000 0.006 0.931 0.068567
50 MAC000287 1407.290000 0.000 2.372 0.045253
51 MAC000289 4767.490999 0.000 2.287 0.136436
52 MAC000291 13456.678999 0.072 3.354 0.385060
53 MAC000294 9477.966000 0.053 2.438 0.271264
54 MAC000295 7750.128000 0.010 1.839 0.221774
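As a side note: the hard-coded iloc positions above (6989700:9032044) are specific to this particular copy of the CSV. A more robust variant (just a sketch, assuming the same df as loaded above) selects the rows by the LCLid values themselves, so the slice cannot drift if the file changes:
# sketch: select the consumers of interest by label instead of by row position
wanted = [f'MAC000{n}' for n in range(228, 296)]  # not every ID in this range exists in the data
data = df.loc[df.LCLid.isin(wanted), ['LCLid', 'DateTime', 'KWH/hh (per half hour) ']].copy()
The rest of the cleaning and the groupby aggregation stay exactly as above.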
Related
I'm trying to query data from a website, but each query takes about 6 seconds, so all 3000 of my queries would keep me waiting for roughly 5 hours. I heard there was a way to parallelize this, so I tried multiprocessing. It didn't work, and asyncio gave much the same result. Here's the multiprocessing code, since that's the simpler of the two.
I have 5+ URLs in a list that I want to request tables from (see the sketch after the list for how they end up in the urls variable used below):
https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-query?outfmt=1&searchForm=MO&spatial=cone&catalog=neowiser_p1bs_psd&moradius=5&mobj=smo&mobjstr=1&btime=2014+01+17+18:38:41&etime=2014+01+18+18:38:41&selcols=w1mpro,w1sigmpro,w1snr,w1rchi2,w1flux,w1sigflux,w1sky,w1mag_2,w1sigm,mjd
https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-query?outfmt=1&searchForm=MO&spatial=cone&catalog=neowiser_p1bs_psd&moradius=5&mobj=smo&mobjstr=2&btime=2014+05+18+23:10:01&etime=2014+05+19+23:10:01&selcols=w1mpro,w1sigmpro,w1snr,w1rchi2,w1flux,w1sigflux,w1sky,w1mag_2,w1sigm,mjd
https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-query?outfmt=1&searchForm=MO&spatial=cone&catalog=neowiser_p1bs_psd&moradius=5&mobj=smo&mobjstr=3&btime=2014+11+04+06:01:27&etime=2014+11+05+06:01:27&selcols=w1mpro,w1sigmpro,w1snr,w1rchi2,w1flux,w1sigflux,w1sky,w1mag_2,w1sigm,mjd
https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-query?outfmt=1&searchForm=MO&spatial=cone&catalog=neowiser_p1bs_psd&moradius=5&mobj=smo&mobjstr=4&btime=2014+07+14+10:01:45&etime=2014+07+15+10:01:45&selcols=w1mpro,w1sigmpro,w1snr,w1rchi2,w1flux,w1sigflux,w1sky,w1mag_2,w1sigm,mjd
https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-query?outfmt=1&searchForm=MO&spatial=cone&catalog=neowiser_p1bs_psd&moradius=5&mobj=smo&mobjstr=5&btime=2014+07+04+20:17:01&etime=2014+07+05+20:17:01&selcols=w1mpro,w1sigmpro,w1snr,w1rchi2,w1flux,w1sigflux,w1sky,w1mag_2,w1sigm,mjd
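The code below refers to a urls variable that is not shown; here is a minimal sketch of how the URLs above might be collected into it (the name urls is taken from the pool.map call, the rest is just illustration):
# hypothetical collection of the URLs listed above into the list passed to pool.map
urls = [
    'https://irsa.ipac.caltech.edu/cgi-bin/Gator/nph-query?outfmt=1&searchForm=MO&spatial=cone&catalog=neowiser_p1bs_psd&moradius=5&mobj=smo&mobjstr=1&btime=2014+01+17+18:38:41&etime=2014+01+18+18:38:41&selcols=w1mpro,w1sigmpro,w1snr,w1rchi2,w1flux,w1sigflux,w1sky,w1mag_2,w1sigm,mjd',
    # ... and the remaining four URLs from the list above
]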
Here's my request code:
from astropy.io import fits
from astropy.io import ascii as astro_ascii
from astropy.time import Time
import multiprocessing as mp

def get_real_meta(url):
    # fetch the IPAC table at the given URL and convert it to a pandas DataFrame
    df = astro_ascii.read(url)
    df = df.to_pandas()
    print(df)
    return df

pool = mp.Pool(processes=10)
results = pool.map(get_real_meta, urls)
When I run this, some of the results are failed requests.
Why is this happening?
This is the full result from the run:
col1 \
0 [struct stat="ERROR"
col2 \
0 msg="ERROR: Cannot process Moving Object Correctly. "
col3
0 logfile="https://irsa.ipac.caltech.edu:443/workspace/TMP_1YL1MV_20034/Gator/irsa/20034/log.20034.html"]
col1 \
0 [struct stat="ERROR"
col2 \
0 msg="ERROR: Cannot process Moving Object Correctly. "
col3
0 logfile="https://irsa.ipac.caltech.edu:443/workspace/TMP_gteaoW_20031/Gator/irsa/20031/log.20031.html"]
col1 \
0 [struct stat="ERROR"
col2 \
0 msg="ERROR: Cannot process Moving Object Correctly. "
col3
0 logfile="https://irsa.ipac.caltech.edu:443/workspace/TMP_yWl2vY_20037/Gator/irsa/20037/log.20037.html"]
cntr_u dist_x pang_x ra_u dec_u ra \
0 1 2.338323 -43.660587 153.298036 13.719689 153.297574
1 2 1.047075 96.475058 153.337711 13.730126 153.338009
2 3 1.709365 -159.072115 153.377399 13.740497 153.377224
3 4 0.903435 84.145005 153.377439 13.740491 153.377696
4 5 0.800164 99.321042 153.397283 13.745653 153.397509
5 6 0.591963 16.330683 153.417180 13.750790 153.417228
6 7 0.642462 63.761302 153.437090 13.755910 153.437255
7 8 1.020182 -123.497531 153.457013 13.761012 153.456770
8 9 1.051963 130.842143 153.476909 13.766102 153.477137
9 11 1.007540 -55.156815 153.516820 13.776216 153.516583
10 12 0.607295 118.463910 153.556744 13.786265 153.556897
11 13 0.227240 -79.964079 153.556784 13.786259 153.556720
12 14 1.526454 -113.268004 153.596760 13.796237 153.596359
dec clon clat w1mpro w1sigmpro w1snr w1rchi2 \
0 13.720159 10h13m11.42s 13d43m12.57s 7.282 0.014 78.4 307.90
1 13.730093 10h13m21.12s 13d43m48.34s 6.925 0.018 59.8 82.35
2 13.740054 10h13m30.53s 13d44m24.19s 6.354 0.012 91.3 203.90
3 13.740517 10h13m30.65s 13d44m25.86s 6.862 0.015 70.3 61.03
4 13.745617 10h13m35.40s 13d44m44.22s 7.005 0.016 68.0 62.28
5 13.750948 10h13m40.13s 13d45m03.41s 6.749 0.015 70.4 26.35
6 13.755989 10h13m44.94s 13d45m21.56s 7.031 0.019 57.5 37.56
7 13.760856 10h13m49.62s 13d45m39.08s 6.729 0.013 84.9 66.91
8 13.765911 10h13m54.51s 13d45m57.28s 6.944 0.022 49.0 44.22
9 13.776376 10h14m03.98s 13d46m34.95s 7.049 0.022 49.1 20.63
10 13.786185 10h14m13.66s 13d47m10.26s 6.728 0.018 58.9 14.40
11 13.786270 10h14m13.61s 13d47m10.57s 6.773 0.024 45.3 10.65
12 13.796069 10h14m23.13s 13d47m45.85s 7.126 0.015 72.2 219.50
w1flux w1sigflux w1sky w1mag_2 w1sigm mjd
0 248830.0 3173.5 24.057 7.719 0.013 56795.965297
1 345700.0 5780.5 27.888 8.348 0.006 56796.096965
2 584870.0 6406.8 24.889 7.986 0.006 56796.228504
3 366210.0 5206.7 27.876 7.653 0.006 56796.228632
4 321210.0 4725.7 26.150 7.867 0.009 56796.294338
5 406400.0 5771.4 25.240 7.711 0.006 56796.360172
6 313360.0 5449.7 26.049 7.988 0.006 56796.426005
7 414100.0 4877.9 25.581 8.022 0.006 56796.491839
8 339610.0 6935.9 25.564 8.029 0.007 56796.557545
9 308370.0 6285.2 25.491 8.331 0.006 56796.689212
10 414410.0 7035.5 27.656 7.851 0.007 56796.820752
11 397500.0 8773.4 27.628 8.015 0.006 56796.820880
12 287270.0 3980.2 24.825 8.310 0.006 56796.952419
cntr_u dist_x pang_x ra_u dec_u ra dec \
0 1 0.570817 137.605512 128.754979 4.242103 128.755086 4.241986
1 2 0.852021 14.819525 128.791578 4.225474 128.791639 4.225703
2 3 1.099000 -4.816139 128.828083 4.208860 128.828057 4.209164
3 4 1.207022 9.485091 128.864456 4.192260 128.864511 4.192591
4 5 0.323112 107.976317 128.882608 4.183966 128.882694 4.183938
5 6 0.627727 99.373708 128.882645 4.183967 128.882817 4.183939
6 7 0.489166 19.732971 128.900773 4.175676 128.900819 4.175804
7 8 0.231292 -139.425350 128.918877 4.167389 128.918835 4.167340
8 9 0.393206 -28.705753 128.936958 4.159106 128.936905 4.159202
9 10 0.466548 -35.199460 128.936995 4.159107 128.936920 4.159213
10 11 1.153921 -100.879703 128.955053 4.150828 128.954737 4.150767
11 12 1.078232 -38.005043 128.973087 4.142552 128.972902 4.142788
12 13 1.172329 -27.290606 128.991097 4.134280 128.990947 4.134569
13 15 1.399220 54.717544 129.027083 4.117750 129.027401 4.117974
clon clat w1mpro w1sigmpro w1snr w1rchi2 w1flux \
0 08h35m01.22s 04d14m31.15s 6.768 0.018 58.9 60.490 395130.0
1 08h35m09.99s 04d13m32.53s 6.706 0.018 59.1 30.780 418160.0
2 08h35m18.73s 04d12m32.99s 6.754 0.024 45.4 20.520 400280.0
3 08h35m27.48s 04d11m33.33s 6.667 0.024 44.9 34.090 433390.0
4 08h35m31.85s 04d11m02.18s 6.782 0.023 47.8 9.326 389870.0
5 08h35m31.88s 04d11m02.18s 6.710 0.035 31.4 11.360 416570.0
6 08h35m36.20s 04d10m32.89s 6.880 0.021 52.7 7.781 356410.0
7 08h35m40.52s 04d10m02.42s 6.653 0.023 46.8 18.900 439130.0
8 08h35m44.86s 04d09m33.13s 6.986 0.023 47.2 8.576 323350.0
9 08h35m44.86s 04d09m33.17s 6.917 0.019 58.5 25.720 344400.0
10 08h35m49.14s 04d09m02.76s 6.782 0.015 70.1 173.800 390170.0
11 08h35m53.50s 04d08m34.04s 6.671 0.016 69.6 70.490 431820.0
12 08h35m57.83s 04d08m04.45s 7.152 0.016 66.6 131.100 277440.0
13 08h36m06.58s 04d07m04.71s 6.436 0.017 63.3 86.350 536630.0
w1sigflux w1sky w1mag_2 w1sigm mjd
0 6711.7 21.115 7.988 0.008 56965.251016
1 7070.1 23.748 7.830 0.007 56965.382556
2 8812.6 20.930 8.456 0.007 56965.514096
3 9649.6 21.350 8.120 0.008 56965.645509
4 8161.1 19.988 7.686 0.007 56965.711215
5 13264.0 22.180 7.902 0.016 56965.711343
6 6769.0 22.962 8.023 0.008 56965.777049
7 9382.2 22.355 8.030 0.007 56965.842755
8 6847.6 23.531 8.024 0.007 56965.908462
9 5882.5 21.256 7.654 0.007 56965.908589
10 5568.7 21.926 8.051 0.007 56965.974295
11 6202.3 23.497 7.950 0.007 56966.040002
12 4165.8 20.094 8.091 0.010 56966.105708
13 8482.9 22.436 8.191 0.008 56966.237248
I'm building a regression model (statsmodels.discrete.count_model.ZeroInflatedPoisson) whose goal is to predict the "count" variable, as you can see below. Any suggestions on how I can improve this model?
from patsy import dmatrices
import statsmodels.discrete.count_model as cm

# regression expression in Patsy notation: count depends on all these columns
expr = 'count ~ day_of_week + business_day + duration + Distance_KM + wind_speed + wind_deg + wind_gust + rain_1h + rain_3h + clouds_all + End_Station_Region_cat'

y_train, X_train = dmatrices(expr, df, return_type='dataframe')
y_test, X_test = dmatrices(expr, df, return_type='dataframe')

zip_training_results = cm.ZeroInflatedPoisson(endog=y_train, exog=X_train, exog_infl=X_train,
                                              inflation='logit', maxiter=1000, maxfun=500).fit_regularized()
Optimization terminated successfully (Exit mode 0)
Current function value: 0.030743127487834265
Iterations: 184
Function evaluations: 230
Gradient evaluations: 184
import statsmodels.discrete.count_model as cm
print(zip_training_results.summary())
ZeroInflatedPoisson Regression Results
===============================================================================
Dep. Variable: count No. Observations: 3099093
Model: ZeroInflatedPoisson Df Residuals: 3099081
Method: MLE Df Model: 11
Date: Sun, 04 Sep 2022 Pseudo R-squ.: 0.6252
Time: 17:17:42 Log-Likelihood: -95276.
converged: True LL-Null: -2.5418e+05
Covariance Type: nonrobust LLR p-value: 0.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
inflate_Intercept 29.2202 488.934 0.060 0.952 -929.073 987.514
inflate_day_of_week 7.5087 11.894 0.631 0.528 -15.802 30.820
inflate_business_day 20.5779 59.774 0.344 0.731 -96.577 137.733
inflate_duration -85.6092 487.673 -0.176 0.861 -1041.431 870.212
inflate_Distance_KM -1.9937 46.122 -0.043 0.966 -92.391 88.404
inflate_wind_speed -1.0937 9.175 -0.119 0.905 -19.077 16.889
inflate_wind_deg -5.6348 48.297 -0.117 0.907 -100.296 89.026
inflate_wind_gust 0.8242 763.772 0.001 0.999 -1496.142 1497.790
inflate_rain_1h 0.5288 59.935 0.009 0.993 -116.942 118.000
inflate_rain_3h 0.9703 7.97e+04 1.22e-05 1.000 -1.56e+05 1.56e+05
inflate_clouds_all -0.1600 5.526 -0.029 0.977 -10.991 10.671
inflate_End_Station_Region_cat -0.3111 2.362 -0.132 0.895 -4.940 4.318
Intercept -0.5539 0.020 -28.229 0.000 -0.592 -0.515
day_of_week 0.0115 0.003 3.563 0.000 0.005 0.018
business_day 0.0560 0.014 4.127 0.000 0.029 0.083
duration 0.0149 0.000 80.515 0.000 0.015 0.015
Distance_KM 0.0972 0.002 49.338 0.000 0.093 0.101
wind_speed -0.9573 0.009 -109.888 0.000 -0.974 -0.940
wind_deg -0.0106 8.82e-05 -119.945 0.000 -0.011 -0.010
wind_gust 0.0001 0.015 0.008 0.994 -0.029 0.029
rain_1h -0.1926 0.019 -10.084 0.000 -0.230 -0.155
rain_3h -0.0724 0.026 -2.755 0.006 -0.124 -0.021
clouds_all 0.0006 0.000 5.288 0.000 0.000 0.001
End_Station_Region_cat 0.0034 0.005 0.693 0.489 -0.006 0.013
==================================================================================================
import numpy as np

zip_predictions = zip_training_results.predict(X_test, exog_infl=X_test)
predicted_counts = np.round(zip_predictions)
actual_counts = y_test['count']
print('ZIP RMSE=' + str(np.sqrt(np.sum(np.power(np.subtract(predicted_counts, actual_counts), 2)))))
ZIP RMSE=195.05127530985283
The following histogram makes it clear that the model in the code above ends up predicting counts much higher than the observed values. When Y is low, the regressor makes good predictions; what I would like to fix is that it predicts much higher values of Y than the actual ones.
import matplotlib.pyplot as plt

plt.clf()
plt.hist([actual_counts, predicted_counts], log=True)
plt.legend(('orig', 'pred'))
plt.show()
A plain Poisson attempt, for comparison:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import PoissonRegressor
from sklearn import metrics

pipeline = Pipeline([('model', PoissonRegressor())])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
r2_test = metrics.r2_score(y_test, y_pred)
r2_test
-0.18668761012669255
y_pred_train = pipeline.predict(X_train)
r2_train = metrics.r2_score(y_train, y_pred_train)
r2_train
-0.10978290552023906
I'm trying to reorganise my data (the overarching goal is to convert an ASCII file to netCDF). One of the steps to get there is to take the data and stack the monthly columns. My original data look like this:
import pandas as pd
import numpy as np
import xarray as xr
fname = 'data.out'
df = pd.read_csv(fname, header=0, delim_whitespace=True)
print(df)
gives
Lon Lat Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 150.25 -34.25 1851 0.027 -0.005 -0.010 -0.034 -0.029 -0.025 0.016 -0.049 -0.055 0.003 -0.029 0.060
1 150.25 -34.25 1852 0.021 -0.002 -0.050 0.071 0.066 0.001 0.021 -0.014 -0.072 -0.050 0.113 0.114
2 150.25 -34.25 1853 0.093 0.094 0.139 -0.019 0.015 0.003 0.018 -0.032 -0.024 -0.010 0.132 0.107
3 150.25 -34.25 1854 0.084 0.071 0.024 -0.004 -0.022 0.005 0.025 0.006 -0.040 -0.051 -0.067 -0.005
4 150.25 -34.25 1855 -0.030 -0.004 -0.035 -0.036 -0.035 -0.012 0.009 -0.017 -0.062 -0.068 -0.077 -0.084
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
707995 138.75 -19.25 2096 -0.044 -0.039 -0.068 -0.027 -0.023 -0.029 -0.031 -0.002 -0.005 0.018 -0.039 -0.094
707996 138.75 -19.25 2097 -0.041 -0.066 -0.006 -0.018 -0.005 -0.017 0.011 0.018 0.026 0.024 0.010 -0.086
707997 138.75 -19.25 2098 -0.033 -0.044 -0.032 -0.044 -0.046 -0.040 -0.021 -0.017 0.022 -0.011 -0.015 -0.032
707998 138.75 -19.25 2099 0.039 0.016 -0.009 0.001 -0.002 0.001 0.010 0.021 0.026 0.027 0.012 -0.017
707999 138.75 -19.25 2100 0.010 -0.022 -0.024 -0.037 -0.008 -0.020 0.002 0.011 0.011 0.033 0.020 -0.002
[708000 rows x 15 columns]
I then select the actual timesteps
months = list(df.columns)
months = months[3:]
and select all columns that have monthly data. This then returns the shape
print(df[months].shape)
(708000, 12). So far so good, but then when I stack the data
df_stack = df[months].stack()
print(df_stack.shape)
instead of the expected shape (8496000,) I get (8493000,). The weird thing is that the script runs fine on other files that have the same shape as the data I used for this example, and I don't have that problem there. It looks like I'm losing one Lon/Lat pixel for 250 years, but I don't understand why. This becomes a problem later when I try to convert the data to a netCDF file.
lons = np.unique(df.Lon)
lats = np.unique(df.Lat)
years = np.unique(df.Year)
nyears = len(years)
nrows = len(lats)
ncols = len(lons)
nmonths = 12
lons.sort()
lats.sort()
years.sort()
time = pd.date_range(start=f'01/{years[0]}',
end=f'01/{years[-1]+1}', freq='M')
dx = 0.5
Lon = xr.DataArray(np.arange(-180.+dx/2., 180., dx), dims=("Lon"),
attrs={"long_name":"longitude", "unit":"degrees_east"})
nlon = Lon.size
dy = 0.5
Lat = xr.DataArray(np.arange(-90.+dy/2., 90., dy), dims=("Lat"),
attrs={"long_name":"latitude", "unit":"degrees_north"})
nlat = Lat.size
out = xr.DataArray(np.zeros((nyears*nmonths,nlat, nlon)),
dims=("Time","Lat","Lon"),
coords=({"Lat":Lat, "Lon":Lon, "Time":time}))
for nr in range(0, len(df.index), nyears):
    rows = df[nr:nr+nyears]
    thislon = rows["Lon"].min()
    thislat = rows["Lat"].min()
    out.loc[dict(Lon=thislon, Lat=thislat)] = df_stack[nr*nmonths:(nr+nyears)*nmonths]
This gives me the error:
ValueError: could not broadcast input array from shape (0,) into shape (3000,)
It's missing the 3000 values that I'm losing while stacking the data. Does anyone know how to fix this?
By default, stack() drops NaN values, which is exactly where the missing 3000 entries go. Replace:
df_stack = df[months].stack()
with
df_stack = df[months].stack(dropna=False)
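A quick illustration with a toy frame (not the questioner's data) of how the default dropna=True silently shrinks the stacked result whenever a monthly value is missing:
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Jan': [0.1, np.nan], 'Feb': [0.2, 0.3]})
print(toy.stack().shape)              # (3,)  the NaN is dropped
print(toy.stack(dropna=False).shape)  # (4,)  the full shape is preserved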
I have an issue in my code where I'm computing cut points.
First, this is my DataFrame column:
In [23]: df['bad%']
0 0.025
1 0.007
2 0.006
3 0.006
4 0.006
5 0.006
6 0.007
7 0.007
8 0.007
9 0.006
10 0.006
11 0.009
12 0.009
13 0.009
14 0.008
15 0.008
16 0.008
17 0.012
18 0.012
19 0.05
20 0.05
21 0.05
22 0.05
23 0.05
24 0.05
25 0.05
26 0.05
27 0.062
28 0.062
29 0.061
5143 0.166
5144 0.166
5145 0.166
5146 0.167
5147 0.167
5148 0.167
5149 0.167
5150 0.167
5151 0.05
5152 0.167
5153 0.167
5154 0.167
5155 0.167
5156 0.051
5157 0.052
5158 0.161
5159 0.149
5160 0.168
5161 0.168
5162 0.168
5163 0.168
5164 0.168
5165 0.168
5166 0.168
5167 0.168
5168 0.049
5169 0.168
5170 0.168
5171 0.168
5172 0.168
Name: bad%, Length: 5173, dtype: float64
I used this code to find the value equal to or closest to 0.05 (the value introduced on the console):
error = 100   # margin of error
valuesA = []  # to save the matched row
pointCut = 0  # index of the cut point
a = 0.05      # variable "a" is introduced from the console; in this case it is 0.05

for index, row in df.iterrows():
    if abs(row['bad%'] - a) <= error:
        valuesA = row
        error = abs(row['bad%'] - a)
        pointCut = index
This code returns the value "0.05" at index 5151, which at first glance looks right, because the "0.05" at index "5151" is the last exact "0.05".
Out [27]:
5151 0.05
But my objective is to obtain THE LAST VALUE IN THE COLUMN equal to or closest to "0.05"; in this case that value is "0.049" at index "5168", and that is the value I need to obtain.
Is there an algorithm that allows this? Any solution or recommendation?
Thanks in advance.
Solutions if at least one value exists:
Use [::-1] to scan the values from the back and idxmax to get the last matching index value:
a = 0.05
s = df['bad%']
b = s[[(s[::-1] <= a).idxmax()]]
print (b)
5168 0.049
Or:
b = s[(s <= a)].iloc[[-1]]
print (b)
5168 0.049
Name: bad%, dtype: float64
A solution that also works if no such value exists; in that case an empty Series is returned:
a = 0.05
s = df['bad%']
m1 = (s <= a)
m2 = m1[::-1].cumsum().eq(1)
b = s[m1 & m2]
print (b)
5168 0.049
Name: bad%, dtype: float64
Sample data:
df = pd.DataFrame({'bad%': {5146: 0.16699999999999998, 5147: 0.16699999999999998, 5148: 0.16699999999999998, 5149: 0.049, 5150: 0.16699999999999998, 5151: 0.05, 5152: 0.16699999999999998, 5167: 0.168, 5168: 0.049, 5169: 0.168}})
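As a quick check on this sample frame, the second one-liner from above gives the expected last value (nothing new here, just the code already shown applied to the sample data):
a = 0.05
s = df['bad%']
print(s[(s <= a)].iloc[[-1]])
# 5168    0.049
# Name: bad%, dtype: float64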
Test code:
import numpy as np
import pandas as pd
COUNT = 1000000
df = pd.DataFrame({
'y': np.random.normal(0, 1, COUNT),
'z': np.random.gamma(50, 1, COUNT),
})
%timeit df.y[(10 < df.z) & (df.z < 50)].mean()
%timeit df.y.values[(10 < df.z.values) & (df.z.values < 50)].mean()
%timeit df.eval('y[(10 < z) & (z < 50)].mean()', engine='numexpr')
The output on my machine (a fairly fast x86-64 Linux desktop with Python 3.6) is:
17.8 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.44 ms ± 502 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
46.4 ms ± 2.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I understand why the second line is a bit faster (it ignores the Pandas index). But why is the eval() approach using numexpr so slow? Shouldn't it be at least faster than the first approach? The documentation sure makes it seem like it would be: https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html
From the investigation presented below, it looks like the unspectacular reason for the worse performance is "overhead".
Only a small part of the expression y[(10 < z) & (z < 50)].mean() is evaluated via the numexpr module. numexpr doesn't support indexing, so we can only hope for (10 < z) & (z < 50) to be sped up; everything else is mapped to pandas operations.
However, (10 < z) & (z < 50) is not the bottleneck here, as can easily be seen:
%timeit df.y[(10 < df.z) & (df.z < 50)].mean() # 16.7 ms
mask=(10 < df.z) & (df.z < 50)
%timeit df.y[mask].mean() # 13.7 ms
%timeit df.y[mask] # 13.2 ms
df.y[mask] takes the lion's share of the running time.
We can compare the profiler output for df.y[mask] and df.eval('y[mask]') to see what makes the difference.
When I use the following script:
import numpy as np
import pandas as pd
COUNT = 1000000
df = pd.DataFrame({
'y': np.random.normal(0, 1, COUNT),
'z': np.random.gamma(50, 1, COUNT),
})
mask = (10 < df.z) & (df.z < 50)
df['m']=mask
for _ in range(500):
    df.y[df.m]
    # OR
    # df.eval('y[m]', engine='numexpr')
and run it with python -m cProfile -s cumulative run.py (or %prun -s cumulative <...> in IPython), I can see the following profiles.
For direct call of the pandas functionality:
ncalls tottime percall cumtime percall filename:lineno(function)
419/1 0.013 0.000 7.228 7.228 {built-in method builtins.exec}
1 0.006 0.006 7.228 7.228 run.py:1(<module>)
500 0.005 0.000 6.589 0.013 series.py:764(__getitem__)
500 0.003 0.000 6.475 0.013 series.py:812(_get_with)
500 0.003 0.000 6.468 0.013 series.py:875(_get_values)
500 0.009 0.000 6.445 0.013 internals.py:4702(get_slice)
500 0.006 0.000 3.246 0.006 range.py:491(__getitem__)
505 3.146 0.006 3.236 0.006 base.py:2067(__getitem__)
500 3.170 0.006 3.170 0.006 internals.py:310(_slice)
635/2 0.003 0.000 0.414 0.207 <frozen importlib._bootstrap>:958(_find_and_load)
We can see that almost 100% of the time is spent in series.__getitem__ without any overhead.
For the call via df.eval(...), the situation is quite different:
ncalls tottime percall cumtime percall filename:lineno(function)
453/1 0.013 0.000 12.702 12.702 {built-in method builtins.exec}
1 0.015 0.015 12.702 12.702 run.py:1(<module>)
500 0.013 0.000 12.090 0.024 frame.py:2861(eval)
1000/500 0.025 0.000 10.319 0.021 eval.py:153(eval)
1000/500 0.007 0.000 9.247 0.018 expr.py:731(__init__)
1000/500 0.004 0.000 9.236 0.018 expr.py:754(parse)
4500/500 0.019 0.000 9.233 0.018 expr.py:307(visit)
1000/500 0.003 0.000 9.105 0.018 expr.py:323(visit_Module)
1000/500 0.002 0.000 9.102 0.018 expr.py:329(visit_Expr)
500 0.011 0.000 9.096 0.018 expr.py:461(visit_Subscript)
500 0.007 0.000 6.874 0.014 series.py:764(__getitem__)
500 0.003 0.000 6.748 0.013 series.py:812(_get_with)
500 0.004 0.000 6.742 0.013 series.py:875(_get_values)
500 0.009 0.000 6.717 0.013 internals.py:4702(get_slice)
500 0.006 0.000 3.404 0.007 range.py:491(__getitem__)
506 3.289 0.007 3.391 0.007 base.py:2067(__getitem__)
500 3.282 0.007 3.282 0.007 internals.py:310(_slice)
500 0.003 0.000 1.730 0.003 generic.py:432(_get_index_resolvers)
1000 0.014 0.000 1.725 0.002 generic.py:402(_get_axis_resolvers)
2000 0.018 0.000 1.685 0.001 base.py:1179(to_series)
1000 0.003 0.000 1.537 0.002 scope.py:21(_ensure_scope)
1000 0.014 0.000 1.534 0.002 scope.py:102(__init__)
500 0.005 0.000 1.476 0.003 scope.py:242(update)
500 0.002 0.000 1.451 0.003 inspect.py:1489(stack)
500 0.021 0.000 1.449 0.003 inspect.py:1461(getouterframes)
11000 0.062 0.000 1.415 0.000 inspect.py:1422(getframeinfo)
2000 0.008 0.000 1.276 0.001 base.py:1253(_to_embed)
2035 1.261 0.001 1.261 0.001 {method 'copy' of 'numpy.ndarray' objects}
1000 0.015 0.000 1.226 0.001 engines.py:61(evaluate)
11000 0.081 0.000 1.081 0.000 inspect.py:757(findsource)
once again about 7 seconds are spent in series.__getitem__, but there are also about 6 seconds overhead - for example about 2 seconds in frame.py:2861(eval) and about 2 seconds in expr.py:461(visit_Subscript).
I did only a superficial investigation (see more details further below), but this overhead doesn't seem to be just constant; it is at least linear in the number of elements in the series. For example, there are calls to the 'copy' method of 'numpy.ndarray' objects, which means that data is copied (it is quite unclear why this would be necessary per se).
My take-away: using pd.eval has advantages as long as the evaluated expression can be handled by numexpr alone. As soon as this is not the case, there are no longer gains but rather losses due to the quite large overhead.
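To act on this take-away (just a sketch, reusing the df from the test code above): push only the numexpr-friendly comparison through eval and keep the indexing in plain pandas:
# only the boolean expression goes through numexpr; the indexing and the mean stay in pandas
mask = df.eval('(10 < z) & (z < 50)', engine='numexpr')
result = df.y[mask].mean()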
Using line_profiler (here I use the %lprun magic, after loading it with %load_ext line_profiler, on a function run() which is more or less a copy of the script above), we can easily see where the time is lost in DataFrame.eval:
%lprun -f pd.core.frame.DataFrame.eval
-f pd.core.frame.DataFrame._get_index_resolvers
-f pd.core.frame.DataFrame._get_axis_resolvers
-f pd.core.indexes.base.Index.to_series
-f pd.core.indexes.base.Index._to_embed
run()
Here we can see where the additional 10% are spent:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2861 def eval(self, expr,
....
2951 10 206.0 20.6 0.0 from pandas.core.computation.eval import eval as _eval
2952
2953 10 176.0 17.6 0.0 inplace = validate_bool_kwarg(inplace, 'inplace')
2954 10 30.0 3.0 0.0 resolvers = kwargs.pop('resolvers', None)
2955 10 37.0 3.7 0.0 kwargs['level'] = kwargs.pop('level', 0) + 1
2956 10 17.0 1.7 0.0 if resolvers is None:
2957 10 235850.0 23585.0 9.0 index_resolvers = self._get_index_resolvers()
2958 10 2231.0 223.1 0.1 resolvers = dict(self.iteritems()), index_resolvers
2959 10 29.0 2.9 0.0 if 'target' not in kwargs:
2960 10 19.0 1.9 0.0 kwargs['target'] = self
2961 10 46.0 4.6 0.0 kwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)
2962 10 2392725.0 239272.5 90.9 return _eval(expr, inplace=inplace, **kwargs)
and _get_index_resolvers() can be drilled down to Index._to_embed:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1253 def _to_embed(self, keep_tz=False, dtype=None):
1254 """
1255 *this is an internal non-public method*
1256
1257 return an array repr of this object, potentially casting to object
1258
1259 """
1260 40 73.0 1.8 0.0 if dtype is not None:
1261 return self.astype(dtype)._to_embed(keep_tz=keep_tz)
1262
1263 40 201490.0 5037.2 100.0 return self.values.copy()
This is where the O(n) copying happens.