I'm trying to compute the mean of rows in a DataFrame grouped by two columns, but I'm getting the following error:
TypeError: 'numpy.float64' object is not callable
The dataframe:
date origin positive_score neutral_score negativity_score compound_score
2020-09-19 the verge 0.130 0.846 0.024 0.9833
2020-09-19 the verge 0.130 0.846 0.024 0.9833
2020-09-19 fool 0.075 0.869 0.056 0.8560
2020-09-19 seeking_alpha 0.067 0.918 0.015 0.9983
2020-09-19 seeking_alpha 0.171 0.791 0.038 0.7506
2020-09-19 seeking_alpha 0.095 0.814 0.091 0.9187
2020-09-19 seeking_alpha 0.113 0.801 0.086 0.9890
2020-09-19 seeking_alpha 0.094 0.869 0.038 0.9997
2020-09-19 wall street journal 0.000 1.000 0.000 0.0000
2020-09-19 seeking_alpha 0.179 0.779 0.042 0.9997
2020-09-19 seeking_alpha 0.178 0.704 0.117 0.7360
My code:
def mean_indicators(cls, df: pd.DataFrame):
    df_with_mean = df.groupby([DATE, ORIGIN], as_index=False).agg({
        POSITIVE_SCORE: df[POSITIVE_SCORE].mean(),
        NEGATIVE_SCORE: df[NEGATIVE_SCORE].mean(),
        NEUTRAL_SCORE: df[NEUTRAL_SCORE].mean(),
        COMPOUND_SCORE: df[COMPOUND_SCORE].mean(),
    })
    return df_with_mean
I think this should do what you want:
def mean_indicators(cls, df: pd.DataFrame):
    df_with_mean = df.groupby([DATE, ORIGIN], as_index=False).agg({
        POSITIVE_SCORE: "mean",
        NEGATIVE_SCORE: "mean",
        NEUTRAL_SCORE: "mean",
        COMPOUND_SCORE: "mean",
    })
    return df_with_mean
You can alternatively use named aggregation syntax as seen here
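For reference, a minimal named-aggregation sketch; the frame below is a made-up stand-in mirroring a few rows of the data above:

```python
import pandas as pd

# small stand-in for the sentiment frame above
df = pd.DataFrame({
    "date": ["2020-09-19"] * 3,
    "origin": ["fool", "seeking_alpha", "seeking_alpha"],
    "positive_score": [0.075, 0.067, 0.171],
    "compound_score": [0.8560, 0.9983, 0.7506],
})

# named aggregation: output_name=(input_column, function)
out = df.groupby(["date", "origin"], as_index=False).agg(
    positive_mean=("positive_score", "mean"),
    compound_mean=("compound_score", "mean"),
)
```

This form also lets you rename the output columns in the same call.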
The error is the result of passing the result of an aggregation, rather than the aggregation itself, to .agg.
{POSITIVE_SCORE: df[POSITIVE_SCORE].mean()} is not correct.
{'positive_score': 'mean'} is correct.
Since you are trying to take the mean of all the non-grouped numeric columns, the function is not necessary.
Use pandas.core.groupby.GroupBy.mean for one operation on the entire dataframe.
Use pandas.core.groupby.DataFrameGroupBy.aggregate to aggregate different operations.
Applying multiple functions at once
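To see why the original call fails, here is a small sketch with toy column names: df["x"].mean() evaluates to a numpy.float64 value before .agg ever runs, and .agg then tries to call that number as if it were a function.

```python
import pandas as pd

# toy stand-in frame
df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1.0, 3.0, 5.0]})

# df["x"].mean() is a numpy.float64 *value*, not a function, so .agg
# attempts to call a number and raises the TypeError from the question
try:
    df.groupby("g").agg({"x": df["x"].mean()})
    msg = ""
except TypeError as e:
    msg = str(e)

# passing the function name as a string is what .agg expects
fixed = df.groupby("g").agg({"x": "mean"})
```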
# just groupby and mean
df_mean = df.groupby(['date', 'origin'], as_index=False).mean()
# display(df_mean)
date origin positive_score neutral_score negativity_score compound_score
2020-09-19 fool 0.075000 0.869000 0.056 0.856000
2020-09-19 seeking_alpha 0.128143 0.810857 0.061 0.913143
2020-09-19 the verge 0.130000 0.846000 0.024 0.983300
2020-09-19 wall street journal 0.000000 1.000000 0.000 0.000000
I'm trying to reorganise my data (the overarching goal is to convert an ASCII file to netCDF). One of the steps to get there is to take the data and stack the columns. My original data look like this:
import pandas as pd
import numpy as np
import xarray as xr
fname = 'data.out'
df = pd.read_csv(fname, header=0, delim_whitespace=True)
print(df)
gives
Lon Lat Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 150.25 -34.25 1851 0.027 -0.005 -0.010 -0.034 -0.029 -0.025 0.016 -0.049 -0.055 0.003 -0.029 0.060
1 150.25 -34.25 1852 0.021 -0.002 -0.050 0.071 0.066 0.001 0.021 -0.014 -0.072 -0.050 0.113 0.114
2 150.25 -34.25 1853 0.093 0.094 0.139 -0.019 0.015 0.003 0.018 -0.032 -0.024 -0.010 0.132 0.107
3 150.25 -34.25 1854 0.084 0.071 0.024 -0.004 -0.022 0.005 0.025 0.006 -0.040 -0.051 -0.067 -0.005
4 150.25 -34.25 1855 -0.030 -0.004 -0.035 -0.036 -0.035 -0.012 0.009 -0.017 -0.062 -0.068 -0.077 -0.084
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
707995 138.75 -19.25 2096 -0.044 -0.039 -0.068 -0.027 -0.023 -0.029 -0.031 -0.002 -0.005 0.018 -0.039 -0.094
707996 138.75 -19.25 2097 -0.041 -0.066 -0.006 -0.018 -0.005 -0.017 0.011 0.018 0.026 0.024 0.010 -0.086
707997 138.75 -19.25 2098 -0.033 -0.044 -0.032 -0.044 -0.046 -0.040 -0.021 -0.017 0.022 -0.011 -0.015 -0.032
707998 138.75 -19.25 2099 0.039 0.016 -0.009 0.001 -0.002 0.001 0.010 0.021 0.026 0.027 0.012 -0.017
707999 138.75 -19.25 2100 0.010 -0.022 -0.024 -0.037 -0.008 -0.020 0.002 0.011 0.011 0.033 0.020 -0.002
[708000 rows x 15 columns]
I then select the actual timesteps
months=list(df.columns)
months=months[3:]
and select all columns that have monthly data. This then returns the shape
print(df[months].shape)
(708000, 12). So far so good, but then when I stack the data
df_stack = df[months].stack()
print(df_stack.shape)
instead of the expected shape (8496000,) I get (8493000,). The weird thing is that the script runs on other files that have the same shape as the data I used for this example, and I don't have that problem there. It looks like I'm losing one Lon/Lat pixel for 250 years, but I don't understand why. This becomes a problem later when I try to convert the data to a netCDF file.
lons = np.unique(df.Lon)
lats = np.unique(df.Lat)
years = np.unique(df.Year)
nyears = len(years)
nrows = len(lats)
ncols = len(lons)
nmonths = 12
lons.sort()
lats.sort()
years.sort()
time = pd.date_range(start=f'01/{years[0]}',
end=f'01/{years[-1]+1}', freq='M')
dx = 0.5
Lon = xr.DataArray(np.arange(-180.+dx/2., 180., dx), dims=("Lon"),
attrs={"long_name":"longitude", "unit":"degrees_east"})
nlon = Lon.size
dy = 0.5
Lat = xr.DataArray(np.arange(-90.+dy/2., 90., dy), dims=("Lat"),
attrs={"long_name":"latitude", "unit":"degrees_north"})
nlat = Lat.size
out = xr.DataArray(np.zeros((nyears*nmonths,nlat, nlon)),
dims=("Time","Lat","Lon"),
coords=({"Lat":Lat, "Lon":Lon, "Time":time}))
for nr in range(0, len(df.index), nyears):
    rows = df[nr:nr+nyears]
    thislon = rows["Lon"].min()
    thislat = rows["Lat"].min()
    out.loc[dict(
        Lon=thislon,
        Lat=thislat)] = df_stack[nr*nmonths:(nr+nyears)*nmonths]
this gives me the error
ValueError: could not broadcast input array from shape (0,) into shape (3000,)
It's missing the 3000 values that I'm losing while stacking the data. Does anyone know how to fix this?
Replace:
df_stack = df[months].stack()
with:
df_stack = df[months].stack(dropna=False)
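A minimal sketch of why the lengths differ: stack() drops NaN cells by default, so any missing monthly value shortens the result. The toy frame below stands in for the monthly columns:

```python
import numpy as np
import pandas as pd

# toy stand-in for the monthly columns, with one missing cell
df = pd.DataFrame({"Jan": [0.1, np.nan], "Feb": [0.2, 0.3]})

stacked = df.stack()                   # NaN cell silently dropped
stacked_full = df.stack(dropna=False)  # every cell kept
```

So the 8496000 − 8493000 = 3000 missing entries likely correspond to NaN cells in the monthly columns (one pixel over 250 years × 12 months).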
This is my first Stack Overflow post. I am studying part-time for a data science qualification and I'm stuck with Statsmodels SARIMAX prediction.
My time series data looks as follows:
ts_log.head()
Calendar Week
2016-02-22 8.168486
2016-02-29 8.252707
2016-03-07 8.324821
2016-03-14 8.371474
2016-03-21 8.766238
Name: Sales Quantity, dtype: float64
ts_log.tail()
Calendar Week
2020-07-20 8.326759
2020-07-27 8.273847
2020-08-03 8.286521
2020-08-10 8.222822
2020-08-17 8.011687
Name: Sales Quantity, dtype: float64
I run the following
train = ts_log[:'2019-07-01'].dropna()
test = ts_log['2020-08-24':].dropna()
model = SARIMAX(train, order=(2,1,2), seasonal_order=(0,1,0,52)
,enforce_stationarity=False, enforce_invertibility=False)
results = model.fit()
summary shows
results.summary()
Dep. Variable: Sales Quantity No. Observations: 175
Model: SARIMAX(2, 1, 2)x(0, 1, 0, 52) Log Likelihood 16.441
Date: Mon, 21 Sep 2020 AIC -22.883
Time: 22:32:28 BIC -8.987
Sample: 0 HQIC -17.240
- 175
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
ar.L1 1.3171 0.288 4.578 0.000 0.753 1.881
ar.L2 -0.5158 0.252 -2.045 0.041 -1.010 -0.022
ma.L1 -1.5829 0.519 -3.048 0.002 -2.601 -0.565
ma.L2 0.5093 0.502 1.016 0.310 -0.474 1.492
sigma2 0.0345 0.011 3.195 0.001 0.013 0.056
Ljung-Box (Q): 30.08 Jarque-Bera (JB): 2.55
Prob(Q): 0.87 Prob(JB): 0.28
Heteroskedasticity (H): 0.54 Skew: -0.02
Prob(H) (two-sided): 0.05 Kurtosis: 3.72
However, when I try to predict I get a key error suggesting my start date is incorrect, but I can't see what is wrong with it:
pred = results.predict(start='2019-06-10',end='2020-08-17')[1:]
KeyError: 'The `start` argument could not be matched to a location related to the index of the data.'
I can see both of these dates are valid:
ts_log['2019-06-10']
8.95686647085414
ts_log['2020-08-17']
8.011686729127847
If, instead, I run it with integer positions, it works fine:
pred = results.predict(start=175,end=200)[1:]
I'd like to use dates so I can use the predictions in my time series graph with the other dates.
EmmaT,
you seem to have the same date for start and end:
start='2019-06-10', end='2019-06-10'
Please double-check that this is what you want. Also check that '2019-06-10' is present in the index of the data the model was fit on (train), not only in ts_log.
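One more thing worth checking, offered as a guess from the summary table: "Sample: 0 - 175" suggests statsmodels fell back to integer positions, which happens when the fitted series' index is not a DatetimeIndex with a frequency set. A pandas-only sketch of the checks (the weekly index here is a hypothetical reconstruction of the data in the question):

```python
import pandas as pd

# hypothetical reconstruction of the weekly series in the question
ts_log = pd.Series(
    range(235),
    index=pd.date_range("2016-02-22", periods=235, freq="W-MON"),
)
train = ts_log[:"2019-07-01"]

# predict(start="...", end="...") can only match date strings when the
# fitted data has a DatetimeIndex with a frequency attached
has_freq = isinstance(train.index, pd.DatetimeIndex) and train.index.freq is not None
start_ok = pd.Timestamp("2019-06-10") in train.index
```

If either check fails on the real data, reindexing the series with an explicit weekly frequency before fitting should let date-based predict work.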
I was working on a machine learning project, and while extracting features I found that some consumers' LCLids disappear from the dataset when I group by LCLid.
Dataset: SmartMeter Energy Consumption Data in London Households
Here is the original dataset, and here is the code that I used to extract some features:
LCLid = []
for i in range(68):
    LCLid.append('MAC0000' + str(228 + i))
consommation = data.groupby('LCLid')['KWH/hh'].sum()
consommation_min = data.groupby('LCLid')['KWH/hh'].min()
consommation_max = data.groupby('LCLid')['KWH/hh'].max()
consommation_mean = data.groupby('LCLid')['KWH/hh'].mean()
consommation_evening = data.groupby(['LCLid', 'period'])['KWH/hh'].mean()
# create the DataFrame
list_of_tuples = list(zip(LCLid, consommation, consommation_min, consommation_max, consommation_mean))
data2 = pd.DataFrame(list_of_tuples, columns=['LCLid', 'Consumption', 'Consumption_min', 'Consumption_max', 'Consumption_mean'])
As you can see, after executing the code the dataset stops at LCLid MAC000282, while the original dataset also contains the LCLids from MAC000283 to MAC000295.
Using low-carbon-london-data from SmartMeter Energy Consumption Data in London Households
The issue is LCLid does not uniformly increment by 1, from MAC000228 to MAC000295.
print(data.LCLid.unique())
array(['MAC000228', 'MAC000229', 'MAC000230', 'MAC000231', 'MAC000232',
'MAC000233', 'MAC000234', 'MAC000235', 'MAC000237', 'MAC000238',
'MAC000239', 'MAC000240', 'MAC000241', 'MAC000242', 'MAC000243',
'MAC000244', 'MAC000245', 'MAC000246', 'MAC000248', 'MAC000249',
'MAC000250', 'MAC000251', 'MAC000252', 'MAC000253', 'MAC000254',
'MAC000255', 'MAC000256', 'MAC000258', 'MAC000260', 'MAC000262',
'MAC000263', 'MAC000264', 'MAC000267', 'MAC000268', 'MAC000269',
'MAC000270', 'MAC000271', 'MAC000272', 'MAC000273', 'MAC000274',
'MAC000275', 'MAC000276', 'MAC000277', 'MAC000279', 'MAC000280',
'MAC000281', 'MAC000282', 'MAC000283', 'MAC000284', 'MAC000285',
'MAC000287', 'MAC000289', 'MAC000291', 'MAC000294', 'MAC000295'],
dtype=object)
print(len(data.LCLid.unique()))
>>> 55
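Because of those gaps, the safest fix is to derive the id list from the data itself rather than from a range. A small sketch with a toy frame (ids intentionally non-consecutive):

```python
import pandas as pd

# toy stand-in for the meter data, with a gap in the ids
data = pd.DataFrame({
    "LCLid": ["MAC000228", "MAC000228", "MAC000230"],
    "KWH/hh": [0.1, 0.3, 0.2],
})

# take the ids that actually occur instead of assuming they
# increment by one
LCLid = sorted(data["LCLid"].unique())
consommation = data.groupby("LCLid")["KWH/hh"].sum()
```

The groupby result is keyed by the same ids, so zipping the two stays aligned.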
To resolve the issue
import pandas as pd
import numpy as np
df = pd.read_csv('Power-Networks-LCL-June2015(withAcornGps)v2.csv')
# determine the rows needed for the MAC000228 - MAC000295
df[df.LCLid == 'MAC000228'].iloc[0, :] # first row of 228
df[df.LCLid == 'MAC000295'].iloc[-1, :] # last row of 295
# create a dataframe with the desired data
data = df[['LCLid', 'DateTime', 'KWH/hh (per half hour) ']].iloc[6989700:9032044, :].copy()
# fix the data
data.DateTime = pd.to_datetime(data.DateTime)
data.rename(columns={'KWH/hh (per half hour) ': 'KWH/hh'}, inplace=True)
data['KWH/hh'] = data['KWH/hh'].str.replace('Null', 'NaN')
data['KWH/hh'].fillna(np.nan, inplace=True)
data['KWH/hh'] = data['KWH/hh'].astype('float')
data.reset_index(drop=True, inplace=True)
# aggregate your functions
agg_data = data.groupby('LCLid')['KWH/hh'].agg(['sum', 'min', 'max', 'mean']).reset_index()
print(agg_data)
agg_data
LCLid sum min max mean
0 MAC000228 5761.288000 0.021 1.616 0.146356
1 MAC000229 6584.866999 0.008 3.294 0.167456
2 MAC000230 8911.154000 0.029 2.750 0.226384
3 MAC000231 3174.314000 0.000 1.437 0.080663
4 MAC000232 2083.042000 0.005 0.736 0.052946
5 MAC000233 2241.591000 0.000 3.137 0.056993
6 MAC000234 9700.328001 0.029 2.793 0.246646
7 MAC000235 8473.999003 0.011 3.632 0.223194
8 MAC000237 22263.294998 0.036 4.450 0.598299
9 MAC000238 7814.889998 0.016 2.835 0.198781
10 MAC000239 6113.029000 0.015 1.346 0.155481
11 MAC000240 7280.662000 0.000 3.146 0.222399
12 MAC000241 4181.169999 0.024 1.733 0.194963
13 MAC000242 1654.336000 0.000 1.481 0.042088
14 MAC000243 11057.366999 0.009 3.588 0.281989
15 MAC000244 5894.271000 0.005 1.884 0.149939
16 MAC000245 22788.699005 0.037 4.743 0.580087
17 MAC000246 13787.060005 0.014 3.516 0.351075
18 MAC000248 10192.239001 0.000 4.351 0.259536
19 MAC000249 24401.468995 0.148 5.242 0.893042
20 MAC000250 5850.003000 0.000 2.185 0.148999
21 MAC000251 8400.234000 0.035 3.505 0.213931
22 MAC000252 21748.489004 0.135 4.171 0.554978
23 MAC000253 9739.408999 0.009 1.714 0.248201
24 MAC000254 9351.614001 0.009 2.484 0.238209
25 MAC000255 14142.974002 0.097 3.305 0.360220
26 MAC000256 20398.665001 0.049 3.019 0.520680
27 MAC000258 6646.485998 0.017 2.319 0.169666
28 MAC000260 5952.563001 0.006 2.192 0.151952
29 MAC000262 13909.603999 0.000 2.878 0.355181
30 MAC000263 3753.997000 0.015 1.060 0.095863
31 MAC000264 7022.967000 0.020 0.910 0.179432
32 MAC000267 8797.094000 0.029 2.198 0.224898
33 MAC000268 3734.252001 0.000 1.599 0.095359
34 MAC000269 2395.232000 0.000 1.029 0.061167
35 MAC000270 15569.711002 0.131 2.249 0.397501
36 MAC000271 7244.860000 0.028 1.794 0.184974
37 MAC000272 8703.658998 0.034 3.295 0.222446
38 MAC000273 3622.199002 0.005 5.832 0.092587
39 MAC000274 28724.718997 0.032 3.927 0.734422
40 MAC000275 5564.004999 0.012 1.840 0.161290
41 MAC000276 11060.774001 0.000 1.709 0.315724
42 MAC000277 8446.528999 0.027 1.938 0.241075
43 MAC000279 3444.160999 0.016 1.846 0.098354
44 MAC000280 12595.780001 0.125 1.988 0.360436
45 MAC000281 6282.568000 0.024 1.433 0.179538
46 MAC000282 4457.989001 0.030 1.830 0.127444
47 MAC000283 5024.917000 0.011 2.671 0.143627
48 MAC000284 1293.503000 0.000 0.752 0.047975
49 MAC000285 2399.018000 0.006 0.931 0.068567
50 MAC000287 1407.290000 0.000 2.372 0.045253
51 MAC000289 4767.490999 0.000 2.287 0.136436
52 MAC000291 13456.678999 0.072 3.354 0.385060
53 MAC000294 9477.966000 0.053 2.438 0.271264
54 MAC000295 7750.128000 0.010 1.839 0.221774
I have an issue in my code; I'm computing cut points.
First, this is my Dataframe Column:
In [23]: df['bad%']
0 0.025
1 0.007
2 0.006
3 0.006
4 0.006
5 0.006
6 0.007
7 0.007
8 0.007
9 0.006
10 0.006
11 0.009
12 0.009
13 0.009
14 0.008
15 0.008
16 0.008
17 0.012
18 0.012
19 0.05
20 0.05
21 0.05
22 0.05
23 0.05
24 0.05
25 0.05
26 0.05
27 0.062
28 0.062
29 0.061
...
5143 0.166
5144 0.166
5145 0.166
5146 0.167
5147 0.167
5148 0.167
5149 0.167
5150 0.167
5151 0.05
5152 0.167
5153 0.167
5154 0.167
5155 0.167
5156 0.051
5157 0.052
5158 0.161
5159 0.149
5160 0.168
5161 0.168
5162 0.168
5163 0.168
5164 0.168
5165 0.168
5166 0.168
5167 0.168
5168 0.049
5169 0.168
5170 0.168
5171 0.168
5172 0.168
Name: bad%, Length: 5173, dtype: float64
I used this code to detect the value equal or closest to 0.05 (a value entered on the console):
error = 100  # margin of error
valuesA = []  # stores the matched row
pointCut = 0  # identifies the cut point
for index, row in df.iterrows():
    if abs(row['bad%'] - a) <= error:
        valuesA = row
        error = abs(row['bad%'] - a)
        # the variable "a" is entered on the console; in this case it is 0.05
        pointCut = index
This code returns the value 0.05 at index 5151. At first glance this looks good, because the 0.05 at index 5151 is the last 0.05.
Out [27]:
5151 0.05
But my objective is to obtain THE LAST VALUE IN THE COLUMN equal to or nearest 0.05; in this case that value is 0.049 at index 5168, and that is the value I need.
Is there an algorithm that permits this? Any solution or recommendation?
Thanks in advance.
Solutions if at least one value exists:
Use [::-1] to reverse the Series and idxmax to get the last matched index value:
a = 0.05
s = df['bad%']
b = s[[(s[::-1] <= a).idxmax()]]
print (b)
5168 0.049
Or:
b = s[(s <= a)].iloc[[-1]]
print (b)
5168 0.049
Name: bad%, dtype: float64
A solution that also works if the value does not exist (then an empty Series is returned):
a = 0.05
s = df['bad%']
m1 = (s <= a)
m2 = m1[::-1].cumsum().eq(1)
b = s[m1 & m2]
print (b)
5168 0.049
Name: bad%, dtype: float64
Sample data:
df = pd.DataFrame({'bad%': {5146: 0.16699999999999998, 5147: 0.16699999999999998, 5148: 0.16699999999999998, 5149: 0.049, 5150: 0.16699999999999998, 5151: 0.05, 5152: 0.16699999999999998, 5167: 0.168, 5168: 0.049, 5169: 0.168}})
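Putting the sample frame and the second solution together as a runnable check (assuming the goal from the question: the last value less than or equal to 0.05):

```python
import pandas as pd

# sample data from the answer above
df = pd.DataFrame({'bad%': {5146: 0.167, 5147: 0.167, 5148: 0.167,
                            5149: 0.049, 5150: 0.167, 5151: 0.05,
                            5152: 0.167, 5167: 0.168, 5168: 0.049,
                            5169: 0.168}})
a = 0.05
s = df['bad%']

# keep only the values <= a, then take the last row that qualifies
b = s[s <= a].iloc[[-1]]
```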
I'm trying to write my output to a new file. The required output is 4 rows with n columns. It works fine when printing in the terminal, but as soon as I try to write the output to a file, it all prints on one line.
This is the code I have, which works until I print to the file. I have no idea why the output isn't the same. Can anyone explain why it is different and how I can go about correcting it? (Sorry, I'm a newbie, so simple terms would be helpful!) Any help is appreciated, thanks!
with open("file.txt", "w") as p:
    pwm = f.readlines()  # f is the input file, opened earlier
    lis = [x.split() for x in pwm]
    for x in zip(*lis):
        pwm = "\t".join(x)
        print str(pwm)    # this prints in the required format
        p.write(str(pwm)) # this prints all on one line
Required output:
0.224 0.128 0.536 0.009 0.007 0.085 0.013 0.097 0.058
0.339 0.152 0.136 0.002 0.002 0.009 0.876 0.031 0.829
0.250 0.421 0.299 0.004 0.065 0.845 0.027 0.834 0.007
0.186 0.299 0.029 0.985 0.926 0.061 0.084 0.038 0.106
File output:
0.224 0.128 0.536 0.009 0.007 0.085 0.013 0.097 0.058 0.339 0.152 0.136 0.002 0.002 0.009 0.876 0.031 0.829 etc...
Use the following (writelines, like write, does not add line separators, so the newline must be included):
p.writelines(str(pwm) + "\n")
The write method of a file object just adds text to the end of the file. If you want to have each row on its own line, you need to add a newline to the end of the string:
p.write(str(pwm) + "\n")
You do not have to do this with print because it does this for you implicitly.
You can either write a newline explicitly or use print with a file redirection.
p.write(str(pwm) + "\n")
print >> p, pwm
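The snippets above use Python 2 print syntax; the same idea in Python 3 looks like the sketch below, where the rows list is a stand-in for f.readlines():

```python
# Python 3 sketch: transpose whitespace-separated rows and write one
# tab-joined line per output row; print(..., file=p) appends the newline
rows = ["0.224 0.128 0.536", "0.339 0.152 0.136",
        "0.250 0.421 0.299", "0.186 0.299 0.029"]
lis = [x.split() for x in rows]
with open("file.txt", "w") as p:
    for x in zip(*lis):
        print("\t".join(x), file=p)
```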