Sum results of pandas groupby - python

I have the following DataFrame:
stock color 15M_c 60M_c mediodia 1D_c 1D-15M_c
0 PYPL rojo 0.32 0.32 0.47 -0.18 -0.50
1 MSFT verde -0.11 0.38 0.79 -0.48 -0.35
2 PYPL verde -1.44 -1.23 0.28 -1.13 0.30
3 V rojo -0.07 0.23 0.70 0.80 0.91
4 JD rojo 0.87 1.11 1.19 0.43 -0.42
5 FB verde 0.20 0.05 0.22 -0.66 -0.82
.. ... ... ... ... ... ... ...
282 GM verde 0.14 0.06 0.47 0.51 0.37
283 FB verde 0.09 -0.08 0.12 0.22 0.12
284 MSFT rojo -0.16 -0.23 -0.06 -0.01 0.14
285 PYPL verde -0.14 -0.41 -0.07 0.20 0.30
286 V verde -0.02 0.00 0.28 0.42 0.45
First I grouped by 'stock' and 'color' with the following code:
import numpy as np

marcos = ['15M_c','60M_c','mediodia','1D_c','1D-15M_c']
grouped = data.groupby(['stock','color'])
res = grouped[marcos].agg([np.size, np.sum])
So in 'res' I get the following DataFrame:
15M_c 60M_c mediodia 1D_c 1D-15M_c
size sum size sum size sum size sum size sum
stock color
AAPL rojo 10.0 -0.46 10.0 -0.20 10.0 -0.33 10.0 -0.25 10.0 0.18
verde 8.0 1.39 8.0 2.48 8.0 1.06 8.0 -1.57 8.0 -2.88
... ... .. .. .. .. .. .. .. .. .. ..
FB verde 15.0 0.92 15.0 -0.64 15.0 -0.11 15.0 -0.89 15.0 -1.80
MSFT rojo 11.0 0.47 11.0 2.07 11.0 2.71 11.0 4.37 11.0 3.83
verde 18.0 1.46 18.0 2.12 18.0 1.26 18.0 0.97 18.0 -0.54
PYPL rojo 9.0 1.06 9.0 2.68 9.0 5.02 9.0 3.98 9.0 2.84
verde 17.0 -1.57 17.0 -2.40 17.0 0.29 17.0 -0.48 17.0 1.08
V rojo 1.0 -0.22 1.0 -0.28 1.0 -0.36 1.0 -0.29 1.0 -0.06
verde 9.0 -1.01 9.0 -1.42 9.0 -0.86 9.0 0.58 9.0 1.61
Then I want to sum the 'verde' row with the 'rojo' row for each 'stock', multiplying the 'rojo' sums by -1. The final result I want is:
15M_c 60M_c mediodia 1D_c 1D-15M_c
size sum sum sum sum sum
stock
AAPL 18.0 1.85 2.68 1.39 -1.32 -3.06
... .. .. .. .. .. ..
FB 15.0 0.92 -0.64 -0.11 -0.89 -1.80
MSFT 29.0 0.99 0.05 -1.45 -3.40 -4.37
PYPL 26.0 -2.63 -5.08 .. .. ..
V 10.0 -0.79 -1.14 .. .. ..
Thank you very much in advance for your help.

pandas.IndexSlice
Use loc and IndexSlice to change the sign of the appropriate values, then use sum(level=0):
islc = pd.IndexSlice
res.loc[islc[:, 'rojo'], islc[:, 'sum']] *= -1
res.sum(level=0)
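Putting the pieces together, a minimal end-to-end sketch (assuming the data and marcos defined in the question; note that recent pandas versions have removed sum(level=0), so the equivalent groupby(level=0).sum() is used here):
import numpy as np
import pandas as pd

grouped = data.groupby(['stock', 'color'])
res = grouped[marcos].agg([np.size, np.sum])

islc = pd.IndexSlice
# flip the sign of every 'sum' column on the 'rojo' rows only
res.loc[islc[:, 'rojo'], islc[:, 'sum']] *= -1

# collapse the 'color' level, adding the sizes and the signed sums per stock
out = res.groupby(level=0).sum()
print(out)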

Convert the columns in marcos based on the value of color
import numpy as np

for m in marcos:
    data[m] = np.where(data['color'] == 'rojo', -data[m], data[m])
Then you can skip grouping by color altogether:
grouped = data.groupby(['stock'])
res = grouped[marcos].agg([np.size, np.sum])
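Equivalently, a short sketch (assuming the same data and marcos) that replaces the loop with a single vectorized assignment before grouping:
# flip the sign of every marcos column on the 'rojo' rows in one step
data.loc[data['color'] == 'rojo', marcos] *= -1

res = data.groupby('stock')[marcos].agg([np.size, np.sum])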

Related

Tensorflow dataframe lambda modify the type of object

So I have a dataframe that stores fighter stats as floats:
fighers_clean.dtypes
sig_str_abs_pM float64
sig_str_def_pct float64
sig_str_land_pM float64
sig_str_land_pct float64
sub_avg float64
td_avg float64
td_def_pct float64
td_land_pct float64
win% float64
fighers_clean
sig_str_abs_pM sig_str_def_pct sig_str_land_pM sig_str_land_pct sub_avg td_avg td_def_pct td_land_pct win%
name
Hunter Azure 1.57 0.56 4.00 0.50 2.5 2.00 0.75 0.33 1.000000
Jessica Eye 3.36 0.60 3.51 0.36 0.7 0.51 0.56 0.50 0.666667
Rolando Dy 4.47 0.52 3.04 0.37 0.0 0.30 0.68 0.20 0.529412
Gleidson Cutis 8.28 0.59 2.99 0.52 0.0 0.00 0.00 0.00 0.700000
Damien Brown 4.86 0.50 3.66 0.38 0.7 0.68 0.53 0.27 0.586207
... ... ... ... ... ... ... ... ... ...
Xiaonan Yan 4.22 0.64 6.85 0.40 0.0 0.25 0.66 0.50 0.916667
Alexander Yakovlev 2.44 0.58 1.79 0.47 0.2 1.56 0.72 0.33 0.705882
Rani Yahya 1.61 0.52 1.59 0.36 2.1 2.92 0.22 0.32 0.722222
Eddie Yagin 5.77 0.42 3.13 0.30 1.0 0.00 0.62 0.00 0.727273
Jamie Yager 2.55 0.63 3.08 0.39 0.0 0.00 0.66 0.00 0.714286
With this code I'm trying to add data to another dataframe that holds stats about matches:
for col in statistics:
    matches_clean[col] = matches_clean.apply(
        lambda row: fighers_clean.loc[row["fighter_1"], col] - fighers_clean.loc[row["fighter_2"], col], axis=1)
matches_clean.dtypes
fighter_1 object
fighter_2 object
result int64
sig_str_abs_pM object
sig_str_def_pct object
sig_str_land_pM object
sig_str_land_pct object
sub_avg object
td_avg object
td_def_pct object
td_land_pct object
win% object
dtype: object
fighter_1 fighter_2 result sig_str_abs_pM sig_str_def_pct sig_str_land_pM sig_str_land_pct sub_avg td_avg td_def_pct td_land_pct win%
fight_id
d7cbe2f23d75afd1 Julio Arce Hakeem Dawodu 0 0.56 0.03 -1.03 -0.1 0.6 0.58 0.08 0.3 -0.046154
f0418c2c989a5cde Grigorii Popov Davey Grant 0 2.52 -0.1 0.5 -0.15 0.0 -2.64 -0.19 -0.47 0.079167
fc16ccf0994c6e50 Jack Shore Nohelin Hernandez 1 -2.16 0.31 2.26 0.16 1.2 3.96 -0.37 0.07 0.285714
18e1b0df8da7010e Vanessa Melo Tracy Cortez 0 4.17 -0.05 -1.23 -0.26 0.0 -3.0 -0.23 -0.37 -0.286765
57ff0eb2351979c4 Khalid Taha Bruno Silva 1 -0.63 -0.11 1.1 0.1 0.5 -2.31 -0.37 -0.18 0.1875
This later causes the error ValueError: setting an array element with a sequence. at the line X_train_scaled = scaler.fit_transform(X_train):
# get ready for deep learning
X, y = matches_clean.iloc[:, 1:], matches_clean.iloc[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# normalization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train.dtypes
I'm pretty sure it's because the floats are converted to object during the lambda function.
Do you know why the lambda changes the returned dtypes and how to avoid that?
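A hedged diagnostic sketch (an assumption, not confirmed by the post): if the fighter names in fighers_clean's index are not unique, .loc[name, col] returns a Series rather than a scalar, which forces the new columns to object dtype and later triggers the "setting an array element with a sequence" error in StandardScaler; if every cell really is a plain number, coercing the columns back to float is enough for the scaler:
# check for duplicate fighter names in the index (would make .loc return a Series)
print(fighers_clean.index[fighers_clean.index.duplicated()])

# if every cell is a plain number, force the statistic columns back to float
matches_clean[statistics] = matches_clean[statistics].astype(float)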

How to list the specific countries in a df which have NaN values?

I have created df_nan below, which shows the number of NaN values in each column of the main df.
However, I want to create a new df, which has a column/index of countries, then another with the number of NaN values for the given country.
Country Number of NaN Values
Aruba 4
Finland 3
I feel like I have to use groupby to create something along the lines of the code below, but .isna is not an attribute of a groupby object. Any help would be great, thanks!
df_nan2= df_nan.groupby(['Country']).isna().sum()
Current code
import pandas as pd
import seaborn as sns
import numpy as np
from scipy.stats import spearmanr
# given dataframe df
df = pd.read_csv('countries.csv')
df.drop(columns= ['Population (millions)', 'HDI', 'GDP per Capita','Fish Footprint','Fishing Water',
'Urban Land','Earths Required', 'Countries Required', 'Data Quality'], axis=1, inplace = True)
df_nan= df.isna().sum()
Head of main df
0 Afghanistan Middle East/Central Asia 0.30 0.20 0.08 0.18 0.79 0.24 0.20 0.02 0.50 -0.30
1 Albania Northern/Eastern Europe 0.78 0.22 0.25 0.87 2.21 0.55 0.21 0.29 1.18 -1.03
2 Algeria Africa 0.60 0.16 0.17 1.14 2.12 0.24 0.27 0.03 0.59 -1.53
3 Angola Africa 0.33 0.15 0.12 0.20 0.93 0.20 1.42 0.64 2.55 1.61
4 Antigua and Barbuda Latin America NaN NaN NaN NaN 5.38 NaN NaN NaN 0.94 -4.44
5 Argentina Latin America 0.78 0.79 0.29 1.08 3.14 2.64 1.86 0.66 6.92 3.78
6 Armenia Middle East/Central Asia 0.74 0.18 0.34 0.89 2.23 0.44 0.26 0.10 0.89 -1.35
7 Aruba Latin America NaN NaN NaN NaN 11.88 NaN NaN NaN 0.57 -11.31
8 Australia Asia-Pacific 2.68 0.63 0.89 4.85 9.31 5.42 5.81 2.01 16.57 7.26
9 Austria European Union 0.82 0.27 0.63 4.14 6.06 0.71 0.16 2.04 3.07 -3.00
10 Azerbaijan Middle East/Central Asia 0.66 0.22 0.11 1.25 2.31 0.46 0.20 0.11 0.85 -1.46
11 Bahamas Latin America 0.97 1.05 0.19 4.46 6.84 0.05 0.00 1.18 9.55 2.71
12 Bahrain Middle East/Central Asia 0.52 0.45 0.16 6.19 7.49 0.01 0.00 0.00 0.58 -6.91
13 Bangladesh Asia-Pacific 0.29 0.00 0.08 0.26 0.72 0.25 0.00 0.00 0.38 -0.35
14 Barbados Latin America 0.56 0.24 0.14 3.28 4.48 0.08 0.00 0.02 0.19 -4.29
15 Belarus Northern/Eastern Europe 1.32 0.12 0.91 2.57 5.09 1.52 0.30 1.71 3.64 -1.45
16 Belgium European Union 1.15 0.48 0.99 4.43 7.44 0.56 0.03 0.28 1.19 -6.25
17 Benin Africa 0.49 0.04 0.26 0.51 1.41 0.44 0.04 0.34 0.88 -0.53
18 Bermuda North America NaN NaN NaN NaN 5.77 NaN NaN NaN 0.13 -5.64
19 Bhutan Asia-Pacific 0.50 0.42 3.03 0.63 4.84 0.28 0.34 4.38 5.27 0.43
Head of df_nan
Country 0
Region 0
Cropland Footprint 15
Grazing Footprint 15
Forest Footprint 15
Carbon Footprint 15
Total Ecological Footprint 0
Cropland 15
Grazing Land 15
Forest Land 15
Total Biocapacity 0
Biocapacity Deficit or Reserve 0
dtype: int64
Suppose you want to get the null count for each country from the "Cropland Footprint" column; then you can use the following code:
Unique_Country = df['Country'].unique()
Col1 = 'Cropland Footprint'
NullCount = []
for i in Unique_Country:
    s = df[df['Country']==i][Col1].isnull().sum()
    NullCount.append(s)
df2 = pd.DataFrame({'Country': Unique_Country,
                    'Number of NaN Values': NullCount})
df2 = df2[df2['Number of NaN Values']!=0]
df2
Output -
Country Number of NaN Values
Antigua and Barbuda 1
Aruba 1
Bermuda 1
If you want to get the null count for another column, just change the value of the Col1 variable.
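A vectorized alternative (a sketch using isna and sum instead of a loop; it counts NaNs across all columns for each country, which matches the "Aruba 4" style output asked for):
# one row per country: total NaN count across all columns
nan_counts = df.set_index('Country').isna().sum(axis=1)
df2 = (nan_counts[nan_counts > 0]
       .rename('Number of NaN Values')
       .reset_index())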

Reading in a .txt file to get time series from rows of years and columns of monthly values

How could I read in a txt file like the one from
https://psl.noaa.gov/data/correlation/pna.data (example below)
1960 -0.16 -0.22 -0.69 -0.07 0.99 1.20 1.11 1.85 -0.01 0.48 -0.52 1.15
1961 1.16 0.17 0.28 -1.14 -0.25 1.84 -0.52 0.47 1.10 -1.94 -0.40 -1.54
1962 -0.74 -0.54 -0.71 -1.50 -1.11 -0.97 -0.36 0.57 -0.83 1.33 0.53 -0.38
1963 0.09 0.79 -2.04 -0.79 -0.95 0.50 -1.10 -1.01 0.87 0.93 -0.31 1.46
1964 -0.44 1.36 -1.31 -1.30 -2.27 0.27 0.20 0.83 0.92 0.80 -0.78 -2.03
1965 -0.92 -1.03 -0.80 -1.07 -0.42 1.89 -1.26 0.32 0.36 1.42 -0.81 -1.56
into a pandas DataFrame so I can plot it as a time series, for example from 1960-1965, with each value column (corresponding to a month) plotted? I rarely use .txt files.
Here's what you can try:
import pandas as pd
import requests
import re
aa = requests.get("https://psl.noaa.gov/data/correlation/pna.data").text
aa = aa.split("\n")[1:-4]
aa = list(map(lambda x: x[1:], aa))
aa = "\n".join(aa)
aa = re.sub(" +", ",", aa)
with open("test.csv", "w") as f:
    f.write(aa)
df = pd.read_csv("test.csv", header=None, index_col=0).rename_axis('Year')
df.columns = list(pd.date_range(start='2021-01', freq='M', periods=12).month_name())
print(df.head())
df.to_csv("test.csv")
This will give you, in the test.csv file:
Year  January  February  March  ...  December
1948       73        67     67  773 ...
1949       73        67     67  773 ...
1950       73        67     67  773 ...
 ...      ...       ...    ...  ...
2021       73        88     84  733 ...
Use pd.read_fwf as suggested by @SanskarSingh:
>>> pd.read_fwf('data.txt', header=None, index_col=0).rename_axis('Year')
1 2 3 4 5 6 7 8 9 10 11 12
Year
1960 -0.16 -0.22 -0.69 -0.07 0.99 1.20 1.11 1.85 -0.01 0.48 -0.52 1.15
1961 1.16 0.17 0.28 -1.14 -0.25 1.84 -0.52 0.47 1.10 -1.94 -0.40 -1.54
1962 -0.74 -0.54 -0.71 -1.50 -1.11 -0.97 -0.36 0.57 -0.83 1.33 0.53 -0.38
1963 0.09 0.79 -2.04 -0.79 -0.95 0.50 -1.10 -1.01 0.87 0.93 -0.31 1.46
1964 -0.44 1.36 -1.31 -1.30 -2.27 0.27 0.20 0.83 0.92 0.80 -0.78 -2.03
1965 -0.92 -1.03 -0.80 -1.07 -0.42 1.89 -1.26 0.32 0.36 1.42 -0.81 -1.56
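To get the time-series plot the question asks about, a short sketch (assuming the frame produced by pd.read_fwf above) that stacks the monthly columns into one datetime-indexed series:
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_fwf('data.txt', header=None, index_col=0).rename_axis('Year')

# stack the 12 monthly columns into one long series indexed by (year, month)
long = df.stack()
long.index = pd.PeriodIndex(
    [f"{year}-{month:02d}" for year, month in long.index], freq="M"
).to_timestamp()

# plot the 1960-1965 window
long.loc["1960":"1965"].plot()
plt.ylabel("PNA index")
plt.show()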

pyGAM `y data is not in domain of logit link function`

I'm trying to find to what degree the chemical properties of a wine dataset influence the quality property of the dataset.
The error:
ValueError: y data is not in domain of logit link function. Expected
domain: [0.0, 1.0], but found [3.0, 9.0]
The code:
import pandas as pd
from pygam import LogisticGAM
white_data = pd.read_csv("winequality-white.csv",sep=';');
X = white_data[[
"fixed acidity","volatile acidity","citric acid","residual sugar","chlorides","free sulfur dioxide",
"total sulfur dioxide","density","pH","sulphates","alcohol"
]]
print(X.describe)
y = pd.Series(white_data["quality"]);
print(white_quality.describe)
white_gam = LogisticGAM().fit(X, y)
The output of said code:
<bound method NDFrame.describe of fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.0 0.27 0.36 20.7 0.045
1 6.3 0.30 0.34 1.6 0.049
2 8.1 0.28 0.40 6.9 0.050
3 7.2 0.23 0.32 8.5 0.058
4 7.2 0.23 0.32 8.5 0.058
... ... ... ... ... ...
4893 6.2 0.21 0.29 1.6 0.039
4894 6.6 0.32 0.36 8.0 0.047
4895 6.5 0.24 0.19 1.2 0.041
4896 5.5 0.29 0.30 1.1 0.022
4897 6.0 0.21 0.38 0.8 0.020
free sulfur dioxide total sulfur dioxide density pH sulphates \
0 45.0 170.0 1.00100 3.00 0.45
1 14.0 132.0 0.99400 3.30 0.49
2 30.0 97.0 0.99510 3.26 0.44
3 47.0 186.0 0.99560 3.19 0.40
4 47.0 186.0 0.99560 3.19 0.40
... ... ... ... ... ...
4893 24.0 92.0 0.99114 3.27 0.50
4894 57.0 168.0 0.99490 3.15 0.46
4895 30.0 111.0 0.99254 2.99 0.46
4896 20.0 110.0 0.98869 3.34 0.38
4897 22.0 98.0 0.98941 3.26 0.32
alcohol
0 8.8
1 9.5
2 10.1
3 9.9
4 9.9
... ...
4893 11.2
4894 9.6
4895 9.4
4896 12.8
4897 11.8
[4898 rows x 11 columns]>
<bound method NDFrame.describe of 0 6
1 6
2 6
3 6
4 6
..
4893 6
4894 5
4895 6
4896 7
4897 6
Name: quality, Length: 4898, dtype: int64>
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-71-e1c5720823a6> in <module>
16 print(white_quality.describe)
17
---> 18 white_gam = LogisticGAM().fit(X, y)
~/miniconda3/lib/python3.7/site-packages/pygam/pygam.py in fit(self, X, y, weights)
893
894 # validate data
--> 895 y = check_y(y, self.link, self.distribution, verbose=self.verbose)
896 X = check_X(X, verbose=self.verbose)
897 check_X_y(X, y)
~/miniconda3/lib/python3.7/site-packages/pygam/utils.py in check_y(y, link, dist, min_samples, verbose)
227 .format(link, get_link_domain(link, dist),
228 [float('%.2f'%np.min(y)),
--> 229 float('%.2f'%np.max(y))]))
230 return y
231
ValueError: y data is not in domain of logit link function. Expected domain: [0.0, 1.0], but found [3.0, 9.0]
The files: (I'm using Jupyter Notebook but I don't think you'd need to): https://drive.google.com/drive/folders/1RAj2Gh6WfdzpwtgbMaFVuvBVIWwoTUW5?usp=sharing
You probably want to use LinearGAM – LogisticGAM is for classification tasks.
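A minimal sketch of that swap (assuming the same X and y from the question):
from pygam import LinearGAM

# quality is an ordinal score (3-9), so fit it as a regression target instead
white_gam = LinearGAM().fit(X, y)
white_gam.summary()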

What is the effective way to have a pivot-table having pandas dataset columns as its rows?

Let's take as an example the following dataset:
make address all 3d our over length_total y
0 0.0 0.64 0.64 0.0 0.32 0.0 278 1
1 0.21 0.28 0.5 0.0 0.14 0.28 1028 1
2 0.06 0.0 0.71 0.0 1.23 0.19 2259 1
3 0.15 0.0 0.46 0.1 0.61 0.0 1257 1
4 0.06 0.12 0.77 0.0 0.19 0.32 749 1
5 0.0 0.0 0.0 0.0 0.0 0.0 21 1
6 0.0 0.0 0.25 0.0 0.38 0.25 184 1
7 0.0 0.69 0.34 0.0 0.34 0.0 261 1
8 0.0 0.0 0.0 0.0 0.9 0.0 25 1
9 0.0 0.0 1.42 0.0 0.71 0.35 205 1
10 0.0 0.0 0.0 0.0 0.0 0.0 23 0
11 0.48 0.0 0.0 0.0 0.48 0.0 37 0
12 0.12 0.0 0.25 0.0 0.0 0.0 491 0
13 0.08 0.08 0.25 0.2 0.0 0.25 807 0
14 0.0 0.0 0.0 0.0 0.0 0.0 38 0
15 0.24 0.0 0.12 0.0 0.0 0.12 227 0
16 0.0 0.0 0.0 0.0 0.75 0.0 77 0
17 0.1 0.0 0.21 0.0 0.0 0.0 571 0
18 0.51 0.0 0.0 0.0 0.0 0.0 74 0
19 0.3 0.0 0.15 0.0 0.0 0.15 155 0
I want to get a pivot table from the previous dataset in which the columns (make, address, all, 3d, our, over, length_total) have their mean values computed per value of the column y. The following table is the expected result:
y
1 0
make 0.048 0.183
address 0.173 0.008
all 0.509 0.098
3d 0.01 0.02
our 0.482 0.123
over 0.139 0.052
length_total 626.7 250
Is it possible to get the desired result through the pivot_table method of a pandas DataFrame? If so, how?
Is there a more effective way to do this?
Some people like using stack or unstack, but I prefer good ol' pd.melt to "flatten" or "unpivot" a frame:
>>> df_m = pd.melt(df, id_vars="y")
>>> df_m.pivot_table(index="variable", columns="y")
value
y 0 1
variable
3d 0.020 0.010
address 0.008 0.173
all 0.098 0.509
length_total 250.000 626.700
make 0.183 0.048
our 0.123 0.482
over 0.052 0.139
(If you want to preserve the original column order as the new row order, you can use .loc to index into this, something like df2.loc[df.columns].dropna()).
Melting does the flattening, and preserves y as a column, putting the old column names as a new column called "variable" (which can be changed if you like):
>>> pd.melt(df, id_vars="y").head()
y variable value
0 1 make 0.00
1 1 make 0.21
2 1 make 0.06
3 1 make 0.15
4 1 make 0.06
After that we can call pivot_table as we would ordinarily.
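For completeness, a groupby-based sketch (an alternative to melt + pivot_table, not part of the original answer) produces the same table, since a mean pivot over every value column is just a grouped mean transposed:
# group by y, take column means, then transpose so the old columns become rows
out = df.groupby("y").mean().T
print(out)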
