Python pandas rolling mean while retaining index and column - python

I have a pandas DataFrame of statistics for NBA games. Here's a sample of the data for away teams:
away_team away_efg away_drb away_score
date
2000-10-31 19:00:00 Los Angeles Clippers 0.522 74.4 94
2000-10-31 19:00:00 Milwaukee Bucks 0.434 63.0 93
2000-10-31 19:30:00 Minnesota Timberwolves 0.523 73.8 106
2000-10-31 19:30:00 Charlotte Hornets 0.605 77.1 106
2000-10-31 19:30:00 Seattle SuperSonics 0.429 73.1 88
There are many more numeric columns other than the away_score column, and also analogous columns for the home team.
What I would like is, for each row, to replace the numeric columns (other than score) with the mean of the previous three observations, partitioned by team. I can almost get what I want by doing the following:
home_df.groupby("team").apply(lambda x: x.rolling(window=3).mean())
This returns, for example,
>>> home_avg[home_avg["team"]=="Utah Jazz"].head()
3par ast blk drb efg ftr orb
0 NaN NaN NaN NaN NaN NaN NaN
50 NaN NaN NaN NaN NaN NaN NaN
81 0.146667 71.600000 9.4 74.666667 0.512000 0.347667 25.833333
Taking this, along with
>>> home_df[home_df["team"]=="Utah Jazz"].head()
3par ast blk drb efg ftr orb stl team tov trb
0 0.118 76.7 7.1 64.7 0.535 0.365 25.6 11.5 Utah Jazz 10.8 42.9
50 0.100 63.9 9.1 80.5 0.536 0.414 27.6 2.2 Utah Jazz 20.2 58.6
81 0.222 74.2 12.0 78.8 0.465 0.264 24.3 7.3 Utah Jazz 13.9 50.0
122 0.119 81.8 11.3 75.0 0.515 0.642 25.0 12.2 Utah Jazz 21.8 52.5
135 0.129 76.7 17.8 75.9 0.650 0.400 37.9 5.7 Utah Jazz 18.8 62.7
demonstrates that it is including the current row in the calculation of the mean. I want to avoid this. More specifically, the desired output for row 81 would be all NaNs (because there haven't been three games yet), and the entry in the 3par column for row 122 would be .146667 (the average of the values in that column for rows 0, 50, and 81).
So, my question is, how can I exclude the current row in the rolling mean calculation?

You can use shift here. It moves the values down by a given number of rows (the index itself stays in place), so the rolling window covers the last three values and excludes the current one:
import numpy as np
import pandas as pd
# create dummy data frame with numeric values
df = pd.DataFrame({"numeric_col": np.random.randint(0, 100, size=5)})
print(df)
numeric_col
0 66
1 60
2 74
3 41
4 83
df["mean"] = df["numeric_col"].shift(1).rolling(window=3).mean()
print(df)
numeric_col mean
0 66 NaN
1 60 NaN
2 74 NaN
3 41 66.666667
4 83 58.333333
Accordingly, change your apply function to lambda x: x.shift(1).rolling(window=3).mean() to make it work in your specific example.
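As a minimal sketch of how this might look per team, the snippet below uses groupby with transform, which returns a result aligned to the original index. The tiny frame, the team "Other Team" and the names numeric_cols and prev3 are made up for illustration; only the Utah Jazz 3par values come from the question.

```python
import pandas as pd

# Hypothetical miniature of home_df: two interleaved teams, one numeric column.
home_df = pd.DataFrame(
    {
        "team": ["Utah Jazz", "Other Team", "Utah Jazz", "Utah Jazz",
                 "Other Team", "Utah Jazz", "Utah Jazz"],
        "3par": [0.118, 0.300, 0.100, 0.222, 0.310, 0.119, 0.129],
    },
    index=[0, 10, 50, 81, 100, 122, 135],
)

numeric_cols = ["3par"]  # assumed list of columns to average

# Within each team, shift by one row so the window only sees earlier games,
# then take the mean over the three previous observations.
prev3 = home_df.groupby("team")[numeric_cols].transform(
    lambda s: s.shift(1).rolling(window=3).mean()
)

# Re-attach the team column; the original index is retained.
home_avg = home_df[["team"]].join(prev3)
print(home_avg[home_avg["team"] == "Utah Jazz"])
```

With this data, row 81 is NaN (only two earlier Utah Jazz games exist) and row 122 holds the mean of rows 0, 50 and 81.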

Related

Calculate Positive Streak for Pandas Rows in reverse

I want to calculate a positive streak for numbers in a row in reverse fashion.
I tried using cumsum() but that's not helping me.
The DataFrame looks as follows with the expected output:
country score_1 score_2 score_3 score_4 score_5 expected_streak
U.S. 12.4 13.6 19.9 22 28.7 4
Africa 11.1 15.5 9.2 7 34.2 1
India 13.9 6.6 16.3 21.8 30.9 3
Australia 25.4 36.9 18.9 29 NaN 0
Malaysia 12.8 NaN -6.2 28.6 31.7 2
Argentina 40.7 NaN 16.3 20.1 39 2
Canada 56.4 NaN NaN -2 -1 1
So basically, score_5 should be greater than score_4, score_4 greater than score_3, and so on, to get the streak count. As soon as a score is not greater than the one before it, the streak count ends.
One way using diff with cummin:
df2 = df.filter(like="score_").loc[:, ::-1]
df["expected"] = df2.diff(-1, axis=1).gt(0).cummin(1).sum(1)
print(df)
Output:
country score_1 score_2 score_3 score_4 score_5 expected
0 U.S. 12.4 13.6 19.9 22.0 28.7 4
1 Africa 11.1 15.5 9.2 7.0 34.2 1
2 India 13.9 6.6 16.3 21.8 30.9 3
3 Australia 25.4 36.9 18.9 29.0 NaN 0
4 Malaysia 12.8 NaN -6.2 28.6 31.7 2
5 Argentina 40.7 NaN 16.3 20.1 39.0 2
6 Canada 56.4 NaN NaN -2.0 -1.0 1
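For reference, here is a self-contained sketch of the same diff and cummin idea that rebuilds the sample frame from the question. The axis arguments are spelled out, and the boolean mask is cast to int before cummin, which is an extra step added here for clarity rather than part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "country": ["U.S.", "Africa", "India", "Australia",
                    "Malaysia", "Argentina", "Canada"],
        "score_1": [12.4, 11.1, 13.9, 25.4, 12.8, 40.7, 56.4],
        "score_2": [13.6, 15.5, 6.6, 36.9, np.nan, np.nan, np.nan],
        "score_3": [19.9, 9.2, 16.3, 18.9, -6.2, 16.3, np.nan],
        "score_4": [22.0, 7.0, 21.8, 29.0, 28.6, 20.1, -2.0],
        "score_5": [28.7, 34.2, 30.9, np.nan, 31.7, 39.0, -1.0],
    }
)

# Reverse the score columns so that score_5 comes first.
rev = df.filter(like="score_").loc[:, ::-1]

# Each column minus the next one: score_5 - score_4, score_4 - score_3, ...
# Comparisons involving NaN evaluate to False, which ends the streak.
increasing = rev.diff(-1, axis=1).gt(0).astype(int)

# cummin keeps 1 only until the first 0; the row sum is the streak length.
df["expected"] = increasing.cummin(axis=1).sum(axis=1)
print(df[["country", "expected"]])
```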

Dropping specific rows in pandas using the index number

I'm having trouble getting the index numbers of some rows in my script.
I only want to keep the country names in my table. I already added dropna() to remove the empty rows, but I could not get the index of the rows where Col1 starts with 'Total'.
The content of the pandas DataFrame looks like this:
Country is Col1, 1965 is Col2, 1966 is Col3, and so on (the index number is also shown).
In[1]:
Index Country 1965 1966
0 NaN NaN NaN
1 Canada 115.9 123.0
2 Mexico 25.0 26.4
3 US 1249.6 1320.0
4 Total... 1390.5 1469.5
5 NaN NaN NaN
6 Argentina 26.9 27.8
7 Brazil 22.5 24.5
8 Chile 6.2 6.6
9 Colombia 7.5 8.2
10 Ecuador 0.8 0.8
11 Peru 4.8 5.8
12 Trinidad... 3.0 3.2
13 Venezuela 16.4 16.6
14 Central... 4.3 4.4
15 Other... 15.0 15.7
16 Other... 2.6 3.0
17 Total... 110.0 116.6
18 NaN NaN NaN
19 Austria 15.8 16.6
My plan is to get the index numbers of these 'Total' rows with pandas and drop those rows; in this part of the data, that would be rows 4 and 17 (I also call dropna() to remove the empty rows).
When I remove rows, the index numbers of the remaining rows stay the same, but I am stuck on getting the index numbers of the rows where the Country column starts with 'Total'.
So I'd like to store these index numbers in a list and use it as df.drop(index=numbers), where numbers is the list of index labels of the 'Total' rows.
So the output will be:
In[2]: df.drop(index=numbers)
Index Country 1965 1966
1 Canada 115.9 123.0
2 Mexico 25.0 26.4
3 US 1249.6 1320.0
6 Argentina 26.9 27.8
7 Brazil 22.5 24.5
8 Chile 6.2 6.6
9 Colombia 7.5 8.2
10 Ecuador 0.8 0.8
11 Peru 4.8 5.8
12 Trinidad... 3.0 3.2
13 Venezuela 16.4 16.6
14 Central... 4.3 4.4
15 Other... 15.0 15.7
16 Other... 2.6 3.0
19 Austria 15.8 16.6
I would use boolean indexing:
df = df[df['Country'].ne('Total...')].dropna(how='all')
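or, to match every value that starts with 'Total' (na=False treats the NaN rows as non-matches, so that ~ works on the mask):
df = df[~df['Country'].str.startswith('Total', na=False)].dropna(how='all')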
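If you do want the list of index numbers described in the question, a sketch along these lines might work. The frame is a shortened copy of the sample data, the name is_total is made up here, and na=False is assumed so that the NaN rows count as non-matches.

```python
import numpy as np
import pandas as pd

# Shortened copy of the sample data from the question.
df = pd.DataFrame(
    {
        "Country": [np.nan, "Canada", "Mexico", "US", "Total...", np.nan, "Argentina"],
        "1965": [np.nan, 115.9, 25.0, 1249.6, 1390.5, np.nan, 26.9],
        "1966": [np.nan, 123.0, 26.4, 1320.0, 1469.5, np.nan, 27.8],
    }
)

# Boolean mask of the 'Total' rows; NaN entries are treated as False.
is_total = df["Country"].str.startswith("Total", na=False)

# Index labels of the matching rows, as a plain list.
numbers = df.index[is_total].tolist()

# Drop the 'Total' rows by label, then the all-NaN rows.
result = df.drop(index=numbers).dropna(how="all")
print(numbers)
print(result)
```

The remaining rows keep their original index labels, as in the desired output.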

parsing data in excel file to create data frame

I am analyzing data from an Excel file.
I want to create a data frame by parsing the data from Excel using Python.
The data in my Excel file looks as follows:
The first row, highlighted in yellow, contains the match, which will be one of the columns in the data frame that I want to create.
The second and fourth rows contain the names of the columns that I want to create in the new data frame.
The third and fifth rows contain the values for each column.
The sample here is for only one match.
I have multiple matches in the Excel file.
I want to create a data frame that contains the column Match and all the names shown in blue in the file.
I have attached a sample file that contains multiple matches.
Download the file here.
My expected data frame is
Match 1-0 2-0 2-1 3-0 3-1 3-2 4-0 4-1 4-2 4-3.......
MOL Vidi -vs- Chelsea 14 42 20 170 85 85 225 225 225 .....
Can anyone advise me how to parse the excel data and convert to data frame?
Thanks,
Zep
Use:
import pandas as pd
from datetime import datetime
df = pd.read_excel('test_match.xlsx')
#mask for check a-z in column HOME -vs- AWAY
m1 = df['HOME -vs- AWAY'].str.contains('[a-z]', na=False)
#create index by matches
df.index = df['HOME -vs- AWAY'].where(m1).ffill()
df.index.name = 'Match'
#remove same index and HOME -vs- AWAY column rows
df = df[df.index != df['HOME -vs- AWAY']].copy()
#test if datetime or string
m2 = df['HOME -vs- AWAY'].apply(lambda x: isinstance(x, datetime))
m3 = df['HOME -vs- AWAY'].apply(lambda x: isinstance(x, str))
#select next rows and set new columns names
df1 = df[m2.shift(fill_value=False)]
df1.columns = df[m2].iloc[0]
#also remove only NaNs columns
df2 = df[m3.shift(fill_value=False)].dropna(axis=1, how='all')
df2.columns = df[m3].iloc[0].dropna()
#join together
df = pd.concat([df1, df2], axis=1).astype(float).reset_index().rename_axis(None, axis=1)
print (df.head())
Match 2000-01-01 00:00:00 2000-02-01 00:00:00 \
0 MOL Vidi -vs- Chelsea 14.00 42.00
1 Lazio -vs- Eintracht Frankfurt 8.57 11.55
2 Sevilla -vs- FC Krasnodar 7.87 6.63
3 Villarreal -vs- Spartak Moscow 7.43 7.03
4 Rennes -vs- FC Astana 4.95 6.38
2018-02-01 00:00:00 2000-03-01 00:00:00 2018-03-01 00:00:00 \
0 20.00 170.00 85.00
1 7.87 23.80 15.55
2 7.87 8.72 8.65
3 7.07 10.00 9.43
4 7.33 12.00 13.20
2018-03-02 00:00:00 2000-04-01 00:00:00 2018-04-01 00:00:00 \
0 85.0 225.00 225.00
1 21.3 64.30 42.00
2 25.9 14.80 14.65
3 23.9 19.35 17.65
4 38.1 31.50 34.10
2018-04-02 00:00:00 ... 0-1 0-2 2018-01-02 00:00:00 \
0 225.0 ... 5.6 6.80 7.00
1 55.7 ... 11.0 19.05 10.45
2 38.1 ... 28.0 79.60 29.20
3 38.4 ... 20.9 58.50 22.70
4 81.4 ... 12.9 42.80 22.70
0-3 2018-01-03 00:00:00 2018-02-03 00:00:00 0-4 \
0 12.5 12.0 32.0 30.0
1 48.4 27.4 29.8 167.3
2 223.0 110.0 85.4 227.5
3 203.5 87.6 73.4 225.5
4 201.7 97.6 103.6 225.5
2018-01-04 00:00:00 2018-02-04 00:00:00 2018-03-04 00:00:00
0 29.0 60.0 220.0
1 91.8 102.5 168.3
2 227.5 227.5 227.5
3 225.5 225.5 225.5
4 225.5 225.5 225.5
[5 rows x 27 columns]
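The key step above is labelling every row with the most recent match name by combining where and ffill. Below is a toy illustration of just that step; the frame, the team names and the variables raw, is_match and body are invented for this sketch and do not come from the attached file.

```python
import pandas as pd

# Hypothetical single-column layout: a match name followed by its data rows.
raw = pd.DataFrame(
    {
        "HOME -vs- AWAY": [
            "Team A -vs- Team B", "1-0", "14",
            "Team C -vs- Team D", "1-0", "8.57",
        ]
    }
)

# Rows containing lowercase letters are treated as match names.
is_match = raw["HOME -vs- AWAY"].str.contains("[a-z]", na=False)

# Keep the match names, blank out everything else, then fill downwards.
raw["Match"] = raw["HOME -vs- AWAY"].where(is_match).ffill()

# The match-name rows themselves are no longer needed as data rows.
body = raw[~is_match]
print(body)
```

Each remaining row now carries the name of the match it belongs to, which is what makes the later split into header rows and value rows possible.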

Why is pandas showing "?" instead of NaN

I'm learning pandas, and when I display the data frame, it shows ? instead of NaN.
Why is it so?
CODE :
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
df = pd.read_csv(url, header=None)
print(df.head())
headers = ["symboling", "normalized-losses", "make", "fuel-type",
"aspiration",
"num-of-doors", "body-style", "drive-wheels", "engine-location",
"wheel-base", "length", "width", "height", "curb-weight",
"engine-type", "num-of-cylinders", "engine-size", "fuel-system",
"bore", "stroke", "compression-ratio", "hoursepower", "peak-rpm",
"city-mpg", "highway-mpg", "price"]
df.columns=headers
print(df.head(30))
In this data, missing values are represented by ?, so you can convert them with the na_values parameter of read_csv. The names parameter also sets the column names from a list, so assigning df.columns afterwards is not necessary:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
"num-of-doors", "body-style", "drive-wheels", "engine-location",
"wheel-base", "length", "width", "height", "curb-weight",
"engine-type", "num-of-cylinders", "engine-size", "fuel-system",
"bore", "stroke", "compression-ratio", "hoursepower", "peak-rpm",
"city-mpg", "highway-mpg", "price"]
df = pd.read_csv(url, header=None, names=headers, na_values='?')
print(df.head(10))
symboling normalized-losses make fuel-type aspiration \
0 3 NaN alfa-romero gas std
1 3 NaN alfa-romero gas std
2 1 NaN alfa-romero gas std
3 2 164.0 audi gas std
4 2 164.0 audi gas std
5 2 NaN audi gas std
6 1 158.0 audi gas std
7 1 NaN audi gas std
8 1 158.0 audi gas turbo
9 0 NaN audi gas turbo
num-of-doors body-style drive-wheels engine-location wheel-base ... \
0 two convertible rwd front 88.6 ...
1 two convertible rwd front 88.6 ...
2 two hatchback rwd front 94.5 ...
3 four sedan fwd front 99.8 ...
4 four sedan 4wd front 99.4 ...
5 two sedan fwd front 99.8 ...
6 four sedan fwd front 105.8 ...
7 four wagon fwd front 105.8 ...
8 four sedan fwd front 105.8 ...
9 two hatchback 4wd front 99.5 ...
engine-size fuel-system bore stroke compression-ratio hoursepower \
0 130 mpfi 3.47 2.68 9.0 111.0
1 130 mpfi 3.47 2.68 9.0 111.0
2 152 mpfi 2.68 3.47 9.0 154.0
3 109 mpfi 3.19 3.40 10.0 102.0
4 136 mpfi 3.19 3.40 8.0 115.0
5 136 mpfi 3.19 3.40 8.5 110.0
6 136 mpfi 3.19 3.40 8.5 110.0
7 136 mpfi 3.19 3.40 8.5 110.0
8 131 mpfi 3.13 3.40 8.3 140.0
9 131 mpfi 3.13 3.40 7.0 160.0
peak-rpm city-mpg highway-mpg price
0 5000.0 21 27 13495.0
1 5000.0 21 27 16500.0
2 5000.0 19 26 16500.0
3 5500.0 24 30 13950.0
4 5500.0 18 22 17450.0
5 5500.0 19 25 15250.0
6 5500.0 19 25 17710.0
7 5500.0 19 25 18920.0
8 5500.0 17 20 23875.0
9 5500.0 16 22 NaN
[10 rows x 26 columns]
This information is here:
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names:
Missing Attribute Values: (denoted by "?")
Another solution: if you want to replace ? by NaN after reading the data, you can do this:
import numpy as np
df_new = df.replace({'?': np.nan})
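One difference between the two approaches: with na_values the affected columns are parsed as numbers straight away, while after replace they still hold strings and may need an explicit conversion. A small sketch with made-up inline data (the two-row CSV text and the shortened column list are assumptions, not the real file):

```python
import io

import numpy as np
import pandas as pd

# Made-up two-row sample in the same style as the real file.
csv_text = "3,?,alfa-romero,13495\n2,164,audi,?\n"
cols = ["symboling", "normalized-losses", "make", "price"]

# Option 1: treat "?" as missing while reading.
df_a = pd.read_csv(io.StringIO(csv_text), header=None, names=cols, na_values="?")

# Option 2: read as-is, replace "?" afterwards, then convert to numbers.
df_b = pd.read_csv(io.StringIO(csv_text), header=None, names=cols)
df_b = df_b.replace({"?": np.nan})
for col in ["normalized-losses", "price"]:
    df_b[col] = pd.to_numeric(df_b[col])

print(df_a.dtypes)
print(df_b.dtypes)
```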

Dropping multiple columns in pandas at once

I have a data set consisting of 135 columns. I am trying to drop the columns in which more than 60% of the values are empty; there are approximately 40 such columns. So I wrote a function to drop these empty columns, but I am getting a "Not contained in axis" error. Could someone help me solve this? Or is there another way to drop these 40 columns at once?
My function:
list_drop = df.isnull().sum()/(len(df))
def empty(df):
    if list_drop > 0.5:
        df.drop(list_drop,axis=1,inplace=True)
    return df
Another method I tried:
df.drop(df.count()/len(df)<0.5,axis=1,inplace=True)
You could use isnull + sum and then use the mask to filter df.columns.
m = df.isnull().sum(0) / len(df) < 0.6
df = df[df.columns[m]]
Demo
df
A B C
0 29.0 NaN 26.6
1 NaN NaN 23.3
2 23.0 94.0 28.1
3 35.0 168.0 43.1
4 NaN NaN 25.6
5 32.0 88.0 31.0
6 NaN NaN 35.3
7 45.0 543.0 30.5
8 NaN NaN NaN
9 NaN NaN 37.6
10 NaN NaN 38.0
11 NaN NaN 27.1
12 23.0 846.0 30.1
13 19.0 175.0 25.8
14 NaN NaN 30.0
15 47.0 230.0 45.8
16 NaN NaN 29.6
17 38.0 83.0 43.3
18 30.0 96.0 34.6
m = df.isnull().sum(0) / len(df) < 0.3 # 0.3 as an example
m
A False
B False
C True
dtype: bool
df[df.columns[m]]
C
0 26.6
1 23.3
2 28.1
3 43.1
4 25.6
5 31.0
6 35.3
7 30.5
8 NaN
9 37.6
10 38.0
11 27.1
12 30.1
13 25.8
14 30.0
15 45.8
16 29.6
17 43.3
18 34.6
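The attempts in the question fail because drop expects column labels, not a Series of fractions or booleans. If you prefer to keep using drop, one possible variant is to turn the mask into labels first. This is a sketch on a made-up frame; the names missing_fraction, to_drop and trimmed are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame: A is 40% missing, B is 80% missing, C is 20% missing.
df = pd.DataFrame(
    {
        "A": [29.0, np.nan, 23.0, np.nan, 35.0],
        "B": [np.nan, np.nan, np.nan, np.nan, 94.0],
        "C": [26.6, 23.3, 28.1, 43.1, np.nan],
    }
)

# The mean of a boolean mask is the fraction of missing values per column.
missing_fraction = df.isna().mean()

# Labels of the columns with more than 60% missing values.
to_drop = missing_fraction[missing_fraction > 0.6].index

# drop receives labels here, so no "not contained in axis" error occurs.
trimmed = df.drop(columns=to_drop)
print(trimmed.columns.tolist())
```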
