I want to calculate a positive streak for numbers in a row in reverse fashion.
I tried using cumsum() but that's not helping me.
The DataFrame looks as follows with the expected output:
country score_1 score_2 score_3 score_4 score_5 expected_streak
U.S. 12.4 13.6 19.9 22 28.7 4
Africa 11.1 15.5 9.2 7 34.2 1
India 13.9 6.6 16.3 21.8 30.9 3
Australia 25.4 36.9 18.9 29 NaN 0
Malaysia 12.8 NaN -6.2 28.6 31.7 2
Argentina 40.7 NaN 16.3 20.1 39 2
Canada 56.4 NaN NaN -2 -1 1
So, basically score_5 should be greater than score_4 and so on... to get a count of streak. If a number is greater than score_5 the streak count ends.
One way using diff with cummin:
df2 = df.filter(like="score_").loc[:, ::-1]
df["expected"] = df2.diff(-1, axis=1).gt(0).cummin(1).sum(1)
print(df)
Output:
country score_1 score_2 score_3 score_4 score_5 expected
0 U.S. 12.4 13.6 19.9 22.0 28.7 4
1 Africa 11.1 15.5 9.2 7.0 34.2 1
2 India 13.9 6.6 16.3 21.8 30.9 3
3 Australia 25.4 36.9 18.9 29.0 NaN 0
4 Malaysia 12.8 NaN -6.2 28.6 31.7 2
5 Argentina 40.7 NaN 16.3 20.1 39.0 2
6 Canada 56.4 NaN NaN -2.0 -1.0 1
I'm having some issues to take out the index number of some rows on my script.
I wanna only take the Country names for my table, I already added "dropna()" to take out the empty rows, but I failed to take the index of the rows on the Col1 that starts with 'Total'.
The content on the panda file is like this:
The Country is the Col1, 1975 the Col2 and 1966 the Col3, more (also the index number)
In[1]:
Index Country 1965 1966
0 NaN NaN NaN
1 Canada 115.9 123.0
2 Mexico 25.0 26.4
3 US 1249.6 1320.0
4 Total... 1390.5 1469.5
5 NaN NaN NaN
6 Argentina 26.9 27.8
7 Brazil 22.5 24.5
8 Chile 6.2 6.6
9 Colombia 7.5 8.2
10 Ecuador 0.8 0.8
11 Peru 4.8 5.8
12 Trinidad... 3.0 3.2
13 Venezuela 16.4 16.6
14 Central... 4.3 4.4
15 Other... 15.0 15.7
16 Other... 2.6 3.0
17 Total... 110.0 116.6
18 NaN NaN NaN
19 Austria 15.8 16.6
My plan is to take the index row number of these 'Total' rows using pandas and drop these lines, with this part of the data, will the rows 4 and 17. (cause I also wrote the dropna() to take off the empty rows.
Because when I take out the lines, the index number stay the same, but I stucked on the part where I can take the index number using the rows where starts with 'Total' on the Country column.
So I'd like to record this index numbers on a list to use as df.drop(index=numbers), being numbers the list on the index rows of the 'Total' cells
So the output will be:
In[2]: df.drop(index=numbers)
Index Country 1965 1966
1 Canada 115.9 123.0
2 Mexico 25.0 26.4
3 US 1249.6 1320.0
6 Argentina 26.9 27.8
7 Brazil 22.5 24.5
8 Chile 6.2 6.6
9 Colombia 7.5 8.2
10 Ecuador 0.8 0.8
11 Peru 4.8 5.8
12 Trinidad... 3.0 3.2
13 Venezuela 16.4 16.6
14 Central... 4.3 4.4
15 Other... 15.0 15.7
16 Other... 2.6 3.0
19 Austria 15.8 16.6
I would use boolean indexing:
df = df[df['Country'].ne('Total...')].dropna(how='all')
or
df = df[~df['Country'].str.startswith('Total')].dropna(how='all')
I am analyzing data from excel file.
I want to create data frame by parsing data from excel using python.
Data in my excel file looks like as follow:
The first row highlighted in yellow contains match, which will be one of the columns in data frame that I wanted to create.
In fact, second row and 4th row are the name of the columns that I wanted to created in a new data frame.
3rd row and fifth row are the value of each column.
The sample here is only for one match.
I have multiple matches in the excel file.
I want to create a data frame that contain the column Match and all name in blue colors in the file.
I have attached the sample file that contains multiple matches.
Download the file here.
My expected data frame is
Match 1-0 2-0 2-1 3-0 3-1 3-2 4-0 4-1 4-2 4-3.......
MOL Vivi -vs- Chelsea 14 42 20 170 85 85 225 225 225 .....
Can anyone advise me how to parse the excel data and convert to data frame?
Thanks,
Zep
Use:
import pandas as pd
from datetime import datetime
df = pd.read_excel('test_match.xlsx')
#mask for check a-z in column HOME -vs- AWAY
m1 = df['HOME -vs- AWAY'].str.contains('[a-z]', na=False)
#create index by matches
df.index = df['HOME -vs- AWAY'].where(m1).ffill()
df.index.name = 'Match'
#remove same index and HOME -vs- AWAY column rows
df = df[df.index != df['HOME -vs- AWAY']].copy()
#test if datetime or string
m2 = df['HOME -vs- AWAY'].apply(lambda x: isinstance(x, datetime))
m3 = df['HOME -vs- AWAY'].apply(lambda x: isinstance(x, str))
#seelct next rows and set new columns names
df1 = df[m2.shift().fillna(False)]
df1.columns = df[m2].iloc[0]
#also remove only NaNs columns
df2 = df[m3.shift().fillna(False)].dropna(axis=1, how='all')
df2.columns = df[m3].iloc[0].dropna()
#join together
df = pd.concat([df1, df2], axis=1).astype(float).reset_index().rename_axis(None, axis=1)
print (df.head())
Match 2000-01-01 00:00:00 2000-02-01 00:00:00 \
0 MOL Vidi -vs- Chelsea 14.00 42.00
1 Lazio -vs- Eintracht Frankfurt 8.57 11.55
2 Sevilla -vs- FC Krasnodar 7.87 6.63
3 Villarreal -vs- Spartak Moscow 7.43 7.03
4 Rennes -vs- FC Astana 4.95 6.38
2018-02-01 00:00:00 2000-03-01 00:00:00 2018-03-01 00:00:00 \
0 20.00 170.00 85.00
1 7.87 23.80 15.55
2 7.87 8.72 8.65
3 7.07 10.00 9.43
4 7.33 12.00 13.20
2018-03-02 00:00:00 2000-04-01 00:00:00 2018-04-01 00:00:00 \
0 85.0 225.00 225.00
1 21.3 64.30 42.00
2 25.9 14.80 14.65
3 23.9 19.35 17.65
4 38.1 31.50 34.10
2018-04-02 00:00:00 ... 0-1 0-2 2018-01-02 00:00:00 \
0 225.0 ... 5.6 6.80 7.00
1 55.7 ... 11.0 19.05 10.45
2 38.1 ... 28.0 79.60 29.20
3 38.4 ... 20.9 58.50 22.70
4 81.4 ... 12.9 42.80 22.70
0-3 2018-01-03 00:00:00 2018-02-03 00:00:00 0-4 \
0 12.5 12.0 32.0 30.0
1 48.4 27.4 29.8 167.3
2 223.0 110.0 85.4 227.5
3 203.5 87.6 73.4 225.5
4 201.7 97.6 103.6 225.5
2018-01-04 00:00:00 2018-02-04 00:00:00 2018-03-04 00:00:00
0 29.0 60.0 220.0
1 91.8 102.5 168.3
2 227.5 227.5 227.5
3 225.5 225.5 225.5
4 225.5 225.5 225.5
[5 rows x 27 columns]
I'm learning pandas and when i display the data frame, it is displaying ? instead of NaN.
Why is it so?
CODE :
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-
databases/autos/imports-85.data"
df = pd.read_csv(url, header=None)
print(df.head())
headers = ["symboling", "normalized-losses", "make", "fuel-type",
"aspiration",
"num-of-doors", "body-style", "drive-wheels", "engine-location",
"wheel-base", "length", "width", "height", "curb-weight",
"engine-type", "num-of-cylinders", "engine-size", "fuel-system",
"bore", "stroke", "compression-ratio", "hoursepower", "peak-rpm",
"city-mpg", "highway-mpg", "price"]
df.columns=headers
print(df.head(30))
In data are missing values represented by ?, so for converting them is possible use parameter na_values, also names parameter in read_csv add columns by list, so assign is not necessary:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
"num-of-doors", "body-style", "drive-wheels", "engine-location",
"wheel-base", "length", "width", "height", "curb-weight",
"engine-type", "num-of-cylinders", "engine-size", "fuel-system",
"bore", "stroke", "compression-ratio", "hoursepower", "peak-rpm",
"city-mpg", "highway-mpg", "price"]
df = pd.read_csv(url, header=None, names=headers, na_values='?')
print(df.head(10))
symboling normalized-losses make fuel-type aspiration \
0 3 NaN alfa-romero gas std
1 3 NaN alfa-romero gas std
2 1 NaN alfa-romero gas std
3 2 164.0 audi gas std
4 2 164.0 audi gas std
5 2 NaN audi gas std
6 1 158.0 audi gas std
7 1 NaN audi gas std
8 1 158.0 audi gas turbo
9 0 NaN audi gas turbo
num-of-doors body-style drive-wheels engine-location wheel-base ... \
0 two convertible rwd front 88.6 ...
1 two convertible rwd front 88.6 ...
2 two hatchback rwd front 94.5 ...
3 four sedan fwd front 99.8 ...
4 four sedan 4wd front 99.4 ...
5 two sedan fwd front 99.8 ...
6 four sedan fwd front 105.8 ...
7 four wagon fwd front 105.8 ...
8 four sedan fwd front 105.8 ...
9 two hatchback 4wd front 99.5 ...
engine-size fuel-system bore stroke compression-ratio hoursepower \
0 130 mpfi 3.47 2.68 9.0 111.0
1 130 mpfi 3.47 2.68 9.0 111.0
2 152 mpfi 2.68 3.47 9.0 154.0
3 109 mpfi 3.19 3.40 10.0 102.0
4 136 mpfi 3.19 3.40 8.0 115.0
5 136 mpfi 3.19 3.40 8.5 110.0
6 136 mpfi 3.19 3.40 8.5 110.0
7 136 mpfi 3.19 3.40 8.5 110.0
8 131 mpfi 3.13 3.40 8.3 140.0
9 131 mpfi 3.13 3.40 7.0 160.0
peak-rpm city-mpg highway-mpg price
0 5000.0 21 27 13495.0
1 5000.0 21 27 16500.0
2 5000.0 19 26 16500.0
3 5500.0 24 30 13950.0
4 5500.0 18 22 17450.0
5 5500.0 19 25 15250.0
6 5500.0 19 25 17710.0
7 5500.0 19 25 18920.0
8 5500.0 17 20 23875.0
9 5500.0 16 22 NaN
[10 rows x 26 columns]
This information is here:
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names:
Missing Attribute Values: (denoted by "?")
Another solution: if you want to replace ? by NaN after reading the data, you can do this:
df_new = df.replace({'?':np.nan})
I have a data set consisting of 135 columns. I am trying to drop the columns which have empty data of more than 60%. There are some 40 columns approx in it. So, I wrote a function to drop this empty columns. But I am getting "Not contained in axis" error. Could some one help me solving this?. Or any other way to drop this 40 columns at once?
My function:
list_drop = df.isnull().sum()/(len(df))
def empty(df):
if list_drop > 0.5:
df.drop(list_drop,axis=1,inplace=True)
return df
Other method i tried:
df.drop(df.count()/len(df)<0.5,axis=1,inplace=True)
You could use isnull + sum and then use the mask to filter df.columns.
m = df.isnull().sum(0) / len(df) < 0.6
df = df[df.columns[m]]
Demo
df
A B C
0 29.0 NaN 26.6
1 NaN NaN 23.3
2 23.0 94.0 28.1
3 35.0 168.0 43.1
4 NaN NaN 25.6
5 32.0 88.0 31.0
6 NaN NaN 35.3
7 45.0 543.0 30.5
8 NaN NaN NaN
9 NaN NaN 37.6
10 NaN NaN 38.0
11 NaN NaN 27.1
12 23.0 846.0 30.1
13 19.0 175.0 25.8
14 NaN NaN 30.0
15 47.0 230.0 45.8
16 NaN NaN 29.6
17 38.0 83.0 43.3
18 30.0 96.0 34.6
m = df.isnull().sum(0) / len(df) < 0.3 # 0.3 as an example
m
A False
B False
C True
dtype: bool
df[df.columns[m]]
C
0 26.6
1 23.3
2 28.1
3 43.1
4 25.6
5 31.0
6 35.3
7 30.5
8 NaN
9 37.6
10 38.0
11 27.1
12 30.1
13 25.8
14 30.0
15 45.8
16 29.6
17 43.3
18 34.6