Can someone please tell me how I can fill in the missing values of my dataframe? The missing values don't come up as NaN or anything common; instead they show up as two dots, like .. How would I go about filling them in with the mean of the row they are in?
1971 1990 1999 2000 2001 2002
Estonia .. 17.4 8.3 8.5 8.5 8.6
Spain 61.6 151.2 205.9 222.2 233.2 241.6
SlovakRepublic 10.9 25.5 28.1 30.8 31.9 32.2
Slovenia .. 12.4 13.3 13.6 14.5 14.6
My column headers are the years and my index is the countries.
You can use mask: compare against the underlying numpy array (df.values), replace the matches with the row means, and finally cast all columns to float:
print (df.mean(axis=1))
Estonia 10.26
Spain 210.82
SlovakRepublic 29.70
Slovenia 13.68
df = df.mask(df.values == '..', df.mean(axis=1), axis=0).astype(float)
print (df)
1971 1990 1999 2000 2001 2002
Estonia 10.26 17.4 8.3 8.5 8.5 8.6
Spain 61.6 151.2 205.9 222.2 233.2 241.6
SlovakRepublic 10.9 25.5 28.1 30.8 31.9 32.2
Slovenia 13.68 12.4 13.3 13.6 14.5 14.6
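As an alternative sketch (assuming the placeholder is always the literal string '..'), you could first convert it to a real NaN and then fill each row with that row's mean:
import numpy as np
import pandas as pd
# small frame mirroring the question's data
df = pd.DataFrame({'1971': ['..', 61.6], '1990': [17.4, 151.2], '1999': [8.3, 205.9]},
                  index=['Estonia', 'Spain'])
# turn the '..' placeholder into NaN and cast to float
df = df.replace('..', np.nan).astype(float)
# fill each row's NaNs with that row's mean
df = df.apply(lambda row: row.fillna(row.mean()), axis=1)
print(df)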
You should be able to use .set_value:
try df_name.set_value('index', 'column', value)
something like
df_name.set_value('Estonia','1971', 50)
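Note that set_value is deprecated in newer pandas (and removed in 1.0); the modern single-cell equivalent is .at:
# same single-cell write with the .at accessor
df_name.at['Estonia', '1971'] = 50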
I'm trying to build a simple web scraper. I am trying to scrape a table, but I'm not sure why the output is School 20-5 33.2 repeated 26 times.
Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.maxpreps.com/rankings/basketball/1/state/michigan.htm'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
teams = soup.find_all('tr')
for team in teams:
    teamname = soup.find('th', class_ = "school").text
    record = soup.find('td', class_= "overall dw").text
    rating = soup.find('td', class_ = "rating sorted dw").text
    print(teamname, record, rating)
Notice that you're never using the Tag that team refers to. Inside the for loop, all of the calls to soup.find() should be calls to team.find():
for team in teams[1:]:
    teamname = team.find('th', class_ = "school").text
    record = team.find('td', class_= "overall dw").text
    rating = team.find('td', class_ = "rating sorted dw").text
    print(teamname, record, rating)
This outputs:
St. Mary's Prep (Orchard Lake) 20-5 33.2
University of Detroit Jesuit (Detroit) 16-7 30.0
Williamston 25-0 29.3
Ferndale 21-3 28.9
Catholic Central (Grand Rapids) 25-1 28.4
King (Detroit) 18-3 27.4
De La Salle Collegiate (Warren) 18-7 27.2
Catholic Central (Novi) 16-9 26.6
Brother Rice (Bloomfield Hills) 15-7 26.5
Unity Christian (Hudsonville) 21-1 26.4
Hamtramck 21-4 26.3
Grand Blanc 20-5 25.9
East Lansing 18-5 25.0
Muskegon 20-3 24.8
Northview (Grand Rapids) 25-1 24.6
Cass Tech (Detroit) 21-4 24.3
North Farmington (Farmington Hills) 18-4 24.2
Beecher (Flint) 23-2 24.0
Okemos 19-5 23.9
Benton Harbor 23-3 23.2
Rockford 19-3 22.9
Grand Haven 17-4 21.9
Hartland 19-4 21.0
Marshall 20-3 21.0
Freeland 24-0 21.0
We use [1:] to skip the table header, slicing off the first element in the teams list.
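If you'd rather collect the rows into a DataFrame than print them, a small sketch building on the loop above (the column names here are my own choice):
import pandas as pd
rows = []
for team in teams[1:]:
    rows.append({
        'school': team.find('th', class_="school").text,
        'record': team.find('td', class_="overall dw").text,
        'rating': team.find('td', class_="rating sorted dw").text,
    })
df = pd.DataFrame(rows)
print(df)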
Let pandas parse that table for you (it uses BeautifulSoup under the hood).
import pandas as pd
url = 'https://www.maxpreps.com/rankings/basketball/1/state/michigan.htm'
df = pd.read_html(url)[0]
Output:
print(df)
# School Ovr. Rating Str. +/-
0 1 St. Mary's Prep (Orchard Lake) 20-5 33.2 23.0 NaN
1 2 University of Detroit Jesuit (Detroit) 16-7 30.0 24.1 NaN
2 3 Williamston 25-0 29.3 10.9 NaN
3 4 Ferndale 21-3 28.9 16.5 NaN
4 5 Catholic Central (Grand Rapids) 25-1 28.4 11.4 NaN
5 6 King (Detroit) 18-3 27.4 15.2 NaN
6 7 De La Salle Collegiate (Warren) 18-7 27.2 19.6 2.0
7 8 Catholic Central (Novi) 16-9 26.6 22.6 -1.0
8 9 Brother Rice (Bloomfield Hills) 15-7 26.5 21.0 -1.0
9 10 Unity Christian (Hudsonville) 21-1 26.4 10.4 NaN
10 11 Hamtramck 21-4 26.3 14.5 2.0
11 12 Grand Blanc 20-5 25.9 15.3 -1.0
12 13 East Lansing 18-5 25.0 15.6 1.0
13 14 Muskegon 20-3 24.8 11.4 1.0
14 15 Northview (Grand Rapids) 25-1 24.6 8.2 1.0
15 16 Cass Tech (Detroit) 21-4 24.3 11.8 -4.0
16 17 North Farmington (Farmington Hills) 18-4 24.2 13.1 NaN
17 18 Beecher (Flint) 23-2 24.0 8.6 2.0
18 19 Okemos 19-5 23.9 13.7 -1.0
19 20 Benton Harbor 23-3 23.2 9.9 -1.0
20 21 Rockford 19-3 22.9 11.6 NaN
21 22 Grand Haven 17-4 21.9 11.3 NaN
22 23 Hartland 19-4 21.0 10.4 1.0
23 24 Marshall 20-3 21.0 8.6 -1.0
24 25 Freeland 24-0 21.0 2.7 4.0
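If you only want the three fields the original loop printed, and assuming the column labels come out exactly as printed above, you can then slice the parsed table:
print(df[['School', 'Ovr.', 'Rating']])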
I want to calculate a positive streak for numbers in a row in reverse fashion.
I tried using cumsum() but that's not helping me.
The DataFrame looks as follows with the expected output:
country score_1 score_2 score_3 score_4 score_5 expected_streak
U.S. 12.4 13.6 19.9 22 28.7 4
Africa 11.1 15.5 9.2 7 34.2 1
India 13.9 6.6 16.3 21.8 30.9 3
Australia 25.4 36.9 18.9 29 NaN 0
Malaysia 12.8 NaN -6.2 28.6 31.7 2
Argentina 40.7 NaN 16.3 20.1 39 2
Canada 56.4 NaN NaN -2 -1 1
So basically score_5 should be greater than score_4, score_4 greater than score_3, and so on, to get the streak count. The streak ends as soon as an earlier score is not smaller than the one after it (a missing value also ends it).
One way is to use diff with cummin:
df2 = df.filter(like="score_").loc[:, ::-1]
df["expected"] = df2.diff(-1, axis=1).gt(0).cummin(1).sum(1)
print(df)
Output:
country score_1 score_2 score_3 score_4 score_5 expected
0 U.S. 12.4 13.6 19.9 22.0 28.7 4
1 Africa 11.1 15.5 9.2 7.0 34.2 1
2 India 13.9 6.6 16.3 21.8 30.9 3
3 Australia 25.4 36.9 18.9 29.0 NaN 0
4 Malaysia 12.8 NaN -6.2 28.6 31.7 2
5 Argentina 40.7 NaN 16.3 20.1 39.0 2
6 Canada 56.4 NaN NaN -2.0 -1.0 1
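If the chained expression is hard to follow, here is the same logic broken into steps, as a sketch on a two-row subset of the data above:
import pandas as pd
df = pd.DataFrame({'country': ['U.S.', 'Africa'],
                   'score_1': [12.4, 11.1], 'score_2': [13.6, 15.5],
                   'score_3': [19.9, 9.2], 'score_4': [22.0, 7.0],
                   'score_5': [28.7, 34.2]})
# reverse the score columns so score_5 comes first
df2 = df.filter(like='score_').loc[:, ::-1]
# each column minus the one to its right: score_5 - score_4, score_4 - score_3, ...
diffs = df2.diff(-1, axis=1)
# True where the later score is strictly greater; comparisons with NaN become False
increases = diffs.gt(0)
# cummin keeps True only until the first False, so only the unbroken run survives
streak = increases.cummin(axis=1)
# count the surviving True values per row: U.S. -> 4, Africa -> 1
df['expected'] = streak.sum(axis=1)
print(df)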
I'm having some issues taking out the index numbers of some rows in my script.
I only want to keep the country names in my table. I already added dropna() to take out the empty rows, but I failed to get the index of the rows in Col1 that start with 'Total'.
The content of the pandas DataFrame looks like this:
Country is Col1, 1965 is Col2 and 1966 is Col3, and there are more columns (plus the index number).
In[1]:
Index Country 1965 1966
0 NaN NaN NaN
1 Canada 115.9 123.0
2 Mexico 25.0 26.4
3 US 1249.6 1320.0
4 Total... 1390.5 1469.5
5 NaN NaN NaN
6 Argentina 26.9 27.8
7 Brazil 22.5 24.5
8 Chile 6.2 6.6
9 Colombia 7.5 8.2
10 Ecuador 0.8 0.8
11 Peru 4.8 5.8
12 Trinidad... 3.0 3.2
13 Venezuela 16.4 16.6
14 Central... 4.3 4.4
15 Other... 15.0 15.7
16 Other... 2.6 3.0
17 Total... 110.0 116.6
18 NaN NaN NaN
19 Austria 15.8 16.6
My plan is to take the index numbers of these 'Total' rows using pandas and drop those lines; with this part of the data that would be rows 4 and 17 (I also use dropna() to take off the empty rows).
The trouble is that the index numbers stay the same when I take out the lines, and I'm stuck on the part where I get the index numbers of the rows whose Country column starts with 'Total'.
So I'd like to record these index numbers in a list to use as df.drop(index=numbers), with numbers being the list of index labels of the 'Total' rows.
So the output will be:
In[2]: df.drop(index=numbers)
Index Country 1965 1966
1 Canada 115.9 123.0
2 Mexico 25.0 26.4
3 US 1249.6 1320.0
6 Argentina 26.9 27.8
7 Brazil 22.5 24.5
8 Chile 6.2 6.6
9 Colombia 7.5 8.2
10 Ecuador 0.8 0.8
11 Peru 4.8 5.8
12 Trinidad... 3.0 3.2
13 Venezuela 16.4 16.6
14 Central... 4.3 4.4
15 Other... 15.0 15.7
16 Other... 2.6 3.0
19 Austria 15.8 16.6
I would use boolean indexing:
df = df[df['Country'].ne('Total...')].dropna(how='all')
or
# na=False so the NaN rows don't break the boolean mask
df = df[~df['Country'].str.startswith('Total', na=False)].dropna(how='all')
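If you specifically want the list of index labels to pass to df.drop, as described in the question, a small sketch:
# index labels of rows whose Country starts with 'Total' (na=False skips the NaN rows)
numbers = df.index[df['Country'].str.startswith('Total', na=False)].tolist()
df = df.drop(index=numbers).dropna(how='all')
print(df)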
I want to find the correlation between cities and Rainfall. Note that 'city' is categorical, not numerical.
I want to compare their rainfall.
How do I go about it? I haven't seen anything on here that talks about how to deal with duplicate cities with different data,
like
Date Location MinTemp MaxTemp Rainfall
12/1/2008 Albury 13.4 22.9 0.6
12/2/2008 Albury 7.4 25.1 0
12/3/2008 Albury 12.9 25.7 0
12/5/2008 Brisbane 20.5 29 9.6
12/6/2008 Brisbane 22.1 33.4 7.8
12/7/2008 Brisbane 22.6 33.4 12.4
12/8/2008 Brisbane 21.9 26.7 0
12/9/2008 Brisbane 19.5 27.6 0.2
12/10/2008 Brisbane 22.1 30.3 0.6
3/30/2011 Tuggeranong 9.8 25.2 0.4
3/31/2011 Tuggeranong 10.3 18.5 2.8
5/1/2011 Tuggeranong 5.5 20.8 0
5/2/2011 Tuggeranong 11 16.1 0
5/3/2011 Tuggeranong 7.3 17.5 0.6
8/29/2016 Woomera 15 22.9 0
8/30/2016 Woomera 12.5 22.1 12.8
8/31/2016 Woomera 8 20 0
9/1/2016 Woomera 11.6 21.4 0
9/2/2016 Woomera 11.2 19.6 0.3
9/3/2016 Woomera 7.1 20.4 0
9/4/2016 Woomera 6.5 18.6 0
9/5/2016 Woomera 7.3 21.5 0
One possible solution, if I understood you correctly (based on the title of the OP), is:
Step 1
Prepare a dataset with Locations as columns and Rainfall as rows (note, you will lose information here: every city's series is cut down to the length of the shortest one)
df2=df.groupby("Location")[["Location", "Rainfall"]].head(3) # head(3) is first 3 observations
df2.loc[:,"col"] = 4*["x1","x2","x3"] # 4 is number of unique cities
df3 = df2.pivot_table(index="col",columns="Location",values="Rainfall")
df3
Location Albury Brisbane Tuggeranong Woomera
col
x1 0.6 9.6 0.4 0.0
x2 0.0 7.8 2.8 12.8
x3 0.0 12.4 0.0 0.0
Step 2
Doing correlation matrix on the obtained dataset
df3.corr()
Location Albury Brisbane Tuggeranong Woomera
Location
Albury 1.000000 -0.124534 -0.381246 -0.500000
Brisbane -0.124534 1.000000 -0.869799 -0.797017
Tuggeranong -0.381246 -0.869799 1.000000 0.991241
Woomera -0.500000 -0.797017 0.991241 1.000000
An alternative, slightly more involved solution would be to keep the longest series and impute the missing values with means or medians.
But even though you would feed more data into your algorithm, it won't cure the main problem: your data seem to be misaligned. To do correlation analysis properly you should make sure that you compare comparable values, e.g. summer rainfall in one city with summer rainfall in another. That means having an equal number of comparable rainfall observations for each city: e.g. winter, spring, summer, autumn; or January, February, ..., December.
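A minimal sketch of that alignment idea, assuming the Date column parses as month/day/year as in the sample: average rainfall per city per calendar month, then correlate the aligned monthly series. (With the toy sample the cities barely overlap in time, so this only becomes meaningful on real data covering the same periods.)
import pandas as pd
# assuming df has the Date, Location and Rainfall columns shown above
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df['month'] = df['Date'].dt.to_period('M')
# average rainfall per city per month, cities side by side
monthly = df.pivot_table(index='month', columns='Location', values='Rainfall', aggfunc='mean')
# pairwise correlations on the aligned monthly series
print(monthly.corr())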
I tried searching around for an answer, but most of what I found was based on merging two dataframes; however, mine is within a single dataframe:
D1date D1price D2date D2price
1/2/2017 11.4 1/3/2017 11.3
1/3/2017 12.4 1/4/2017 12.3
1/4/2017 14.4 1/5/2017 12.4
1/5/2017 15.5 1/6/2017 12.5
Results I am looking for
D1date D1price D2price
1/2/2017 11.4 nan
1/3/2017 12.4 11.3
1/4/2017 14.4 12.3
1/5/2017 15.5 12.4
Can any kind soul advise me, please?
Use filter + join:
df = df.filter(like='D1').join(df.filter(like='D2').set_index('D2date'), on='D1date')
print (df)
D1date D1price D2price
0 1/2/2017 11.4 NaN
1 1/3/2017 12.4 11.3
2 1/4/2017 14.4 12.3
3 1/5/2017 15.5 12.4
Have you tried it like this:
df[['D1date', 'D1price']].merge(df[['D2date', 'D2price']], how='left', left_on='D1date', right_on='D2date')
You can add:
.drop('D2date', axis=1)
To remove D2date column.
Complete code:
df = df[['D1date', 'D1price']].merge(df[['D2date', 'D2price']], how='left', left_on='D1date', right_on='D2date').drop('D2date', axis=1)