Dropping specific rows in pandas using the index number - python

I'm having trouble extracting the index numbers of certain rows in my script.
I only want to keep the country names in my table. I've already added dropna() to remove the empty rows, but I can't work out how to get the indices of the rows whose Col1 value starts with 'Total'.
The content of the DataFrame looks like this:
Country is Col1, 1965 is Col2, and 1966 is Col3 (plus the index number):
In[1]:
Index Country 1965 1966
0 NaN NaN NaN
1 Canada 115.9 123.0
2 Mexico 25.0 26.4
3 US 1249.6 1320.0
4 Total... 1390.5 1469.5
5 NaN NaN NaN
6 Argentina 26.9 27.8
7 Brazil 22.5 24.5
8 Chile 6.2 6.6
9 Colombia 7.5 8.2
10 Ecuador 0.8 0.8
11 Peru 4.8 5.8
12 Trinidad... 3.0 3.2
13 Venezuela 16.4 16.6
14 Central... 4.3 4.4
15 Other... 15.0 15.7
16 Other... 2.6 3.0
17 Total... 110.0 116.6
18 NaN NaN NaN
19 Austria 15.8 16.6
My plan is to find the index numbers of these 'Total' rows using pandas and drop those lines; for this part of the data, that would be rows 4 and 17 (I've also used dropna() to remove the empty rows).
When I drop lines the index numbers stay the same, but I'm stuck on the part where I collect the index numbers of the rows that start with 'Total' in the Country column.
I'd like to record these index numbers in a list to use as df.drop(index=numbers), where numbers is the list of index values of the 'Total' rows.
The output would then be:
In[2]: df.drop(index=numbers)
Index Country 1965 1966
1 Canada 115.9 123.0
2 Mexico 25.0 26.4
3 US 1249.6 1320.0
6 Argentina 26.9 27.8
7 Brazil 22.5 24.5
8 Chile 6.2 6.6
9 Colombia 7.5 8.2
10 Ecuador 0.8 0.8
11 Peru 4.8 5.8
12 Trinidad... 3.0 3.2
13 Venezuela 16.4 16.6
14 Central... 4.3 4.4
15 Other... 15.0 15.7
16 Other... 2.6 3.0
19 Austria 15.8 16.6

I would use boolean indexing:
df = df[df['Country'].ne('Total...')].dropna(how='all')
or, matching on the prefix instead (na=False keeps the NaN rows from breaking the boolean mask):
df = df[~df['Country'].str.startswith('Total', na=False)].dropna(how='all')
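If you specifically want the index numbers in a list first, as described in the question, here is a minimal sketch along the same lines (assuming the 'Total' rows are identified by that prefix in Country):
# Collect the index labels of rows whose Country starts with 'Total'
numbers = df.index[df['Country'].str.startswith('Total', na=False)].tolist()
# numbers would be [4, 17] for the sample data above
df = df.drop(index=numbers).dropna(how='all')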

Related

Calculate Positive Streak for Pandas Rows in reverse

I want to calculate a positive streak for numbers in a row in reverse fashion.
I tried using cumsum() but that's not helping me.
The DataFrame looks as follows with the expected output:
country score_1 score_2 score_3 score_4 score_5 expected_streak
U.S. 12.4 13.6 19.9 22 28.7 4
Africa 11.1 15.5 9.2 7 34.2 1
India 13.9 6.6 16.3 21.8 30.9 3
Australia 25.4 36.9 18.9 29 NaN 0
Malaysia 12.8 NaN -6.2 28.6 31.7 2
Argentina 40.7 NaN 16.3 20.1 39 2
Canada 56.4 NaN NaN -2 -1 1
So, basically score_5 should be greater than score_4, and so on, to get the streak count. Working backwards from score_5, the streak ends at the first score that is not greater than the one after it.
One way, using diff with cummin:
# Reverse the score_ columns, diff each score against the previous one,
# mark increases with gt(0), cut the streak at the first non-increase
# with cummin, and count the surviving Trues with sum
df2 = df.filter(like="score_").loc[:, ::-1]
df["expected"] = df2.diff(-1, axis=1).gt(0).cummin(axis=1).sum(axis=1)
print(df)
Output:
country score_1 score_2 score_3 score_4 score_5 expected
0 U.S. 12.4 13.6 19.9 22.0 28.7 4
1 Africa 11.1 15.5 9.2 7.0 34.2 1
2 India 13.9 6.6 16.3 21.8 30.9 3
3 Australia 25.4 36.9 18.9 29.0 NaN 0
4 Malaysia 12.8 NaN -6.2 28.6 31.7 2
5 Argentina 40.7 NaN 16.3 20.1 39.0 2
6 Canada 56.4 NaN NaN -2.0 -1.0 1
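For reference, a minimal setup that reproduces the sample frame above (values copied from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": ["U.S.", "Africa", "India", "Australia", "Malaysia", "Argentina", "Canada"],
    "score_1": [12.4, 11.1, 13.9, 25.4, 12.8, 40.7, 56.4],
    "score_2": [13.6, 15.5, 6.6, 36.9, np.nan, np.nan, np.nan],
    "score_3": [19.9, 9.2, 16.3, 18.9, -6.2, 16.3, np.nan],
    "score_4": [22.0, 7.0, 21.8, 29.0, 28.6, 20.1, -2.0],
    "score_5": [28.7, 34.2, 30.9, np.nan, 31.7, 39.0, -1.0],
})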

To scrape the data from span tag using beautifulsoup

I am trying to scrape a webpage where I need to decode an entire table into a dataframe. I am using Beautiful Soup for this purpose. In certain td tags there are span tags which do not have any text, but the values are shown on the webpage in those span tags.
The following html code corresponds to that webpage,
<td>
<span class="nttu">::after</span>
<span class="ntbb">::after</span>
<span class="ntyc">::after</span>
<span class="nttu">::after</span>
</td>
But the value shown in this td tag is 23.8. I tried to scrape it, but I got an empty string.
How can I scrape this value using Beautiful Soup?
URL: https://en.tutiempo.net/climate/ws-432950.html
My code for scraping the table is given below:
import pandas as pd
import requests
from bs4 import BeautifulSoup

http_url = "https://en.tutiempo.net/climate/01-2013/ws-432950.html"
retrieved_data = requests.get(http_url).text
soup = BeautifulSoup(retrieved_data, "lxml")
climate_table = soup.find("table", attrs={"class": "medias mensuales numspan"})
climate_data = climate_table.find_all("tr")
# climate_df is assumed to be an empty DataFrame created earlier,
# with one column per <td> of the table
for data in climate_data[1:-2]:
    table_data = data.find_all("td")
    row_data = []
    for row in table_data:
        row_data.append(row.get_text())
    climate_df.loc[len(climate_df)] = row_data
I misunderstood your question at first since you referenced two different URLs; I see now what you mean.
Yeah, it is odd that in that second table they used CSS to fill in the content of some of those <td> tags. What you need to do is pull those special cases out of the <style> tag. Once you have them, you can replace the corresponding elements within the HTML source and finally parse it into a dataframe. I used pandas, as it uses BeautifulSoup under the hood to parse <table> tags. I believe this will get you what you want:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

http_url = "https://en.tutiempo.net/climate/01-2013/ws-432950.html"
retrieved_data = requests.get(http_url).text
soup = BeautifulSoup(retrieved_data, "lxml")

# The page's second <style> tag holds rules like span.xyz::after {content: "1"};
# build a class -> hidden value lookup from them
hiddenData = str(soup.find_all('style')[1])
hiddenSpan = {}
for group in re.findall(r'span\.(.+?)}', hiddenData):
    class_attr = group.split('span.')[-1].split('::')[0]
    content = group.split('"')[1]
    hiddenSpan[class_attr] = content

# Substitute the hidden values back into the table's HTML, then let
# pandas (which uses BeautifulSoup under the hood) parse the table
climate_table = str(soup.find("table", attrs={"class": "medias mensuales numspan"}))
for k, v in hiddenSpan.items():
    climate_table = climate_table.replace('<span class="%s"></span>' % k, v)
df = pd.read_html(climate_table)[0]
Output:
print (df.to_string())
Day T TM Tm SLP H PP VV V VM VG RA SN TS FG
0 1 23.4 30.3 19 - 59 0 6.3 4.3 5.4 - NaN NaN NaN NaN
1 2 22.4 30.3 16.9 - 57 0 6.9 3.3 7.6 - NaN NaN NaN NaN
2 3 24 31.8 16.9 - 51 0 6.9 2.8 5.4 - NaN NaN NaN NaN
3 4 24.2 32 17.4 - 53 0 6 3.3 5.4 - NaN NaN NaN NaN
4 5 23.8 32 18 - 58 0 6.9 3.1 7.6 - NaN NaN NaN NaN
5 6 23.3 31 18.3 - 60 0 6.9 5 9.4 - NaN NaN NaN NaN
6 7 22.8 30.2 17.6 - 55 0 7.7 3.7 7.6 - NaN NaN NaN NaN
7 8 23.1 30.6 17.4 - 46 0 6.9 3.3 5.4 - NaN NaN NaN NaN
8 9 22.9 30.6 17.4 - 51 0 6.9 3.5 3.5 - NaN NaN NaN NaN
9 10 22.3 30 17 - 56 0 6.3 3.3 7.6 - NaN NaN NaN NaN
10 11 22.3 29.4 17 - 53 0 6.9 4.3 7.6 - NaN NaN NaN NaN
11 12 21.8 29.4 15.7 - 54 0 6.9 2.8 3.5 - NaN NaN NaN NaN
12 13 22.3 30.1 15.7 - 43 0 6.9 2.8 5.4 - NaN NaN NaN NaN
13 14 21.8 30.6 14.8 - 41 0 6.9 1.9 5.4 - NaN NaN NaN NaN
14 15 21.6 30.6 14.2 - 43 0 6.9 3.1 7.6 - NaN NaN NaN NaN
15 16 21.1 29.9 15.4 - 55 0 6.9 4.1 7.6 - NaN NaN NaN NaN
16 17 20.4 28.1 15.4 - 59 0 6.9 5 11.1 - NaN NaN NaN NaN
17 18 21.2 28.3 14.5 - 53 0 6.9 3.1 7.6 - NaN NaN NaN NaN
18 19 21.6 29.6 16.4 - 58 0 6.9 2.2 3.5 - NaN NaN NaN NaN
19 20 21.9 29.6 16.6 - 58 0 6.9 2.4 5.4 - NaN NaN NaN NaN
20 21 22.3 29.9 17.5 - 55 0 6.9 3.1 5.4 - NaN NaN NaN NaN
21 22 21.9 29.9 15.1 - 46 0 6.9 4.3 7.6 - NaN NaN NaN NaN
22 23 21.3 29 15.2 - 50 0 6.9 3.3 5.4 - NaN NaN NaN NaN
23 24 21.3 28.8 14.6 - 45 0 6.9 3 5.4 - NaN NaN NaN NaN
24 25 21.6 29.1 15.5 - 47 0 7.7 4.8 7.6 - NaN NaN NaN NaN
25 26 21.8 29.2 14.6 - 41 0 6.9 2.8 3.5 - NaN NaN NaN NaN
26 27 22.3 30.1 15.6 - 40 0 6.9 2.4 5.4 - NaN NaN NaN NaN
27 28 22.4 30.3 16 - 51 0 6.9 2.8 3.5 - NaN NaN NaN NaN
28 29 23 30.3 16.9 - 53 0 6.6 2.8 5.4 - NaN NaN NaN o
29 30 23.1 30 17.8 - 54 0 6.9 5.4 7.6 - NaN NaN NaN NaN
30 31 22.1 29.8 17.3 - 54 0 6.9 5.2 9.4 - NaN NaN NaN NaN
31 Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals: Monthly means and totals:
32 NaN 22.3 30 16.4 - 51.6 0 6.9 3.5 6.3 NaN 0 0 0 1
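To see the substitution trick in isolation, here is a minimal sketch on a single td like the one in the question. The style rules and the class-to-character mapping below are invented for illustration; the real ones live in the page's <style> tag:
import re

# Hypothetical style rules of the shape the page uses
style_text = ('span.aa::after {content: "2"}span.bb::after {content: "3"}'
              'span.cc::after {content: "."}span.dd::after {content: "8"}')
hiddenSpan = {}
for group in re.findall(r'span\.(.+?)}', style_text):
    hiddenSpan[group.split('::')[0]] = group.split('"')[1]

td_html = ('<td><span class="aa"></span><span class="bb"></span>'
           '<span class="cc"></span><span class="dd"></span></td>')
for k, v in hiddenSpan.items():
    td_html = td_html.replace('<span class="%s"></span>' % k, v)
print(td_html)  # <td>23.8</td>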

Concatenate multiple unequal dataframes on condition

I have 7 dataframes (df_1, df_2, df_3, ..., df_7), all with the same columns but different lengths, which sometimes contain the same values.
I'd like to concatenate all 7 dataframes under the conditions that:
if df_n.iloc[row_i] != df_n+1.iloc[row_i] and df_n.iloc[row_i][0] < df_n+1.iloc[row_i][0]:
    pd.concat([df_n.iloc[row_i], df_n+1.iloc[row_i], df_n+2.iloc[row_i],
               ..., df_n+6.iloc[row_i]])
where df_n.iloc[row_i] is the ith row of the nth dataframe and df_n.iloc[row_i][0] is the first column of that row.
For example, if we only had 2 dataframes with len(df_1) < len(df_2) and we applied the conditions above, the input would be:
df_1 df_2
index 0 1 2 index 0 1 2
0 12.12 11.0 31 0 12.2 12.6 30
1 12.3 12.1 33 1 12.3 12.1 33
2 10 9.1 33 2 13 12.1 23
3 16 12.1 33 3 13.1 12.1 27
4 14.4 13.1 27
5 15.2 13.2 28
And the output would be:
conditions -> pd.concat([df_1, df_2]):
index 0 1 2 3 4 5
0 12.12 11.0 31 12.2 12.6 30
2 10 9.1 33 13 12.1 23
4 nan 14.4 13.1 27
5 nan 15.2 13.2 28
Is there an easy way to do this?
IIUC, concat first, then groupby over the column labels to get the differences, and implement your condition on those:
s = pd.concat([df1, df2], axis=1)
# For each column label, the difference between df1's and df2's values
s1 = s.groupby(level=0, axis=1).apply(lambda x: x.iloc[:, 0] - x.iloc[:, 1])
# Keep rows that differ somewhere with df1's first column smaller,
# plus rows that exist only in df2 (NaN difference in the first column)
yourdf = s[s1.ne(0).any(axis=1) & s1.iloc[:, 0].lt(0) | s1.iloc[:, 0].isnull()]
Out[487]:
0 1 2 0 1 2
index
0 12.12 11.0 31.0 12.2 12.6 30
2 10.00 9.1 33.0 13.0 12.1 23
4 NaN NaN NaN 14.4 13.1 27
5 NaN NaN NaN 15.2 13.2 28
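For a runnable check, the two example frames can be built like this (values copied from the question; the default integer index stands in for the question's index column):
import pandas as pd

df1 = pd.DataFrame({0: [12.12, 12.3, 10, 16],
                    1: [11.0, 12.1, 9.1, 12.1],
                    2: [31, 33, 33, 33]})
df2 = pd.DataFrame({0: [12.2, 12.3, 13, 13.1, 14.4, 15.2],
                    1: [12.6, 12.1, 12.1, 12.1, 13.1, 13.2],
                    2: [30, 33, 23, 27, 27, 28]})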

Why is pandas showing "?" instead of NaN

I'm learning pandas, and when I display the DataFrame it shows ? instead of NaN.
Why is that?
Code:
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
df = pd.read_csv(url, header=None)
print(df.head())
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
           "num-of-doors", "body-style", "drive-wheels", "engine-location",
           "wheel-base", "length", "width", "height", "curb-weight",
           "engine-type", "num-of-cylinders", "engine-size", "fuel-system",
           "bore", "stroke", "compression-ratio", "hoursepower", "peak-rpm",
           "city-mpg", "highway-mpg", "price"]
df.columns = headers
print(df.head(30))
The data represents missing values as ?, so you can convert them by passing the na_values parameter; the names parameter of read_csv also assigns the column names from the list, so the separate assignment isn't necessary:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
"num-of-doors", "body-style", "drive-wheels", "engine-location",
"wheel-base", "length", "width", "height", "curb-weight",
"engine-type", "num-of-cylinders", "engine-size", "fuel-system",
"bore", "stroke", "compression-ratio", "hoursepower", "peak-rpm",
"city-mpg", "highway-mpg", "price"]
df = pd.read_csv(url, header=None, names=headers, na_values='?')
print(df.head(10))
symboling normalized-losses make fuel-type aspiration \
0 3 NaN alfa-romero gas std
1 3 NaN alfa-romero gas std
2 1 NaN alfa-romero gas std
3 2 164.0 audi gas std
4 2 164.0 audi gas std
5 2 NaN audi gas std
6 1 158.0 audi gas std
7 1 NaN audi gas std
8 1 158.0 audi gas turbo
9 0 NaN audi gas turbo
num-of-doors body-style drive-wheels engine-location wheel-base ... \
0 two convertible rwd front 88.6 ...
1 two convertible rwd front 88.6 ...
2 two hatchback rwd front 94.5 ...
3 four sedan fwd front 99.8 ...
4 four sedan 4wd front 99.4 ...
5 two sedan fwd front 99.8 ...
6 four sedan fwd front 105.8 ...
7 four wagon fwd front 105.8 ...
8 four sedan fwd front 105.8 ...
9 two hatchback 4wd front 99.5 ...
engine-size fuel-system bore stroke compression-ratio hoursepower \
0 130 mpfi 3.47 2.68 9.0 111.0
1 130 mpfi 3.47 2.68 9.0 111.0
2 152 mpfi 2.68 3.47 9.0 154.0
3 109 mpfi 3.19 3.40 10.0 102.0
4 136 mpfi 3.19 3.40 8.0 115.0
5 136 mpfi 3.19 3.40 8.5 110.0
6 136 mpfi 3.19 3.40 8.5 110.0
7 136 mpfi 3.19 3.40 8.5 110.0
8 131 mpfi 3.13 3.40 8.3 140.0
9 131 mpfi 3.13 3.40 7.0 160.0
peak-rpm city-mpg highway-mpg price
0 5000.0 21 27 13495.0
1 5000.0 21 27 16500.0
2 5000.0 19 26 16500.0
3 5500.0 24 30 13950.0
4 5500.0 18 22 17450.0
5 5500.0 19 25 15250.0
6 5500.0 19 25 17710.0
7 5500.0 19 25 18920.0
8 5500.0 17 20 23875.0
9 5500.0 16 22 NaN
[10 rows x 26 columns]
This is documented in the dataset description at https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names:
Missing Attribute Values: (denoted by "?")
Another solution: if you want to replace ? with NaN after reading the data, you can do this (with numpy imported as np):
df_new = df.replace({'?': np.nan})

Dropping multiple columns in pandas at once

I have a data set consisting of 135 columns. I am trying to drop the columns that have more than 60% empty data; there are roughly 40 of them. I wrote a function to drop these empty columns, but I am getting a "not contained in axis" error. Could someone help me solve this, or suggest another way to drop those 40 columns at once?
My function:
list_drop = df.isnull().sum() / len(df)
def empty(df):
    if list_drop > 0.5:
        df.drop(list_drop, axis=1, inplace=True)
    return df
Other method I tried:
df.drop(df.count() / len(df) < 0.5, axis=1, inplace=True)
You could use isnull + sum and then use the resulting mask to filter df.columns:
m = df.isnull().sum(0) / len(df) < 0.6  # True for columns less than 60% empty
df = df[df.columns[m]]
Demo
df
A B C
0 29.0 NaN 26.6
1 NaN NaN 23.3
2 23.0 94.0 28.1
3 35.0 168.0 43.1
4 NaN NaN 25.6
5 32.0 88.0 31.0
6 NaN NaN 35.3
7 45.0 543.0 30.5
8 NaN NaN NaN
9 NaN NaN 37.6
10 NaN NaN 38.0
11 NaN NaN 27.1
12 23.0 846.0 30.1
13 19.0 175.0 25.8
14 NaN NaN 30.0
15 47.0 230.0 45.8
16 NaN NaN 29.6
17 38.0 83.0 43.3
18 30.0 96.0 34.6
m = df.isnull().sum(0) / len(df) < 0.3 # 0.3 as an example
m
A False
B False
C True
dtype: bool
df[df.columns[m]]
C
0 26.6
1 23.3
2 28.1
3 43.1
4 25.6
5 31.0
6 35.3
7 30.5
8 NaN
9 37.6
10 38.0
11 27.1
12 30.1
13 25.8
14 30.0
15 45.8
16 29.6
17 43.3
18 34.6
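As an aside (not part of the answer above, just a sketch of an alternative): pandas' dropna can express the same rule directly, since thresh sets the minimum number of non-null values a column needs in order to survive:
# Keep columns with at least 40% non-null values,
# i.e. drop columns that are more than 60% empty
df = df.dropna(axis=1, thresh=int(len(df) * 0.4))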
