How to convert list to pandas DataFrame? - python

I use BeautifulSoup to get some data from a webpage:
import pandas as pd
import requests
from bs4 import BeautifulSoup
res = requests.get("http://www.nationmaster.com/country-info/stats/Media/Internet-users")
soup = BeautifulSoup(res.content,'html5lib')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
df.head()
But df is a list, not the pandas DataFrame as I expected from using pd.read_html.
How can I get pandas DataFrame out of it?

You can use read_html with your url:
df = pd.read_html("http://www.nationmaster.com/country-info/stats/Media/Internet-users")[0]
And then if necessary remove GRAPH and HISTORY columns and replace NaNs in column # by forward filling:
df = df.drop(['GRAPH','HISTORY'], axis=1)
df['#'] = df['#'].ffill()
print(df.head())
# COUNTRY AMOUNT DATE
0 1 China 389 million 2009
1 2 United States 245 million 2009
2 3 Japan 99.18 million 2009
3 3 Group of 7 countries (G7) average (profile) 80.32 million 2009
4 4 Brazil 75.98 million 2009
print(df.tail())
# COUNTRY AMOUNT DATE
244 214 Niue 1100 2009
245 =215 Saint Helena, Ascension, and Tristan da Cunha 900 2009
246 =215 Saint Helena 900 2009
247 217 Tokelau 800 2008
248 218 Christmas Island 464 2001
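To illustrate why indexing with [0] is needed: pd.read_html returns a plain Python list holding one DataFrame per table it finds. The sketch below avoids any network access by building a stand-in list by hand; the name tables and the two small frames are assumptions for illustration, not data from the page.

```python
import pandas as pd

# Hypothetical stand-in for the list that pd.read_html returns:
# one DataFrame per <table> found in the document.
tables = [
    pd.DataFrame({"COUNTRY": ["China", "United States"],
                  "AMOUNT": ["389 million", "245 million"]}),
    pd.DataFrame({"COUNTRY": ["Japan"],
                  "AMOUNT": ["99.18 million"]}),
]

# Pick a single table by position ...
first = tables[0]

# ... or stack every table into one DataFrame if they share columns.
combined = pd.concat(tables, ignore_index=True)

print(type(tables).__name__, type(first).__name__)
print(combined)
```

If only one table is of interest, indexing is enough; pd.concat is only useful when several tables have the same layout.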

Related

Pandas inner Join does not give complete results

I have two csvs
I am trying to join the two dataframes where the StockNumber matches, but the join only returns 77 results instead of the expected 140.
Here is my code:
import pandas as pd
df=pd.read_csv('C:/Users/Starshine/Desktop/vdp.csv')
df = df.iloc[: , :-1]
df['StockNumber']=df['URL'].str[-8:]
df['StockNumber']=df['StockNumber'].str.strip('/')
df['StockNumber']=df['StockNumber'].str.strip('-')
df.to_csv('C:/Users/Starshine/Desktop/panda.csv',index=False)
dfs=pd.read_csv('C:/Users/Starshine/Desktop/a2.csv')
dfs.rename(columns={'Stock #': 'StockNumber'}, inplace=True)
dfs = dfs.iloc[: , :-2]
dfs['Stock']=df['StockNumber']
sf=pd.merge(dfs,df,on='StockNumber')
sf.to_csv('C:/Users/Starshine/Desktop/test21.csv',index=False)
print (sf)
What am I doing wrong here?
pandas.merge is case sensitive. You have to lowercase both columns before the merge.
Try this :
import pandas as pd
df=pd.read_csv('C:/Users/Starshine/Desktop/vdp.csv')
df['StockNumber']=df['URL'].str.rsplit('-').str[-1].str.strip('/').str.lower()
dfs=pd.read_csv('C:/Users/Starshine/Desktop/a2.csv')
dfs.rename(columns={'Stock #': 'StockNumber'}, inplace=True)
dfs['StockNumber'] = dfs['StockNumber'].str.lower()
sf=pd.merge(dfs,df,on='StockNumber')
Result (there are exactly 139 matches, not 140):
print(sf)
Vehicle StockNumber \
0 2012 Ford Fusion S a26131
1 2020 Chevrolet Malibu LS 1FL b98795
2 2010 Hyundai Santa Fe GLS 571849
3 2019 Dodge Charger GT c32026
4 2019 Toyota Camry SE 500754
.. ... ...
134 2014 Hyundai Santa Fe Sport 2.4L 656191
135 2015 Jeep Wrangler Unlimited Rubicon 206164
136 2012 Mercedes-Benz E-Class E 350 4MATIC? 545815
137 2013 Lexus GX 460 Premium c60862
138 2007 Ford F-450SD XL DRW c42901
URL Images
0 www.site.com/2007-ford-f450-super-duty-crew-ca... 0
1 www.site.com/2020-ford-f150-supercrew-cab-lari... 0
2 www.site.com/2012-mercedes-benz-e-class-e-350-... 0
3 www.site.com/2014-hyundai-santa-fe-sport-sport... 0
4 www.site.com/2013-nissan-rogue-sv-sport-utilit... 0
.. ... ...
134 www.site.com/2015-nissan-rogue-select-s-sport-... 0
135 www.site.com/2016-chevrolet-ss-sedan-4d-206164/ 0
136 www.site.com/2018-volkswagen-atlas-se-sport-ut... 41
137 www.site.com/2014-lexus-rx-rx-350-sport-utilit... 0
138 www.site.com/2017-ford-f150-supercrew-cab-xlt-... 0
[139 rows x 4 columns]
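When a join returns fewer rows than expected, it can help to see which keys failed to match. The following is a minimal sketch, assuming made-up frames left and right rather than the CSV files from the question: each key column is normalized from its own values (stripped and lowercased), and an outer merge with indicator=True marks where every row came from.

```python
import pandas as pd

# Hypothetical sample data; the real frames would come from the two CSV files.
left = pd.DataFrame({"StockNumber": ["A26131", " b98795", "571849"],
                     "Vehicle": ["Fusion", "Malibu", "Santa Fe"]})
right = pd.DataFrame({"StockNumber": ["a26131", "B98795", "999999"],
                      "URL": ["u1", "u2", "u3"]})

# Normalize each key column from its own values before merging.
for frame in (left, right):
    frame["StockNumber"] = frame["StockNumber"].astype(str).str.strip().str.lower()

# An outer merge keeps unmatched rows; the _merge column says which side each row came from.
check = left.merge(right, on="StockNumber", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]

print(check["_merge"].value_counts())
print(unmatched)
```

Rows marked left_only or right_only are the keys worth inspecting for stray whitespace, case differences, or truncated values.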

Sorting grouped DataFrame column without changing index sorting

I have a df as below:
I want only the top 5 countries from each year but keeping the year ascending.
First I grouped the df by year and country name and then ran the following code:
df.sort_values(['year','hydro_total'], ascending=False).groupby(['year']).head(5)
The result didn't keep the index ascending, instead, it sorted the year index too. How do I get the top 5 countries and keep the year's group ascending?
The CSV file is uploaded HERE .
You already sort by year and hydro_total, but both in descending order. You need to sort year ascending and hydro_total descending:
(df.sort_values(['year','hydro_total'],
ascending=[True,False])
.groupby('year').head(5)
)
Output:
country year hydro_total hydro_per_person
440 Japan 1971 7240000.0 0.06890
160 China 1971 2580000.0 0.00308
240 India 1971 2410000.0 0.00425
760 North Korea 1971 788000.0 0.05380
800 Pakistan 1971 316000.0 0.00518
... ... ... ... ...
199 China 2010 62100000.0 0.04630
279 India 2010 9840000.0 0.00803
479 Japan 2010 7070000.0 0.05590
1119 Turkey 2010 4450000.0 0.06120
839 Pakistan 2010 2740000.0 0.01580
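The same pattern on a tiny frame may make the behaviour easier to check by hand. This is a sketch with invented values (countries A, B, C and a top 2 instead of a top 5), not the data from the question:

```python
import pandas as pd

# Hypothetical data: three countries over two years.
df = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [1972, 1972, 1972, 1971, 1971, 1971],
    "hydro_total": [5, 9, 7, 3, 1, 2],
})

# Sort years ascending and totals descending, then keep the first 2 rows per year.
top2 = (
    df.sort_values(["year", "hydro_total"], ascending=[True, False])
      .groupby("year")
      .head(2)
)
print(top2)
```

Because groupby(...).head(n) preserves the row order of the sorted frame, the years stay ascending while each year's rows remain ordered from largest to smallest total.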

How to add conditional row to pandas dataframe

I tried looking for a succinct answer and nothing helped. I am trying to add a row to a dataframe that has a string in the first column and, for each remaining column, the sum of that column. I ran into a scalar issue, so I tried to make the desired row into a Series and then convert it to a dataframe, but apparently I was adding four rows with one column value instead of one row with the four column values.
My code:
def country_csv():
    # loop through absolute paths of each file in source
    for filename in os.listdir(source):
        filepath = os.path.join(source, filename)
        if not os.path.isfile(filepath):
            continue
        df = pd.read_csv(filepath)
        df = df.groupby(['Country']).sum()
        df.reset_index()
        print(df)
        # df.to_csv(os.path.join(path1, filename))
Sample dataframe:
Confirmed Deaths Recovered
Country
Afghanistan 299 7 10
Albania 333 20 99
Would like to see this as the first row
World 632 27 109
import pandas as pd
import datetime as dt
df
Confirmed Deaths Recovered
Country
Afghanistan 299 7 10
Albania 333 20 99
df.loc['World'] = [df['Confirmed'].sum(),df['Deaths'].sum(),df['Recovered'].sum()]
df.sort_values(by=['Confirmed'], ascending=False)
Confirmed Deaths Recovered
Country
World 632 27 109
Albania 333 20 99
Afghanistan 299 7 10
IIUC, you can create a dict then repass it into a dataframe to concat.
data = df.sum(axis=0).to_dict()
data.update({'Country' : 'World'})
df2 = pd.concat([pd.DataFrame(data,index=[0]).set_index('Country'),df],axis=0)
print(df2)
Confirmed Deaths Recovered
Country
World 632 27 109
Afghanistan 299 7 10
Albania 333 20 99
Or a one-liner using assign and transpose:
df2 = pd.concat(
[df.sum(axis=0).to_frame().T.assign(Country="World").set_index("Country"), df],
axis=0,
)
print(df2)
Confirmed Deaths Recovered
Country
World 632 27 109
Afghanistan 299 7 10
Albania 333 20 99

In a pandas dataframe, count the number of times a condition occurs in one column?

Background
I have five years of NO2 measurement data, in csv files-one file for every location and year. I have loaded all the files into pandas dataframes in the same format:
Date Hour Location NO2_Level
0 01/01/2016 00 Street 18
1 01/01/2016 01 Street 39
2 01/01/2016 02 Street 129
3 01/01/2016 03 Street 76
4 01/01/2016 04 Street 40
Goal
For each dataframe count the number of times NO2_Level is greater than 150 and output this.
So I wrote a loop that creates all the dataframes from the right directories and cleans them appropriately.
Problem
Whatever I've tried produces results that I know on inspection are incorrect, e.g.:
- the count value for every location in a given year is the same (possible but unlikely)
- for a year when I know the count should be positive, every location returns 0
What I've tried
I have tried a lot of approaches to getting this value for each dataframe, such as making the column a series:
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
Using DataFrame.count():
count = df[df['NO2_Level'] >= 150].count()
These two approaches have gotten closest to what I want to output
Example to test on
data = {'Date': ['01/01/2016','01/02/2016',' 01/03/2016', '01/04/2016', '01/05/2016'], 'Hour': ['00', '01', '02', '03', '04'], 'Location': ['Street','Street','Street','Street','Street',], 'NO2_Level': [18, 39, 129, 76, 40]}
df = pd.DataFrame(data=data)
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
count
Expected Outputs
So from this I'm trying to get it to output a single line for each dataframe that was made in the format Location, year, count (of condition):
Kirkstall Road,2013,47
Haslewood Close,2013,97
...
Jack Lane Hunslet,2015,158
So the above example would produce
Street, 2016, 1
Actual
Within each year, every location produces the same result, and for some years (2014) the count doesn't seem to work at all, even though on inspection there should be matches:
Kirkstall Road,2013,47
Haslewood Close,2013,47
Tilbury Terrace,2013,47
Corn Exchange,2013,47
Temple Newsam,2014,0
Queen Street Morley,2014,0
Corn Exchange,2014,0
Tilbury Terrace,2014,0
Haslewood Close,2015,43
Tilbury Terrace,2015,43
Corn Exchange,2015,43
Jack Lane Hunslet,2015,43
Norman Rows,2015,43
Hopefully this helps.
import pandas as pd
ddict = {
'Date':['2016-01-01','2016-01-01','2016-01-01','2016-01-01','2016-01-01','2016-01-02',],
'Hour':['00','01','02','03','04','02'],
'Location':['Street','Street','Street','Street','Street','Street',],
'N02_Level':[19,39,129,76,40, 151],
}
df = pd.DataFrame(ddict)
# Convert dates to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Make a Year column
df['Year'] = df['Date'].apply(lambda x: x.strftime('%Y'))
# Group by location and year, counting rows where N02_Level > 150
df1 = df[df['N02_Level'] > 150].groupby(['Location','Year']).size().reset_index(name='Count')
# Iterate over the results
for i in range(len(df1)):
    loc = df1['Location'][i]
    yr = df1['Year'][i]
    cnt = df1['Count'][i]
    print(f'{loc},{yr},{cnt}')
### To not use f-strings
for i in range(len(df1)):
    print('{loc},{yr},{cnt}'.format(loc=df1['Location'][i], yr=df1['Year'][i], cnt=df1['Count'][i]))
Sample data:
Date Hour Location N02_Level
0 2016-01-01 00 Street 19
1 2016-01-01 01 Street 39
2 2016-01-01 02 Street 129
3 2016-01-01 03 Street 76
4 2016-01-01 04 Street 40
5 2016-01-02 02 Street 151
Output:
Street,2016,1
Here is a solution using a randomly generated sample:
import numpy as np
import pandas as pd
def random_dates(start, end, n):
    # Convert the nanosecond timestamps to whole seconds, then sample uniformly
    start_u = start.value // 10 ** 9
    end_u = end.value // 10 ** 9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
location = ['street', 'avenue', 'road', 'town', 'campaign']
df = pd.DataFrame({'Date' : random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-12-31'), 20),
                   'Location' : np.random.choice(location, 20),
                   'NOE_level' : np.random.randint(low=130, high= 200, size=20)})
# Keep only the year for Date
df['Date'] = df['Date'].dt.strftime("%Y")
print(df)
# Count the readings above 150 for each location and year
df = df.groupby(['Location', 'Date'])['NOE_level'].apply(lambda x: (x>150).sum()).reset_index(name='count')
print(df)
Example df generated:
Date Location NOE_level
0 2018 town 191
1 2017 campaign 187
2 2017 town 137
3 2016 avenue 148
4 2017 campaign 195
5 2018 town 181
6 2018 road 187
7 2018 town 184
8 2016 town 155
9 2016 street 183
10 2018 road 136
11 2017 road 171
12 2018 street 165
13 2015 avenue 193
14 2016 campaign 170
15 2016 street 132
16 2016 campaign 165
17 2015 road 161
18 2018 road 161
19 2015 road 140
Output:
Location Date count
0 avenue 2015 1
1 avenue 2016 0
2 campaign 2016 2
3 campaign 2017 2
4 road 2015 1
5 road 2017 1
6 road 2018 2
7 street 2016 1
8 street 2018 1
9 town 2016 1
10 town 2017 0
11 town 2018 3
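If the dataframes are kept separate, one per location and year as in the question, the count can also be taken frame by frame. This is a minimal sketch under that assumption: the frames dict, its keys, and the helper count_exceedances are hypothetical names, and the values are invented. Computing the count from the frame passed in, rather than from a variable left over from an earlier loop iteration, is one way to avoid getting the same number for every location.

```python
import pandas as pd

# Hypothetical container: one DataFrame per (location, year).
frames = {
    ("Street", 2016): pd.DataFrame({"NO2_Level": [18, 39, 129, 151, 160]}),
    ("Avenue", 2016): pd.DataFrame({"NO2_Level": [10, 20, 30]}),
}

def count_exceedances(frame, threshold=150):
    """Count rows whose NO2_Level is strictly above the threshold."""
    # A boolean mask sums to the number of True values.
    return int((frame["NO2_Level"] > threshold).sum())

rows = [(loc, year, count_exceedances(frame))
        for (loc, year), frame in frames.items()]

for loc, year, cnt in rows:
    print(f"{loc},{year},{cnt}")
```

Unlike filtering first and then grouping, this also reports locations whose count is 0.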

Join dataframes based on partial string-match between columns

I have a dataframe of movies, and I want to check whether they are present in another df.
after_h.sample(10, random_state=1)
movie year ratings
108 Mechanic: Resurrection 2016 4.0
206 Warcraft 2016 4.0
106 Max Steel 2016 3.5
107 Me Before You 2016 4.5
I want to compare if the above movies are present in another df.
FILM Votes
0 Avengers: Age of Ultron (2015) 4170
1 Cinderella (2015) 950
2 Ant-Man (2015) 3000
3 Do You Believe? (2015) 350
4 Max Steel (2016) 560
I want something like this as my final output:
FILM votes
0 Max Steel 560
There are two ways:
Get the row indices for partial matches, using FILM.str.startswith(title) or FILM.str.contains(title). Either of:
df1[ df1.movie.apply( lambda title: df2.FILM.str.startswith(title) ).any(axis=1) ]
df1[ df1['movie'].apply(lambda title: df2['FILM'].str.contains(title)).any(axis=1) ]
movie year ratings
106 Max Steel 2016 3.5
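One thing to keep in mind with str.contains is that it treats the pattern as a regular expression by default, so titles containing characters such as ? or ( are not compared literally. A small sketch of a literal comparison using regex=False follows; the frames are cut-down versions of the ones in the question with one extra invented row, and has_match is a hypothetical helper name.

```python
import pandas as pd

# Cut-down sample frames; "Do You Believe?" is added to df1 for illustration.
df1 = pd.DataFrame({"movie": ["Warcraft", "Max Steel", "Do You Believe?"],
                    "year": [2016, 2016, 2015]})
df2 = pd.DataFrame({"FILM": ["Cinderella (2015)", "Do You Believe? (2015)",
                             "Max Steel (2016)"],
                    "Votes": [950, 350, 560]})

def has_match(title):
    """Return True if any FILM entry contains the title as a literal substring."""
    return bool(df2["FILM"].str.contains(title, regex=False).any())

matched = df1[df1["movie"].apply(has_match)]
print(matched)
```

str.startswith always compares literally, so this concern only applies to the contains variant.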
Alternatively, you can use merge() if you first split the compound string column df2['FILM'] into its two components, the movie title and the (year).
# see code at bottom to recreate your dataframes
df2[['movie','year']] = df2.FILM.str.extract(r'([^\(]*) \(([0-9]*)\)')
# reorder columns and drop 'FILM' now we have its subfields 'movie','year'
df2 = df2[['movie','year','Votes']]
df2['year'] = df2['year'].astype(int)
df2.merge(df1)
movie year Votes ratings
0 Max Steel 2016 560 3.5
(Acknowledging much help from @user3483203 here and in the Python chat room)
Code to recreate dataframes:
import pandas as pd
from io import StringIO
dat1 = """movie  year  ratings
108  Mechanic: Resurrection  2016  4.0
206  Warcraft  2016  4.0
106  Max Steel  2016  3.5
107  Me Before You  2016  4.5"""
dat2 = """FILM  Votes
0  Avengers: Age of Ultron (2015)  4170
1  Cinderella (2015)  950
2  Ant-Man (2015)  3000
3  Do You Believe? (2015)  350
4  Max Steel (2016)  560"""
# Columns are separated by two or more spaces, so titles containing single spaces stay intact
df1 = pd.read_csv(StringIO(dat1), sep=r'\s{2,}', engine='python', index_col=0)
df2 = pd.read_csv(StringIO(dat2), sep=r'\s{2,}', engine='python')
Given input dataframes df1 and df2, you can use Boolean indexing via pd.Series.isin. To align the format of the movie strings you need to first concatenate movie and year from df1:
s = df1['movie'] + ' (' + df1['year'].astype(str) + ')'
res = df2[df2['FILM'].isin(s)]
print(res)
FILM Votes
4 Max Steel (2016) 560
smci's option 1 is nearly there; the following worked for me:
df1['Votes'] = ''
df1['Votes'] = df1['movie'].apply(lambda title: next(iter(df2[df2['FILM'].str.startswith(title)]['Votes']), None))
Explanation:
Create a Votes column in df1
Apply a lambda to every movie string in df1
The lambda looks up df2, selecting all rows in df2 where FILM starts with the movie title
Select the Votes column of the resulting subset of df2
Take the first value in this column with next(iter(...), None), which yields None when nothing matches
