I have a df with country-year data from 2000-2020, where various columns contain the sum total of given events in each country-year unit. In some countries the event only happened in some of the years, so there are no rows for the remaining years; I would like to add those rows with a "0" in every column.
country   iyear   nwound   Med   claimed
Nigeria   2000    2        5     7
Nigeria   2001    3        15    9
Nigeria   2005    4        6     14
Nigeria   2017    9        41    20
Benin     2004    2        5     7
Benin     2008    3        15    9
Benin     2010    4        6     14
Benin     2019    9        41    20
In short, I'm looking for a way to add the missing rows for all the years 2000-2020 for Nigeria and Benin (and all the other countries not listed), with each value in the row (for nwound, Med and claimed) being 0. Keep in mind, this data set has 18 countries in it, so I would want the code to be reproducible.
Use the reindex method from pandas:
import pandas as pd

df = pd.DataFrame({'country': ['Nigeria', 'Nigeria', 'Nigeria', 'Nigeria', 'Benin', 'Benin', 'Benin', 'Benin'],
                   'iyear': [2000, 2001, 2005, 2017, 2004, 2008, 2010, 2019],
                   'nwound': [2, 3, 4, 9, 2, 3, 4, 9],
                   'Med': [5, 15, 6, 41, 5, 15, 6, 41],
                   'claimed': [7, 9, 14, 20, 7, 9, 14, 20]})

# Index by (country, iyear) so we can reindex against the full grid
df = df.set_index(['country', 'iyear'])

# Build the complete (country, year) grid for 2000-2020
countries = df.index.levels[0].tolist()
index = pd.MultiIndex.from_product([countries, range(2000, 2021)], names=['country', 'iyear'])

# reindex fills every missing country-year row with 0
df = df.reindex(index, fill_value=0)
df = df.reset_index()
print(df)
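As a quick sanity check (assuming the full data set really does contain 18 countries), the reindexed frame should have exactly one row per country per year:

# Hypothetical check: 21 years (2000-2020) for each country
n_countries = df['country'].nunique()
assert len(df) == n_countries * 21  # 18 * 21 = 378 rows for 18 countries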
I'm working on an assignment for a course and I'm running into an issue with my dataframe. I made the changes as requested, but when I go to display my new dataframe, it shows only the headers.
These are the requirements of the assignment:
Load the data file using pandas
Check for null values in the data.
Drop records with nulls in any of the columns
Size column has sizes in kb as well as mb. To analyze, you'll need to convert them to numeric.
(The data set has "M", "k" and "Varies with device" showing up in this column, so I removed them.)
Price field is a string and has $ symbol. Remove the $ symbol and convert to numeric.
Average rating should be between 1 and 5, as only these values are allowed. Drop the rows that have a value outside this range.
For apps marked "Free" in the Type column, drop these rows.
Here is my code:
import pandas as pd
import numpy as np
ds = pd.read_csv('googleplaystore.csv')
headers = pd.DataFrame(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'])
ds['Size'] = ds['Size'].replace("Varies with Device", np.nan, inplace = True)
ds =ds.dropna()
ds['Size'] = ds['Size'].str.replace("M", "", regex = True)
ds['Size'] = ds['Size'].str.replace("k", "", regex = True)
ds['Size'] = ds['Size'].astype(float)
ds['Installs'] = ds['Installs'].str.replace("+", '', regex = True)
ds['Installs'] = ds['Installs'].astype(int)
ds['Reviews'] = ds['Reviews'].astype(float)
ds['Price'] = ds['Price'].str.replace("$", "", regex = True)
ds['Price'] = ds['Price'].astype(float)
indexrating = ds[(ds['Rating'] >= 1) & (ds['Rating'] <= 5)].index
ds.drop(indexrating, inplace = True)
ds['Type']= ds['Type'].replace("Free", np.nan, inplace = True)
ds =ds.dropna()
display(ds)
I was expecting a new dataframe to display, with the offending rows dropped.
You need to handle the values that end with 'M' or 'k' as well as the "Varies with device" entries. Looking at the last character of each Size value shows all three cases:
>>> df['Size'].str[-1].value_counts()
M    7466  # ends with 'M'
e    1637  # "Varies with device" (ends with 'e')
k     257  # ends with 'k'
Name: Size, dtype: int64
Try this version (the # n comments refer to your requirement numbers):
df = pd.read_csv('googleplaystore.csv')  # 1
df = df.dropna()  # 3
df['Size'] = df['Size'].str.extract(r'(\d+\.?\d)', expand=False).astype(float) * df['Size'].str[-1].replace({'M': 1024, 'k': 1})  # 4: size in kB
df = df.dropna()  # remove the NaN rows left by "Varies with device"
df['Price'] = df['Price'].str.strip('$').astype(float)  # 5
df = df.loc[df['Rating'].between(1, 5)]  # 6
df = df.loc[df['Type'] != 'Free']  # 7
Output:
>>> df
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
234 TurboScan: scan documents and receipts in PDF BUSINESS 4.7 11442 6963.2 100,000+ Paid 4.99 Everyone Business March 25, 2018 1.5.2 4.0 and up
235 Tiny Scanner Pro: PDF Doc Scan BUSINESS 4.8 10295 39936.0 100,000+ Paid 4.99 Everyone Business April 11, 2017 3.4.6 3.0 and up
290 TurboScan: scan documents and receipts in PDF BUSINESS 4.7 11442 6963.2 100,000+ Paid 4.99 Everyone Business March 25, 2018 1.5.2 4.0 and up
291 Tiny Scanner Pro: PDF Doc Scan BUSINESS 4.8 10295 39936.0 100,000+ Paid 4.99 Everyone Business April 11, 2017 3.4.6 3.0 and up
477 Calculator DATING 2.6 57 6348.8 1,000+ Paid 6.99 Everyone Dating October 25, 2017 1.1.6 4.0 and up
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10690 FO Bixby PERSONALIZATION 5.0 5 861.0 100+ Paid 0.99 Everyone Personalization April 25, 2018 0.2 7.0 and up
10697 Mu.F.O. GAME 5.0 2 16384.0 1+ Paid 0.99 Everyone Arcade March 3, 2017 1.0 2.3 and up
10760 Fast Tract Diet HEALTH_AND_FITNESS 4.4 35 2457.6 1,000+ Paid 7.99 Everyone Health & Fitness August 8, 2018 1.9.3 4.2 and up
10782 Trine 2: Complete Story GAME 3.8 252 11264.0 10,000+ Paid 16.99 Teen Action February 27, 2015 2.22 5.0 and up
10785 sugar, sugar FAMILY 4.2 1405 9728.0 10,000+ Paid 1.20 Everyone Puzzle June 5, 2018 2.7 2.3 and up
[577 rows x 13 columns]
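As an aside, this is why the original code displayed only headers: assigning the result of an inplace operation stores None (so ds['Size'] and ds['Type'] were wiped out before dropna() removed every row), and the rating filter was inverted, since it selected the rows between 1 and 5 and then dropped exactly those. A minimal sketch of the inplace pitfall:

import numpy as np
import pandas as pd

s = pd.Series(['Free', 'Paid'])
result = s.replace('Free', np.nan, inplace=True)
print(result)  # None -- inplace methods return None, not the modified Series
print(s)       # the Series itself was modified in place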
I am scraping the list of US presidents using Beautiful Soup and requests. I want to scrape both dates, for example the start of the presidency and the end of the presidency, but for some reason it's raising a "list index out of range" error. I'll provide the link so you can understand better.
Website link: https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = BeautifulSoup(page_html , 'html.parser' )
containers = page_soup.find_all('table' , class_ = 'wikitable')
#print(containers[0])
#print(len(containers))
#print(soup.prettify(containers[0]))
container = containers[0]
date =container.find_all('span' , attrs = {'class': 'date'})
#print(len(date))
#print(date[0].text)
for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    print(date_container[0].text)
The find_all function can return an empty list, and indexing an empty list with date_container[0] is exactly what raises the error.
You can simply collect the dates instead of indexing directly:
all_dates = []
for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    all_dates.extend([date.text for date in date_container])
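A quick check of the collected values (the first entries come from the first table, matching the comprehension output below):

print(all_dates[:3])
# ['April 30, 1789', 'March 4, 1797', 'March 4, 1797']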
Since your last lines already store all the date spans from the first "wikitable" table, you can build the same list with a list comprehension:
date = [x.text for x in container.find_all('span', attrs={'class': 'date'})]
print(date)
Which will print:
['April 30, 1789', 'March 4, 1797', 'March 4, 1797', 'March 4, 1801', 'March 4, 1801'...
Since the page uses <table> tags, have you considered pandas' .read_html()? It uses BeautifulSoup under the hood, takes a lot of the work out, and puts the data straight into a dataframe for you. The only work then needed is any manipulation or cleanup/filtering:
import pandas as pd
import re
my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'
# Returns a list of dataframes
dfs = pd.read_html(my_url)
# Get the specific dataframe with the desired columns
df = dfs[1].iloc[:,[1,3]]
# Rename the columns
df.columns = ['Date','Name']
# Split the date column into start and end dates and drop the date column
df[['Start','End']] = df.Date.str.split('–', expand=True)
df = df.drop('Date',axis=1)
# Clean up the name column using regex to pull out the name
df['Name'] = [re.match(r'.+?(?=\d)', x)[0].strip().split('Born')[0] for x in df['Name']]
# Drop duplicate rows
df.drop_duplicates(inplace = True)
print (df)
Output:
print (df.to_string())
Name Start End
0 George Washington April 30, 1789[d] March 4, 1797
1 John Adams March 4, 1797 March 4, 1801
2 Thomas Jefferson March 4, 1801 March 4, 1809
3 James Madison March 4, 1809 March 4, 1817
4 James Monroe March 4, 1817 March 4, 1825
5 John Quincy Adams March 4, 1825 March 4, 1829
6 Andrew Jackson March 4, 1829 March 4, 1837
7 Martin Van Buren March 4, 1837 March 4, 1841
8 William Henry Harrison March 4, 1841 April 4, 1841(Died in office)
9 John Tyler April 4, 1841[i] March 4, 1845
10 James K. Polk March 4, 1845 March 4, 1849
11 Zachary Taylor March 4, 1849 July 9, 1850(Died in office)
12 Millard Fillmore July 9, 1850[k] March 4, 1853
13 Franklin Pierce March 4, 1853 March 4, 1857
14 James Buchanan March 4, 1857 March 4, 1861
15 Abraham Lincoln March 4, 1861 April 15, 1865(Assassinated)
16 Andrew Johnson April 15, 1865 March 4, 1869
17 Ulysses S. Grant March 4, 1869 March 4, 1877
18 Rutherford B. Hayes March 4, 1877 March 4, 1881
19 James A. Garfield March 4, 1881 September 19, 1881(Assassinated)
20 Chester A. Arthur September 19, 1881[n] March 4, 1885
21 Grover Cleveland March 4, 1885 March 4, 1889
22 Benjamin Harrison March 4, 1889 March 4, 1893
23 Grover Cleveland March 4, 1893 March 4, 1897
24 William McKinley March 4, 1897 September 14, 1901(Assassinated)
25 Theodore Roosevelt September 14, 1901 March 4, 1909
26 William Howard Taft March 4, 1909 March 4, 1913
27 Woodrow Wilson March 4, 1913 March 4, 1921
28 Warren G. Harding March 4, 1921 August 2, 1923(Died in office)
29 Calvin Coolidge August 2, 1923[o] March 4, 1929
30 Herbert Hoover March 4, 1929 March 4, 1933
31 Franklin D. Roosevelt March 4, 1933 April 12, 1945(Died in office)
32 Harry S. Truman April 12, 1945 January 20, 1953
33 Dwight D. Eisenhower January 20, 1953 January 20, 1961
34 John F. Kennedy January 20, 1961 November 22, 1963(Assassinated)
35 Lyndon B. Johnson November 22, 1963 January 20, 1969
36 Richard Nixon January 20, 1969 August 9, 1974(Resigned)
37 Gerald Ford August 9, 1974 January 20, 1977
38 Jimmy Carter January 20, 1977 January 20, 1981
39 Ronald Reagan January 20, 1981 January 20, 1989
40 George H. W. Bush January 20, 1989 January 20, 1993
41 Bill Clinton January 20, 1993 January 20, 2001
42 George W. Bush January 20, 2001 January 20, 2009
43 Barack Obama January 20, 2009 January 20, 2017
44 Donald Trump January 20, 2017 Incumbent
Background
I have five years of NO2 measurement data in CSV files, one file for every location and year. I have loaded all the files into pandas dataframes in the same format:
Date Hour Location NO2_Level
0 01/01/2016 00 Street 18
1 01/01/2016 01 Street 39
2 01/01/2016 02 Street 129
3 01/01/2016 03 Street 76
4 01/01/2016 04 Street 40
Goal
For each dataframe, count the number of times NO2_Level is greater than 150 and output this.
I wrote a loop that creates all the dataframes from the right directories and cleans them appropriately.
Problem
Whatever I've tried produces results I know on inspection are incorrect, e.g.:
- the count value for every location in a given year is the same (possible, but unlikely)
- for a year where I know the count should be some positive number, every location returns 0
What I've tried
I have tried a lot of approaches to get this value for each dataframe, such as making the column a series:
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
Using .count():
count = df[df['NO2_Level'] >= 150].count()
These two approaches have gotten closest to what I want to output.
Example to test on
data = {'Date': ['01/01/2016', '01/02/2016', '01/03/2016', '01/04/2016', '01/05/2016'],
        'Hour': ['00', '01', '02', '03', '04'],
        'Location': ['Street', 'Street', 'Street', 'Street', 'Street'],
        'NO2_Level': [18, 39, 129, 76, 40]}
df = pd.DataFrame(data=data)
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
count
Expected Outputs
So I'm trying to get it to output a single line for each dataframe, in the format Location, year, count (of the condition):
Kirkstall Road,2013,47
Haslewood Close,2013,97
...
Jack Lane Hunslet,2015,158
So the above example would produce
Street, 2016, 1
Actual
Every year produces the same result for each location, and for some years (2014) the count doesn't seem to work at all, even though on inspection there should be positive counts:
Kirkstall Road,2013,47
Haslewood Close,2013,47
Tilbury Terrace,2013,47
Corn Exchange,2013,47
Temple Newsam,2014,0
Queen Street Morley,2014,0
Corn Exchange,2014,0
Tilbury Terrace,2014,0
Haslewood Close,2015,43
Tilbury Terrace,2015,43
Corn Exchange,2015,43
Jack Lane Hunslet,2015,43
Norman Rows,2015,43
Hopefully this helps.
import pandas as pd

ddict = {
    'Date': ['2016-01-01', '2016-01-01', '2016-01-01', '2016-01-01', '2016-01-01', '2016-01-02'],
    'Hour': ['00', '01', '02', '03', '04', '02'],
    'Location': ['Street', 'Street', 'Street', 'Street', 'Street', 'Street'],
    'N02_Level': [19, 39, 129, 76, 40, 151],
}
df = pd.DataFrame(ddict)

# Convert dates to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Make a Year column
df['Year'] = df['Date'].apply(lambda x: x.strftime('%Y'))

# Group by Location and Year, count rows where N02_Level > 150
df1 = df[df['N02_Level'] > 150].groupby(['Location', 'Year']).size().reset_index(name='Count')

# Iterate the results
for i in range(len(df1)):
    loc = df1['Location'][i]
    yr = df1['Year'][i]
    cnt = df1['Count'][i]
    print(f'{loc},{yr},{cnt}')

### To not use f-strings
for i in range(len(df1)):
    print('{loc},{yr},{cnt}'.format(loc=df1['Location'][i], yr=df1['Year'][i], cnt=df1['Count'][i]))
Sample data:
Date Hour Location N02_Level
0 2016-01-01 00 Street 19
1 2016-01-01 01 Street 39
2 2016-01-01 02 Street 129
3 2016-01-01 03 Street 76
4 2016-01-01 04 Street 40
5 2016-01-02 02 Street 151
Output:
Street,2016,1
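Since you have one file per location and year, a per-file version of the same idea might look like this. This is only a sketch: the data/ directory, the *.csv pattern, and the day-first date format are assumptions, so adjust them to your layout:

import glob

import pandas as pd

# Hypothetical layout: one CSV per location-year in data/
for path in sorted(glob.glob('data/*.csv')):
    df = pd.read_csv(path)
    # Dates like 01/01/2016 are assumed day-first here
    df['Year'] = pd.to_datetime(df['Date'], dayfirst=True).dt.year
    counts = df[df['NO2_Level'] > 150].groupby(['Location', 'Year']).size()
    for (loc, yr), cnt in counts.items():
        print(f'{loc},{yr},{cnt}')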
Here is a solution with a randomly generated sample:
import numpy as np
import pandas as pd

def random_dates(start, end, n):
    # Draw n random timestamps between start and end (second resolution)
    start_u = start.value // 10 ** 9
    end_u = end.value // 10 ** 9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

location = ['street', 'avenue', 'road', 'town', 'campaign']
df = pd.DataFrame({'Date': random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-12-31'), 20),
                   'Location': np.random.choice(location, 20),
                   'NOE_level': np.random.randint(low=130, high=200, size=20)})

# Keep only the year for Date
df['Date'] = df['Date'].dt.strftime("%Y")
print(df)

df = df.groupby(['Location', 'Date'])['NOE_level'].apply(lambda x: (x > 150).sum()).reset_index(name='count')
print(df)
Example df generated:
Date Location NOE_level
0 2018 town 191
1 2017 campaign 187
2 2017 town 137
3 2016 avenue 148
4 2017 campaign 195
5 2018 town 181
6 2018 road 187
7 2018 town 184
8 2016 town 155
9 2016 street 183
10 2018 road 136
11 2017 road 171
12 2018 street 165
13 2015 avenue 193
14 2016 campaign 170
15 2016 street 132
16 2016 campaign 165
17 2015 road 161
18 2018 road 161
19 2015 road 140
output:
Location Date count
0 avenue 2015 1
1 avenue 2016 0
2 campaign 2016 2
3 campaign 2017 2
4 road 2015 1
5 road 2017 1
6 road 2018 2
7 street 2016 1
8 street 2018 1
9 town 2016 1
10 town 2017 0
11 town 2018 3
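An equivalent formulation (a sketch against the same generated frame, applied before the final groupby reassignment) avoids the lambda by summing a boolean column, which groupby aggregates directly:

# True counts as 1 and False as 0 when summed
df['over_150'] = df['NOE_level'] > 150
out = df.groupby(['Location', 'Date'])['over_150'].sum().reset_index(name='count')
print(out)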
You can use this to create the dataframe:
xyz = pd.DataFrame({'release': ['7 June 2013', '2012', '31 January 2013',
                                'February 2008', '17 June 2014', '2013']})
I am trying to split the data and save it into 3 columns named "day", "month" and "year", using this command:
dataframe[['day','month','year']] = dataframe['release'].str.rsplit(expand=True)
The resulting dataframe is:
[screenshot of the resulting dataframe]
As you can see, it works perfectly when there are 3 tokens, but whenever there are fewer than 3, the data is saved in the wrong place.
I have tried split and rsplit; both give the same result.
Any solution to get the data in the right place?
The year is the last token and is present in every row, so it should be saved first; then the month, if present; and likewise the day.
You could reverse the split tokens, so the year (the only token always present) lands in the first column:
In [17]: dataframe[['year', 'month', 'day']] = dataframe['release'].apply(
    ...:     lambda x: pd.Series(x.split()[::-1]))
In [18]: dataframe
Out[18]:
release year month day
0 7 June 2013 2013 June 7
1 2012 2012 NaN NaN
2 31 January 2013 2013 January 31
3 February 2008 2008 February NaN
4 17 June 2014 2014 June 17
5 2013 2013 NaN NaN
Try reversing each row's tokens:
dataframe[['year','month','day']] = dataframe['release'].str.split().apply(lambda x: pd.Series(x[::-1]))