Python Web Scraping and Pandas Dataframe

I am relatively new to Python and completely new to webscraping, but I am trying to gather data from this website:
https://www.usclimatedata.com/climate/cumming/georgia/united-states/usga1415
I want to grab the info from the tables from Jan-Dec and put it into a Pandas data frame and print it back to the user. I plan on doing some more stuff with the data like computing my own averages and means/medians etc., but I am struggling with getting the data initially. Any help would be appreciated!!

If you are getting data from files, you can use x = pd.read_csv(...) (or the reader that matches the file extension you use instead of csv) and then print(x).

First check the website's terms of service and robots.txt to see whether you are allowed to scrape the page.
If it is, then you can use bs4's BeautifulSoup package to scrape the web page.
def get_state_holiday_data(self, year: int, state_name: str) -> pd.DataFrame:
    try:
        pagecontent = self.get_page_content(year, state_name)
        holiday_table_list = []
        for table in pagecontent.findAll("table"):
            for tbody in table.findAll("tbody"):
                for row in tbody.findAll("tr"):
                    holiday_row_list = []
                    if len(row.findAll("td")) == 3:
                        for cell_data in row.findAll("td"):
                            holiday_row_list.append(cell_data.find(text=True).replace('\n', '').strip(' '))
                        holiday_table_list.append(holiday_row_list)
            break
        state_holiday_df = pd.DataFrame.from_records(holiday_table_list, columns=['Date', 'Day', 'Holiday'])
        state_holiday_df['Date'] = state_holiday_df['Date'].apply(
            lambda date: str(year) + '-' + datetime.strptime(date, '%d %b').strftime('%m-%d'))
        del state_holiday_df['Day']
        return state_holiday_df
    except Exception as e:
        raise e
Above is sample code that scrapes a table and converts it to a dataframe; table, tbody, tr and td are the HTML tag names being searched for (not element ids).
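For the page in the original question, you may not even need to walk the HTML by hand: pandas.read_html can pull every table on the page into a list of DataFrames. Below is a minimal sketch, assuming the site permits scraping and that the Jan-Dec tables are present in the static HTML (the page may also fill parts of the data with JavaScript, in which case this will only return what is in the raw HTML):

import pandas as pd
import requests

url = "https://www.usclimatedata.com/climate/cumming/georgia/united-states/usga1415"
# some sites reject the default client, so send a browser-like User-Agent header
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

# read_html needs lxml or html5lib installed; it returns one DataFrame per table found
tables = pd.read_html(html)
for i, table in enumerate(tables):
    print(f"table {i}: {table.shape}")
    print(table.head())

From there you can pick the monthly tables out of the list and compute your own averages and medians on the resulting DataFrames.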

Related

How to scrape website with mouseover events and no class on most elements

I am trying to scrape a table with data off a website with mouse-over color changing events and on each row only the first two columns have a class:
Additionally, when I do try to scrape those first two rows I get the output as such:
This is the code I am running.
lists = soup.find_all('table', class_= "report")
for list in lists:
    date = list.find_all('td', class_= "time")
    flag = list.find_all('td', class_= "flag")
    info = [date, flag]
    print(info)
I was expecting to receive only the numerical values so that I can export them and work with them.
I tried to use the .replace() function but it didn't remove anything.
I was unable to use .text even after converting date and flag to strings.
Notes
It might be better not to use list as a variable name, since it already means something in Python.
Also, I can't test any of the suggestions below without having access to your HTML. It would be helpful if you include how you fetched the HTML to parse with BeautifulSoup. (Did you use some form of requests? Or something like selenium? Or do you just have the HTML file?) With requests or even selenium, sometimes the HTML fetched is not what you expected, so that's another issue...
The Reason behind the Issue
You can't apply .text/.get_text to the ResultSets (lists) that .find_all and .select return. You can apply .text/.get_text to Tags like the ones returned by .find or .select_one (but you should first check that None was not returned, which happens when nothing is found).
So date[0].get_text() might have returned something, but since you probably want all the dates and flags, that's not the solution.
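To make the distinction concrete, here is a tiny self-contained sketch (the HTML snippet is made up purely for illustration):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<td class="time">10:30</td><td class="time">11:45</td>', 'html.parser')

cells = soup.find_all('td', class_="time")  # ResultSet (a list of Tags) -> has no .get_text()
first = soup.find('td', class_="time")      # a single Tag, or None if nothing matched

print([c.get_text() for c in cells])        # ['10:30', '11:45'] -- call .get_text() per Tag
print(first.get_text() if first else None)  # '10:30'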
Solution Part 1 Option A: Getting the Rows with .find... chain
Instead of iterating through tables directly, you need to get the rows first (tr tags) before trying to get at the cells (td tags); if you have only one table with class= "report", you could just do something like:
rows = soup.find('table', class_= "report").find_all('tr', {'onmouseover': True})
But it's risky to chain multiple .find...s like that, because an error will be raised if any of them return None before reaching the last one.
Solution Part 1 Option B: Getting the Rows More Safely with .find...
It would be safer to do something like
table = soup.find('table', class_= "report")
rows = table.find_all('tr', {'onmouseover': True}) if table else []
or, if there might be more than one table with class= "report",
rows = []
for table in soup.find_all('table', class_= "report"):
    rows += table.find_all('tr', {'onmouseover': True})
Solution Part 1 Option C: Getting the Rows with .select
However, I think the most convenient way is to use .select with CSS selectors
# if there might be other tables with report AND other classes that you don't want:
# rows = soup.select('table[class="report"] tr[onmouseover]')
rows = soup.select('table.report tr[onmouseover]')
This method is only unsuitable if there might be more than one table with class= "report" but you only want rows from the first one; in that case, you might prefer the table.find_... if table else [] approach.
Solution Part 2 Option A: Iterating Over rows to Print Cell Contents
Once you have rows, you can iterate over them to print the date and flag cell contents:
for r in rows:
    date, flag = r.find('td', class_= "time"), r.find('td', class_= "flag")
    info = [i.get_text() if i else i for i in [date, flag]]
    if any([i is not None for i in info]): print(info)
    # [ only prints if there are any non-null values ]
Solution Part 2 Option B: Iterating Over rows with a Function
Btw, if you're going to be extracting multiple tag attributes/texts repeatedly, you might find my selectForList function useful - it could have been used like
for r in rows:
    info = selectForList(r, ['td.time', 'td.flag'], printList=True)
    if any([i is not None for i in info]): print(info)
or, to get a list of dictionaries like [{'time': time_1, 'flag': flag_1}, {'time': time_2, 'flag': flag_2}, ...],
infoList = [selectForList(r, {
    'time': 'td.time', 'flag': 'td.flag', ## add selectors for any info you want
}) for r in soup.select('table.report tr[onmouseover]')]
infoList = [d for d in infoList if [v for v in d.values() if v is not None]]
Added EDIT:
To get all displayed cell contents:
for r in rows:
    info = [(td.get_text(), ''.join(td.get('style', '').split())) for td in r.select('td')]
    info = [txt.strip() for txt, styl in info if 'display:none' not in styl]
    # info = [i if i else 0 for i in info] # fill blanks with zeroes instead of ''
    if any(info): print(info) ## or write to CSV
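If you do want the write-to-CSV route mentioned in that last comment, a minimal sketch with the standard csv module could look like this (the filename and header row are just placeholders, and rows is the list obtained above):

import csv

with open('report_rows.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['time', 'flag'])  # placeholder header
    for r in rows:
        cells = [(td.get_text(), ''.join(td.get('style', '').split())) for td in r.select('td')]
        visible = [txt.strip() for txt, styl in cells if 'display:none' not in styl]
        if any(visible):
            writer.writerow(visible)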

how to loop through a column of IDs to construct urls and put those urls in a list?

I am trying to learn Python for data analysis/data science. I'm working on a project where I would be webscraping key movie information (director, original language, budget, revenue, etc.) off of TMDb and IMDb using bs4. I would like to do this for a list of various movies that I have rated and downloaded into a csv file. The csv file contains columns like "Type" and "TMDb ID" that are needed to construct the URLs that I want to scrape.
like so:
TMDb ID    IMDb ID      Type     Name
11282      tt0366551    movie    Harold & Kumar Go To White Castle
the URL would be
url = "https://api.themoviedb.org/3/"+ type + "/" + id + "?api_key=" + API_KEY + "&language=en-US/"
So I'm attempting to do this by iterating through the respective columns and constructing a URL from that, and using that list of URLs to webscrape. I got stuck on printing all the URLs correctly. Depending on if I put the print statement inside the for loop or outside of it, I either get:
the last URL in the csv file printed over and over again (109 of the same last URL) OR
the correct URLs except they each get printed the same amount of times as the length of the csv file (109 rows x 109 urls)
This is what I have so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.request import Request, urlopen
API_KEY = 'xxx'
tmdb_export = pd.read_csv('/Users/xxx/Downloads/xxx.csv')
tmdb_export.drop(['Season Number','Episode Number'], axis=1, inplace=True)
tmdb = tmdb_export['TMDb ID']
type = tmdb_export['Type']
urls = []
# pulls TMDb IDs from df column
for i, tmdbID in tmdb.iteritems():
    id = str(tmdbID)
    url = "https://api.themoviedb.org/3/"+ type + "/" + id + "?api_key=" + API_KEY + "&language=en-US/"
    urls.append(url)
    print(urls)
Do I have to include a nested for loop around urls.append(url)? What am I missing? I feel like this is a silly mistake I'm making because I have a hard time with for loops and understanding how they work, so I've decided to stop lurking on here and ask y'all for help! I'm open to any suggestions, guidance, explanations and advice that I can get. Thank you in advance!!
I would recommend that you convert the dataframe columns into lists,
for example:
id = list(df['id'])
t = list(df['t'])
Then zip the two lists and iterate over them:
for a, b in zip(id, t):
    # a is the TMDb ID and b is the Type value for the same row
    urls.append("https://api.themoviedb.org/3/" + b + "/" + str(a) + "?api_key=" + API_KEY + "&language=en-US/")
Try taking the print(urls) out of the loop:
for i, tmdbID in tmdb.iteritems():
    id = str(tmdbID)
    url = "https://api.themoviedb.org/3/"+ type + "/" + id + "?api_key=" + API_KEY + "&language=en-US/"
    urls.append(url)
print(urls)
You should not use type as a variable name, since it shadows Python's built-in type function.
A concise way of doing this is to use pandas' apply to build a URL column in the dataframe, then extract that column and cast it to a list.
def createUrl(tmdbID, Type):
    Tid = str(tmdbID)
    url = "https://api.themoviedb.org/3/" + Type + "/" + Tid + "?api_key=" + API_KEY + "&language=en-US/"
    return url

tmdb_export['URL'] = tmdb_export.apply(lambda x: createUrl(x['TMDb ID'], x['Type']), axis=1)
urls = list(tmdb_export['URL'])
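As a follow-up, the TMDb endpoints built above return JSON rather than HTML, so requests alone is enough for that part and no BeautifulSoup is needed. Here is a rough sketch of the next step; the field names are assumptions about the details response and may need adjusting against the TMDb documentation:

import requests
import pandas as pd

records = []
for url in urls:
    resp = requests.get(url)
    if resp.ok:
        data = resp.json()
        # assumed field names -- check them against the actual TMDb response for your titles
        records.append({
            'title': data.get('title') or data.get('name'),
            'original_language': data.get('original_language'),
            'budget': data.get('budget'),
            'revenue': data.get('revenue'),
        })

movies = pd.DataFrame(records)
print(movies.head())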

Optimizing multicriteria filtering for data with Pandas

I'm trying to filter data with Pandas using a list of value pairs, each made of a str book_title and an int book_price:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# settings_#############################################################################
isbn = {'9782756002484', '9782756025117', '9782756072449'}
url = 'https://www.abebooks.fr/servlet/SearchResults?sts=t&cm_sp=SearchF-_-NullResults-_-Results&isbn={}'
book_title = ["Mondes", "X-Wing"]
book_price = [100, 10]
#######################################################################################

### build the links from the ISBN codes
def url_isbn(isbn):
    merged = []
    for link in isbn:
        link_isbn = url.format(link)
        merged.append(link_isbn)
    return merged

### scrape each url from url_isbn
def get_data():
    data = []
    for i in url_isbn(isbn):
        r = requests.get(i)
        soup = BeautifulSoup(r.text, 'html.parser')
        item = soup.find_all('div', {'class': 'result-data col-xs-9 cf'})
        for x in item:
            title = x.find('h2', {'class': 'title'}).text.replace('\n', '')
            price = x.find('p', {'class': 'item-price'}).text.replace('EUR ', '').replace(',', '.')
            url = 'https://www.abebooks.fr' + x.find('a', {'itemprop': 'url'})['href']
            products = title, int(float(price)), url
            data.append(products)
    return data

### create the dataframe
df = pd.DataFrame(get_data(), columns=["Titre", "Prix", "URL"])

### filter the data in the dataframe
for filtered in df:
    df_final_to_email = filtered[(df['Titre'].str.contains(book_title) & (df.Prix < book_price))]
print(df_final_to_email)
I'm getting an error: TypeError: unhashable type: 'list'
I assume I cannot use a list for filtering because of the mix of data types; I tested with a tuple and a dict and get the same kind of error.
I also tried df.query but it gives an empty data frame.
The filter should keep all the books that have "Mondes" in the title with a price < 100, but also all the books that contain "X-Wing" with a price < 10; I'll also add more (title, price) pairs to look for later.
Titre                          Prix
Mondes Infernaux               95,10
Star Wars, Mondes Infernaux    75,50
X-Wing Rogue                   9,50
X-Wing Rogue Squadron          7,50
Nothing to do with the filtering, but do you know how I could handle the following: products = title, int(float(price)), url ? I had to go through float because int(price) fails on the raw string, and I'm a bit annoyed at having the prices rounded down in the dataframe. (If any moderator can tell me whether I should open another post for this specific need, thank you.)
Thank you for your kind help
The error resides in your filtering code:
df_final_to_email = filtered[(df['Titre'].str.contains(book_title) & (df.Prix < book_price))]
book_title is a list. .str.contains does not work with list. It works with a single string or a regex pattern.
If your intention is to find books with "Mondes" in the title and price under 100 or "X-Wing" in the title and price under 10, you can use the following filtering code:
### filter the data in the dataframe
cond = pd.Series([False] * len(df), index=df.index)
for title, price in zip(book_title, book_price):
    cond |= df["Titre"].str.contains(title) & df["Prix"].lt(price)
print(df[cond])
How it works:
We start by selecting no rows (cond is all False).
For each title and price, we evaluate each row to see if it meets the criteria. A row only needs to match one (title, price) pair from the list, so we use the in-place or operator (|=) to update the selection.
The |= operator is equivalent to:
cond = cond | (df["Titre"].str.contains(title) & df["Prix"].lt(price))
If you need all matching rows in the dataframe there's no need to use a for loop.
Maybe try something like this:
def find_book(substr, price):
    return df[(df['Titre'].str.contains(substr)) & (df['Prix'] < price)]

# find all books containing the substring 'Wing' in the title with price < 7
find_book('Wing', 7)
I believe this piece of code filters the dataframe as you want (not tested):
df_final_to_email = pd.concat([df.loc[df["Titre"].str.contains(t) & df["Prix"].lt(p)]
                               for t, p in zip(book_title, book_price)])

Scraper To Copy Articles In Bulk

I'm working on an AI project, and one of the steps is to get ~5,000 articles from an online outlet.
I'm a beginner programmer, so please be kind. I've found a site that is very easy to scrape from, in terms of URL structure - I just need a scraper that can take an entire article from a site (we will be analyzing the articles in bulk, with AI).
The div containing the article text for each piece is the same across the entire site - "col-md-12 description-content-wrap".
Does anyone know a simple Python script that would simply go through a .CSV of URLs, pull the text from the above-listed div of each article, and output it as plain text? I've found a few solutions, but none are 100% what I need.
Ideally all of the 5,000 articles would be outputted in one file, but if they need to each be separate, that's fine too. Thanks in advance!
I did something a little bit similar to this about a week ago. Here is the code that I came up with.
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd  # needed for pd.merge below
from pandas import DataFrame

resp = urllib.request.urlopen("https://www.cnbc.com/finance/")
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
substring = 'https://www.cnbc.com/'
df = ['review']

for link in soup.find_all('a', href=True):
    #print(link['href'])
    if (link['href'].find(substring) == 0):
        # append
        df.append(link['href'])
        #print(link['href'])

#list(df)
# convert list to data frame
df = DataFrame(df)
#type(df)
#list(df)
# add column name
df.columns = ['review']

from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
df['sentiment'] = df['review'].apply(lambda x: sid.polarity_scores(x))

def convert(x):
    if x < 0:
        return "negative"
    elif x > .2:
        return "positive"
    else:
        return "neutral"

df['result'] = df['sentiment'].apply(lambda x: convert(x['compound']))
df['result']
df_final = pd.merge(df['review'], df['result'], left_index=True, right_index=True)
df_final
df_final.to_csv('C:\\Users\\ryans\\OneDrive\\Desktop\\out.csv')
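For the original requirement, looping through a .CSV of URLs, grabbing the text inside the "col-md-12 description-content-wrap" div, and writing it all to one plain-text file, a minimal sketch could look like the one below. It assumes the CSV has a column named url and that the div class is present on every article page; both are assumptions to adapt to your data:

import requests
import pandas as pd
from bs4 import BeautifulSoup

urls = pd.read_csv('articles.csv')['url']  # assumption: the CSV has a 'url' column

with open('articles.txt', 'w', encoding='utf-8') as out:
    for url in urls:
        resp = requests.get(url)
        soup = BeautifulSoup(resp.text, 'html.parser')
        div = soup.select_one('div.col-md-12.description-content-wrap')
        if div:  # skip pages where the expected div is missing
            out.write(div.get_text(separator='\n', strip=True))
            out.write('\n\n---\n\n')  # simple separator between articles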

How to parse hundreds of websites with same JSON data type using Python?

I am new to Python and currently working on a project that requires me to extract data from hundreds of websites that contain JSON data. I managed to scrape data from one website but have no idea how to scrape all the websites at once. Below is my code.
import openpyxl
import requests
import pandas as pd
import simplejson as json
url="https://ws-public.interpol.int/notices/v1/red?ageMin=45&ageMax=60&arrestWarrantCountryId=US&resultPerPage=20&page=1"
response=requests.get(url)
response.raise_for_status()
data=response.json()['_embedded']['notices']
list=[]
for item in data:
    result={"forename":None,"date_of_birth":None,"nationalities":None,"name":None}
    result["forename"] = item["forename"]
    result["date_of_birth"]=item["date_of_birth"]
    result["nationalities"] = item["nationalities"]
    result["name"] = item["name"]
    list.append(result)
#print(list)
df=pd.DataFrame(list)
df.to_excel("test.xlsx")
Example of other websites:
https://ws-public.interpol.int/notices/v1/red?arrestWarrantCountryId=BA&resultPerPage=20&page=5, https://ws-public.interpol.int/notices/v1/red?arrestWarrantCountryId=BA&resultPerPage=20&page=1,
I think this will work for you. You'll have to either add the URLs manually or write some logic to get them. I also noticed the JSON response includes the URL of the next page, so you could keep a list of the first pages and crawl through from there, unless you can get all the results in one response. I don't have Excel installed, so I used CSV instead, but it should work the same way:
import requests
import pandas as pd
urls = [
    'https://ws-public.interpol.int/notices/v1/red?ageMin=45&ageMax=60&arrestWarrantCountryId=US&resultPerPage=20&page=1',
    'https://ws-public.interpol.int/notices/v1/red?arrestWarrantCountryId=BA&resultPerPage=20&page=5',
    'https://ws-public.interpol.int/notices/v1/red?arrestWarrantCountryId=BA&resultPerPage=20&page=1',
    # add more urls here, you could also use a file to store these
    # you could also write some logic to get the urls but you'd need to specify that logic
]

def get_data(url):
    data = requests.get(url).json()['_embedded']['notices']
    # filter the returned fields
    return [{k: v for k, v in row.items()
             if k in ['forename', 'date_of_birth', 'nationalities', 'name']}
            for row in data]

df = pd.DataFrame()
# put the data from each url into one dataframe rather than a dictionary, for speed
for url in urls:
    print(f'Processing {url}')
    df = df.append(get_data(url))

# output to csv or whatever (I don't have excel installed so I did csv)
df.to_csv('data.csv')
# df.to_excel('data.xlsx')
Output (data.csv):
,forename,date_of_birth,nationalities,name
0,CARLOS LEOPOLDO,1971/10/31,['US'],ALVAREZ
1,MOHAMED ABDIAZIZ,1974/01/01,"['SO', 'ET']",KEROW
2,SEUXIS PAUCIS,1966/07/30,['CO'],HERNANDEZ-SOLARTE
3,JOHN G.,1966/10/20,"['PH', 'US']",PANALIGAN
4,SOFYAN ISKANDAR,1968/04/04,['ID'],NUGROHO
5,SOLOMON ANTHONY,1965/02/05,['TZ'],BANDIHO
6,ROLAND,1969/07/21,"['US', 'DE']",AGUILAR
7,FERNANDO,1972/07/25,['MX'],RODRIGUEZ
8,RAUL,1966/12/08,['US'],ORTEGA
9,DANIEL,1962/08/30,['US'],LEIJA
10,FRANCISCO,1961/10/23,['EC'],MARTINEZ
11,HORACIO CARLOS,1963/09/10,"['US', 'MX']",TERAN
12,FREDIS RENTERIA,1965/07/07,['CO'],TRUJILLO
13,JUAN EXEQUIEL,1968/08/18,['AR'],HEINZ
14,JIMMY JULIUS,1971/05/03,"['IL', 'US']",KAROW
15,JOHN,1959/10/28,['LY'],LOWRY
16,FIDEL,1959/07/25,['CO'],CASTRO MURILLO
17,EUDES,1968/12/20,['CO'],OJEDA OVANDO
18,BEJARNI,1968/07/12,"['US', 'NI']",RIVAS
19,DAVID,1973/12/02,['GT'],ALDANA
20,SLOBODAN,1952/10/02,['BA'],RIS
21,ALEN,1978/05/27,['BA'],DEMIROVIC
22,DRAGAN,1987/02/09,['ME'],GAJIC
23,JOZO,1968/03/03,"['HR', 'BA']",BRICO
24,ZHIYIN,1962/07/01,['CN'],XU
25,NOVAK,1955/04/10,['BA'],DUKIC
26,NEBOJSA,1973/01/08,['BA'],MILANOVIC
27,MURADIF,1960/04/12,['BA'],HAMZABEGOVIC
28,BOSKO,1940/11/25,"['RS', 'BA']",LUKIC
29,RATKO,1967/05/16,['BA'],SAMAC
30,BOGDAN,1973/04/05,['BA'],BOZIC
31,ZELJKO,1965/10/21,"['BA', 'HR']",RODIN
32,SASA,1973/04/19,['RS'],DUNOVIC
33,OBRAD,1964/03/10,['BA'],OZEGOVIC
34,SENAD,1981/03/01,['BA'],KAJTEZOVIC
35,MLADEN,1973/04/29,"['HR', 'BA']",MARKOVIC
36,PERO,1972/01/29,"['BA', 'HR']",MAJIC
37,MARCO,1968/04/12,"['BA', 'HR']",VIDOVIC
38,MIRSAD,1964/07/27,['HR'],SMAJIC
39,NIJAZ,1961/11/20,,SMAJIC
40,GOJKO,1959/10/08,['BA'],BORJAN
41,DUSAN,1954/06/25,"['RS', 'BA']",SPASOJEVIC
42,MIRSAD,1991/04/20,['BA'],CERIMOVIC
43,GORAN,1962/01/24,['BA'],TESIC
44,IZET,1970/09/18,"['RS', 'BA']",REDZOVIC
45,DRAGAN,1973/09/30,['BA'],STOJIC
46,MILOJKO,1962/05/19,"['BA', 'RS']",KOVACEVIC
47,DRAGAN,1971/11/07,"['RS', 'BA']",MARJANOVIC
48,ALEKSANDAR,1979/09/22,"['AT', 'BA']",RUZIC
49,MIRKO,1992/04/29,['BA'],ATELJEVIC
50,SLAVOJKA,1967/01/13,['BA'],MARINKOVIC
51,SLADAN,1968/03/09,"['BA', 'RS']",TASIC
52,ESED,1963/01/12,['BA'],ABDAGIC
53,DRAGOMIR,1954/01/29,"['RS', 'BA']",KEZUNOVIC
54,NEDZAD,1961/01/01,['BA'],KAHRIMANOVIC
55,NEVEN,1980/10/08,"['BA', 'SI']",STANIC
56,VISNJA,1972/04/12,"['RS', 'BA']",ACIMOVIC
57,MLADEN,1974/08/05,"['HR', 'DE', 'BA']",DZIDIC
58,IVICA,1964/12/23,"['BA', 'HR']",KOLOBARA
59,ZORAN,1963/11/08,"['BA', 'RS']",ADAMOVIC
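If you would rather follow the pagination the answer mentions instead of listing every page URL by hand, a rough sketch is below. It assumes the response exposes the next page under _links -> next -> href (typical of this kind of HAL-style API, but worth verifying), and it uses pd.concat because DataFrame.append has been removed in recent pandas versions:

import requests
import pandas as pd

def get_all_pages(first_url, max_pages=50):
    frames, url = [], first_url
    while url and len(frames) < max_pages:
        payload = requests.get(url).json()
        notices = payload['_embedded']['notices']
        frames.append(pd.DataFrame(
            [{k: row.get(k) for k in ['forename', 'date_of_birth', 'nationalities', 'name']}
             for row in notices]))
        # assumption: the link to the next page lives under _links -> next -> href
        url = payload.get('_links', {}).get('next', {}).get('href')
    return pd.concat(frames, ignore_index=True)

df = get_all_pages('https://ws-public.interpol.int/notices/v1/red?arrestWarrantCountryId=BA&resultPerPage=20&page=1')
df.to_csv('data.csv', index=False)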
