Web scraping thousands of links using Python concurrent.futures

I am trying to scrape data from roughly a thousand links that all have the same structure and need the same extraction procedure. To speed up the process I am using Python's concurrent.futures, which I believe is the fastest option here. When I scrape about 30-40 links as a trial it works, but as the number increases it does not. Here is my code:
import re
import json
import requests
import concurrent.futures
import time
links_json = ['https://webgate.ec.europa.eu/rasff-window/backend/public/notification/view/id/485387/',
'https://webgate.ec.europa.eu/rasff-window/backend/public/notification/view/id/485256/',
'https://webgate.ec.europa.eu/rasff-window/backend/public/notification/view/id/487113/',
'https://webgate.ec.europa.eu/rasff-window/backend/public/notification/view/id/486733/',
'https://webgate.ec.europa.eu/rasff-window/backend/public/notification/view/id/486937/',
'https://webgate.ec.europa.eu/rasff-window/backend/public/notification/view/id/486946/',
'https://webgate.ec.europa.eu/rasff-window/backend/public/notification/view/id/485444/',
'https://webgate.ec.europa.eu/rasff-window/backend/public/notification/view/id/487258/',
'https://webgate.ec.europa.eu/rasff-window/backend/public/notification/view/id/487011/',
'https://webgate.ec.europa.eu/rasff-window/backend/public/notification/view/id/487254/']
MAX_THREADS = 30
Data_Source = "RASFF"
Product_Category = []
Date = []
Product_name = []
Reference = []
def scrape(links):
    data = requests.get(links).json()
    Product_Category.append(data["product"]["productCategory"]["description"])
    Date.append(data["ecValidationDate"])
    Product_name.append(data["product"]["description"])
    Reference.append(data["reference"])

def download_data(links_json):
    threads = min(MAX_THREADS, len(links_json))
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
        executor.map(scrape, links_json)

def main(new_links):
    t0 = time.time()
    download_data(new_links)
    t1 = time.time()
    print(f"{t1-t0} seconds to crawl {len(new_links)} in total.")

main(links_json)
When I run the main function the results are very inconsistent. Right now there are only 12 links to scrape, but as the number of links grows, the amount of data that actually ends up in the lists shrinks. For instance, with about 200 links there should be 200 values in the Product_Category list, yet sometimes there are only 100, or 67, and so on. I am not sure what I am missing. I even tried adding time.sleep(0.25) inside the scrape function, but it did not help. I don't know how to include a list of 500-1000 links here.

Here's an example of how one could do this using the threading module:
import requests
import threading

Product_Category = []
Date = []
Product_name = []
Reference = []

AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_5_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'
BASEURL = 'https://webgate.ec.europa.eu/rasff-window/backend/public/notification/view/id/'
LOCK = threading.Lock()
headers = {'User-Agent': AGENT}

links = ['485387',
         '485256',
         '487113',
         '486733',
         '486937',
         '486946',
         '485444',
         '487258',
         '487011',
         '487254']

def scrape(session, link):
    response = session.get(f'{BASEURL}{link}/', headers=headers)
    response.raise_for_status()
    json = response.json()
    try:
        LOCK.acquire()
        Product_Category.append(
            json["product"]["productCategory"]["description"])
        Date.append(json["ecValidationDate"])
        Product_name.append(json["product"]["description"])
        Reference.append(json["reference"])
    finally:
        LOCK.release()

def main():
    with requests.Session() as session:
        ta = []
        for link in links:
            t = threading.Thread(target=scrape, args=(session, link))
            ta.append(t)
            t.start()
        for t in ta:
            t.join()
    print(Product_Category)
    print(Date)
    print(Product_name)
    print(Reference)

if __name__ == '__main__':
    main()
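One likely reason the lists in the original code come up short is that executor.map swallows any exception raised inside scrape until its results are iterated, which the question's code never does, so every link that times out or returns unexpected JSON silently drops a record. Here is a minimal sketch (not from the answer above) that keeps the question's field names but returns one record per link and collects failures explicitly, so nothing disappears without a trace:

import concurrent.futures
import requests

MAX_THREADS = 30

def scrape(link):
    # Return the extracted fields instead of appending to shared lists,
    # so a failed request only loses its own record.
    data = requests.get(link, timeout=10).json()
    return {
        "Product_Category": data["product"]["productCategory"]["description"],
        "Date": data["ecValidationDate"],
        "Product_name": data["product"]["description"],
        "Reference": data["reference"],
    }

def download_data(links):
    results, failed = [], []
    threads = min(MAX_THREADS, len(links))
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
        futures = {executor.submit(scrape, link): link for link in links}
        for future in concurrent.futures.as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:
                # Keep the failing URL and its error so it can be retried later.
                failed.append((futures[future], exc))
    return results, failed

The failed list can then be retried or logged, which makes it obvious why the counts no longer match the number of links.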

Related

Create a dataframe by scraping from two or more different <div>s

I'm stuck on a little problem and hope you can help.
I want to create a df by scraping two parts of a web page. I seem to be stuck on the second part.
My requirement is to get a df with each horse name and the associated odds, e.g.
Horse Odds
name1 odd1
name2 odd2
I've used a sample page in the script, but it will be the same for any:
base url: https://www.racingtv.com/racecards/tomorrow
then select any time to get another page with the horse name and odds details etc.
import requests
import pandas as pd
from bs4 import BeautifulSoup

def main():
    # base url is https://www.racingtv.com/racecards/tomorrow
    # select any time to get the horse name and odds details etc.
    url = 'https://www.racingtv.com/racecards/catterick-bridge/372180-watch-racing-tv-now-novices-hurdle-gbb-race?'
    res = requests.get(url)
    soup = BeautifulSoup(res.content, "html.parser")
    strike = soup.select('div', class_='data-strike-out-group')
    # this bit seems to be working
    for data in soup.find_all('div',
                              class_='racecard__runner__column racecard__runner__name'):
        for a in data.find_all('a'):
            print(a.text)
    # this bit sort of works but it seems to repeat the first three items of data
    for odds in soup.find_all('div',
                              class_='racecard__runner__column racecard__runner__column--price'):
        for odd1 in odds.find_all('ruk-odd'):
            print(odd1.text)
    # I tried this to work out how to stop getting the three duplicates but it does not work
    for odds in strike.select('div',
                              class_='racecard__runner__column racecard__runner__column--price'):
        for odd1 in odds.find_all('ruk-odd'):
            print(odd1.text)
    return

if __name__ == '__main__':
    main()
class_='data-strike-out-group'
This isn't a class; check the raw HTML. It's an attribute of the div... weird.
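Since data-strike-out-group is an attribute rather than a class, one way to target those containers (a sketch, assuming the attribute sits on the group div as the raw HTML suggests) is to match the attribute itself:

from bs4 import BeautifulSoup
import requests

url = 'https://www.racingtv.com/racecards/catterick-bridge/372180-watch-racing-tv-now-novices-hurdle-gbb-race?'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# find_all can match on arbitrary attributes; True means "attribute is present".
groups = soup.find_all('div', attrs={'data-strike-out-group': True})

# The equivalent CSS attribute selector also works with select().
groups_css = soup.select('div[data-strike-out-group]')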
Glad you posted this, might end up using this site for a personal project. Figured you'd be interested in this code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime

headers = {
    'accept': '*/*',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
}

url = 'https://www.racingtv.com/racecards/catterick-bridge/372180-watch-racing-tv-now-novices-hurdle-gbb-race?'
resp = requests.get(url, headers=headers)
print(resp)
soup = BeautifulSoup(resp.text, 'html.parser')

table = soup.find('div', {'class': 'page__content__section racecard'})
race_id = url.split('/')[-1].split('-')[0]
race_name = soup.find('div', class_='race__name').text.strip()
race_date = soup.find('div', class_='race__date').text.strip()
clean_date = datetime.strptime(race_date, '%d %b %Y').strftime('%Y%m%d')
race_info1 = soup.find_all('div', class_='race__subtitle')[0].text.strip()
race_info2 = soup.find_all('div', class_='race__subtitle')[1].text.strip()

final = []
for row in table.find_all('div', class_='racecard__runner--content'):
    try:
        num = row.find('div', class_='racecard__runner__cloth-number').text.strip()
        last_days_ugly = row.find('div', class_='racecard__runner__name').find('a').find('sup').text
        horse_name = row.find('div', class_='racecard__runner__name').find('a').text.strip().replace(last_days_ugly, '')
        horse_link = 'http://www.racingtv.com' + row.find('div', class_='racecard__runner__name').find('a')['href']
        last_race_days = last_days_ugly.strip().replace('(', '').replace(')', '')
        for people in row.find_all('div', class_='racecard__runner__person'):
            if 'J:' in people.getText():
                jockey = people.find('a').text.strip()
                jockey_link = 'http://www.racingtv.com' + people.find('a')['href']
            if 'T:' in people.getText():
                trainer = people.find('a').text.strip()
                trainer_link = 'http://www.racingtv.com' + people.find('a')['href']
        form = row.find('div', class_='racecard__runner__column--form_lr').find_all('div')[0].text.strip()
        equip = row.find('div', class_='racecard__runner__column--form_lr').find_all('div')[1].text.strip()
        weight = row.find('div', class_='racecard__runner__column--weight_age').find_all('div')[0].text.strip()
        age = row.find('div', class_='racecard__runner__column--weight_age').find_all('div')[1].text.strip()
        o_r = row.find('div', class_='racecard__runner__column--or').text.strip()
        odds = row.find('div', class_='racecard__runner__column--price').getText()
        odds_dec = row.find('div', class_='racecard__runner__column--price').find('ruk-odd')['data-js-odds-decimal']
        odds_data = row.find('div', class_='racecard__runner__column--price').find('ruk-odd')['data-js-odd-alternatives']
    except AttributeError:  # skip blank starting gates
        continue
    item = {
        'race_url': url,
        'race_id': race_id,
        'race_name': race_name,
        'race_date': clean_date,
        'race_info1': race_info1,
        'race_info2': race_info2,
        'num': num,
        'horse_name': horse_name,
        'horse_link': horse_link,
        'last_race_days': last_race_days,
        'jockey': jockey,
        'jockey_link': jockey_link,
        'trainer': trainer,
        'trainer_link': trainer_link,
        'form': form,
        'equip': equip,
        'weight': weight,
        'age': age,
        'o_r': o_r,
        'odds': odds,
        'odds_dec': odds_dec,
        'odds_data': odds_data
    }
    final.append(item)

df = pd.DataFrame(final)
df.to_csv('racingtv.csv', index=False)
print('Saved to racingtv.csv')
Following on from the script kindly supplied by bushcat69, and my subsequent question about how to get the race time into the df, I have cobbled together some code (cut and paste from other sites). I thought you might be interested. It may not be elegant, but it seems to work. The line
race_data.extend(get_racecards_data(url_race, date, racetime))
is used to pass the URL etc. to the bushcat69 script.
Thanks again.
def get_meetings():
    global date
    global date_ext
    odds_date = date_ext
    url = f'https://www.racingtv.com/racecards/{date_ext}'
    try:
        res = requests.get(url, headers=headers)
    except:
        print('Date or Connection error occurred! \nTry again!!')
        return
    soup = BeautifulSoup(res.text, 'html.parser')
    meetings = soup.select('.race-selector__times__race')
    course_num = len(meetings)
    meetings1 = [a['href'] for a in soup.select('.race-selector__times__race')]
    course_num = len(meetings1)
    cnt01 = 0
    if course_num == 0:
        print('Provide an upcoming valid date')
        return
    for track in meetings1[:course_num]:
        cnt01 = cnt01 + 1
        trackref = track.split("/")[2]
        print(cnt01, ": ", trackref)
    need = input(f'{course_num} courses found \nHow many courses to scrape? Press \'a\' for all :\n')
    if need == 'a':
        n = course_num
    else:
        try:
            n = int(need)
        except:
            print('Invalid input !')
            return
    cnt01 = 0
    race_data = []
    for mtm in meetings[:course_num]:
        cnt01 = cnt01 + 1
        racetime = mtm.text
        href = mtm.attrs
        htxt = Text(href)
        url_race = htxt.partition("/")[2]
        url_race = "/" + url_race.rpartition("'")[0]
        print(cnt01, racetime, url_race)
        time.sleep(1)
        race_data.extend(get_racecards_data(url_race, date, racetime))
        print(f"Meeting {url_race.split('/')[2]} scraping completed")
        if cnt01 == n:
            break
    df_race = pd.DataFrame(race_data)
    df = df_race

Python Web Scraping - Is it not possible to scrape this site?

I want to scrape the following website: https://www.globenewswire.com/NewsRoom
My goal is to store the press releases and articles in a database that I use later on. I've done this with other news sites too and removed that code here for readability (it has no influence on the code shown). My problem is that I can't figure out how to scrape the headlines, links and other data, since the HTML is structured with unusual attributes.
The following code is how I approached it. Maybe someone has an idea of what mistakes I made in the scraping. Any help is gladly appreciated.
import requests
import sqlite3
import Keywords
from bs4 import BeautifulSoup
from time import sleep
from random import randint
from datetime import datetime
from datetime import timedelta

# ----- Initializing Database & Notification Service -----
connect = sqlite3.connect('StoredArticles.db')
cursor = connect.cursor()
print("Connection created.")

try:
    cursor.execute('''CREATE TABLE articlestable (article_time TEXT, article_title TEXT, article_keyword TEXT,
                   article_link TEXT, article_description TEXT, article_entry_time DATETIME)''')
    cursor.execute('''CREATE UNIQUE INDEX index_article_link ON articlestable(article_link)''')
except:
    pass
print("Table ready.")

while True:
    class Scrapers:
        # ----- Initialize Keywords -----
        def __init__(self):
            self.article_keyword = None
            self.article_title = None
            self.article_link = None
            self.article_time = None
            self.article_time_drop = None
            self.article_description = None
            self.article_entry_time = None
            self.headers = {
                'User-Agent':
                    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko)' +
                    'Version/14.0.1 Safari/605.1.15'
            }

        def scraping_globenewswire(self, page):
            url = 'https://www.globenewswire.com/NewsRoom?page=' + str(page)
            r = requests.get(url, headers=self.headers)
            soup = BeautifulSoup(r.text, 'html.parser')
            articles = soup.select('.main-container > .row')
            print("GlobeNewswire - Scraping page " + str(page) + "...")
            sleep(randint(0, 1))
            for item in articles:
                self.article_title = item.select_one('a[data-autid="article-url"]').text.strip()
                self.article_time = item.select_one('span[data-autid="article-published-date"]').text.strip()
                self.article_link = 'https://www.globenewswire.com' + \
                    item.select_one('a[data-autid="article-url"]')['href']
                self.article_description = item.select_one('span', _class='pagging-list-item-text-body').text.strip()
                self.article_entry_time = datetime.now()
                cursor.execute('''INSERT OR IGNORE INTO articlestable VALUES(?,?,?,?,?,?)''',
                               (self.article_time, self.article_title, self.article_keyword, self.article_link,
                                self.article_description, self.article_entry_time))
                print(self.article_title)
            return

    # ----- End of Loops -----
    scraper = Scrapers()

    # ----- Range of Pages to scrape through -----
    for x in range(1, 3):
        scraper.scraping_globenewswire(x)

    # ----- Add to Database -----
    connect.commit()
    print("Process done. Starting to sleep again. Time: " + str(datetime.now()))
    sleep(randint(5, 12))
I extracted all the headlines from page=1 of the given URL.
The headlines are inside an <a> with the attribute data-autid equal to article-url.
Select all the <a> tags with that attribute using findAll().
Iterate over the selected <a> tags and extract the headlines, i.e. the text.
You can extend this approach to extract whatever other data you need.
This code will print all the headlines from page=1 of the given URL.
import requests
import bs4 as bs

url = 'https://www.globenewswire.com/NewsRoom'
resp = requests.get(url)
soup = bs.BeautifulSoup(resp.text, 'lxml')

headlines = soup.findAll('a', attrs={'data-autid': 'article-url'})
for i in headlines:
    print(i.text, end="\n")
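If you also want the link and the published date next to each headline, the same attribute lookup works per row; here is a sketch reusing the data-autid values and the .row container that already appear in the question's code (the walk up to the parent row is an assumption, not something verified against the live page):

import requests
from bs4 import BeautifulSoup

url = 'https://www.globenewswire.com/NewsRoom'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

articles = []
for link in soup.find_all('a', attrs={'data-autid': 'article-url'}):
    # Walk up to the enclosing row so the date that belongs to this headline is found.
    row = link.find_parent('div', class_='row')
    date = row.find('span', attrs={'data-autid': 'article-published-date'}) if row else None
    articles.append({
        'title': link.text.strip(),
        'link': 'https://www.globenewswire.com' + link.get('href', ''),
        'date': date.text.strip() if date else None,
    })

print(articles[:3])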

How to web scrape a list of a website's URLs with multiprocessing after logging in, using Python

First of all, I am a beginner with Python. I am trying to create a script that does the following:
login to a website using Selenium
load a list of the website's URLs from a CSV file
web scrape data using multiprocessing method
I am using the following script:
import csv
import time
from time import sleep
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

# Load URLs from CSV
def mycontents():
    contents = []
    with open('global_csv.csv', 'r') as csvf:
        reader = csv.reader(csvf, delimiter=";")
        for row in reader:
            contents.append(row[1])  # Add each url to list contents
    return contents

# parse a single item to get information
def parse(url):
    headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    r = requests.get(url, headers, timeout=10)
    sleep(3)
    info = []
    availability_text = '-'
    price_text = '-'
    if r.status_code == 200:
        print('Processing..' + url)
        html = r.text
        soup = BeautifulSoup(html, 'html.parser')
        time.sleep(4)
        price = soup.select(".price")
        if price is not None:
            price_text = price.text.strip()
            print(price_text)
        else:
            price_text = "0,00"
            print(price_text)
        availability = soup.find('span', attrs={'class': 'wholesale-availability'})
        if availability is not None:
            availability_text = availability.text.strip()
            print(availability_text)
        else:
            availability_text = "Not Available"
            print(availability_text)
    info.append(price_text)
    info.append(availability_text)
    return ';'.join(info)

web_links = None
web_links = mycontents()

# Insert first row
fields = ['SKU', 'price', 'availability']
with open('output_global.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(fields)

if __name__ == "__main__":
    # Load webdriver
    browser = webdriver.Chrome('C:\\chromedriver.exe')
    browser.get('TheLoginPage')
    # Find username field
    username = browser.find_element_by_id('email')
    username.send_keys('myusername')
    # Find password field
    password = browser.find_element_by_id('pass')
    time.sleep(2)
    password.send_keys('mypassword')
    # Find connect button
    sing_in = browser.find_element_by_xpath('//*[@id="send2"]')
    sing_in.click()
    # Start multiprocessing
    with Pool(4) as p:
        records = p.map(parse, web_links)
    if len(records) > 0:
        with open('output_global.csv', 'a') as f:
            f.write('\n'.join(records))
When I run the script it doesn't scrape anything; the command window just shows the URLs, which makes me think that even if I log in successfully, the requests use a different session?!
I tried to save the session by putting it inside the parse method, or under
if __name__ == "__main__":
I also tried to connect the requests to the browser's session, but I get errors like
You have not defined a session
TypeError: get() takes 2 positional arguments but 3 were given
local variable 'session' referenced before assignment
How can I practically log in to the website and use multiprocessing to scrape the URLs I need?
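There is no accepted fix in this thread, but one common pattern (a sketch only, not verified against this particular site) is to copy the cookies out of the logged-in Selenium browser into a plain dict and hand that dict to every worker, because neither the webdriver nor a requests.Session can be shared across processes directly:

import csv
from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

def parse(args):
    url, cookies = args
    headers = {'user-agent': 'Mozilla/5.0'}
    # Build a fresh session in this worker and replay the login cookies into it.
    with requests.Session() as session:
        session.headers.update(headers)
        session.cookies.update(cookies)
        r = session.get(url, timeout=10)
        soup = BeautifulSoup(r.text, 'html.parser')
        price = soup.select_one('.price')
        return url, price.text.strip() if price else '0,00'

if __name__ == '__main__':
    browser = webdriver.Chrome('C:\\chromedriver.exe')
    browser.get('TheLoginPage')
    # ... perform the login steps from the question here ...
    # Turn Selenium's cookie list into a plain dict that can be pickled.
    cookies = {c['name']: c['value'] for c in browser.get_cookies()}
    browser.quit()

    web_links = [row[1] for row in csv.reader(open('global_csv.csv'), delimiter=';')]
    with Pool(4) as p:
        records = p.map(parse, [(url, cookies) for url in web_links])
    print(records)

Each worker rebuilds its own requests.Session from the cookie dict, so the scraping requests are sent as the logged-in user.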

How to properly store BeautifulSoup objects for later use [duplicate]

I have some code that is quite long, so it takes a long time to run. I want to simply save either the requests object (in this case "name") or the BeautifulSoup object (in this case "soup") locally so that next time I can save time. Here is the code:
from bs4 import BeautifulSoup
import requests
url = 'SOMEURL'
name = requests.get(url)
soup = BeautifulSoup(name.content)
Since name.content is just HTML, you can just dump this to a file and read it back later.
Usually the bottleneck is not the parsing, but instead the network latency of making requests.
from bs4 import BeautifulSoup
import requests

url = 'https://google.com'
name = requests.get(url)

# name.content is bytes, so write the file in binary mode
with open("/tmp/A.html", "wb") as f:
    f.write(name.content)

# read it back in
with open("/tmp/A.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# do something with soup
Here is some anecdotal evidence that the bottleneck is the network.
from bs4 import BeautifulSoup
import requests
import time

url = 'https://google.com'
t1 = time.perf_counter()
name = requests.get(url)
t2 = time.perf_counter()
soup = BeautifulSoup(name.content, "html.parser")
t3 = time.perf_counter()
print(t2 - t1, t3 - t2)
Output, from running on a Thinkpad X1 Carbon with a fast campus network.
0.11 0.02
Storing requests locally and restoring them as Beautiful Soup objects later on
If you are iterating through the pages of a web site, you can store each page with requests as explained here.
Create a folder soupCategory in the same folder as your script.
Use any recent user agent for the headers:
headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15'}
import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def getCategorySoup():
    session = requests.Session()
    retry = Retry(connect=7, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    basic_url = "https://www.somescrappingdomain.com/apartments?adsWithImages=1&page="
    t0 = time.time()
    j = 0
    totalPages = 1525  # put your number of pages here
    for i in range(1, totalPages):
        url = basic_url + str(i)
        r = session.get(url, headers=headers)  # use the session so the retry adapter applies
        pageName = "./soupCategory/" + str(i) + ".html"
        with open(pageName, mode='w', encoding='UTF-8', errors='strict', buffering=1) as f:
            f.write(r.text)
        print(pageName, end=" ")
    t1 = time.time()
    total = t1 - t0
    print("Total time for getting ", totalPages, " category pages is ", round(total), " seconds")
    return
Later on you can create the Beautiful Soup object, as @merlin2011 mentioned, with:
with open("./soupCategory/1.html") as f:
    soup = BeautifulSoup(f, "html.parser")
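To process everything that was saved, the stored pages can be walked in a loop without touching the network again; a small sketch assuming the ./soupCategory layout written above:

import os
from bs4 import BeautifulSoup

soups = []
for name in sorted(os.listdir('./soupCategory')):
    # Each saved page becomes its own soup object, parsed entirely from disk.
    with open(os.path.join('./soupCategory', name), encoding='UTF-8') as f:
        soups.append(BeautifulSoup(f, 'html.parser'))

print(len(soups), 'pages parsed from disk')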

Simple web crawler very slow

I have built a very simple web crawler to crawl ~100 small JSON files at the URL below. The issue is that the crawler takes more than an hour to complete. I find that hard to understand given how small the JSON files are. Am I doing something fundamentally wrong here?
import json
import requests
from lxml import html

def get_senate_vote(vote):
    URL = 'https://www.govtrack.us/data/congress/113/votes/2013/s%d/data.json' % vote
    response = requests.get(URL)
    json_data = json.loads(response.text)
    return json_data

def get_all_votes():
    all_senate_votes = []
    URL = "http://www.govtrack.us/data/congress/113/votes/2013"
    response = requests.get(URL)
    root = html.fromstring(response.content)
    for a in root.xpath('/html/body/pre/a'):
        link = a.xpath('text()')[0].strip()
        if link[0] == 's':
            vote = int(link[1:-1])
            try:
                vote_json = get_senate_vote(vote)
            except:
                return all_senate_votes
            all_senate_votes.append(vote_json)
    return all_senate_votes

vote_data = get_all_votes()
Here is a rather simple code sample where I've calculated the time taken for each call. On my system it takes on average 2 seconds per request, and there are 582 pages to visit, so that is around 19 minutes without printing the JSON to the console. In your case network time plus print time may increase it.
#!/usr/bin/python
import requests
import re
import time

def find_votes():
    r = requests.get("https://www.govtrack.us/data/congress/113/votes/2013/")
    data = r.text
    votes = re.findall(r's\d+', data)
    return votes

def crawl_data(votes):
    print("Total pages: " + str(len(votes)))
    for x in votes:
        url = 'https://www.govtrack.us/data/congress/113/votes/2013/' + x + '/data.json'
        t1 = time.time()
        r = requests.get(url)
        json = r.json()
        print(time.time() - t1)

crawl_data(find_votes())
If you are using Python 3.x and crawling multiple sites, I warmly suggest the aiohttp module for even better performance; it is built on asynchronous principles.
For example:
import aiohttp
import asyncio

sites = ['url_1', 'url_2']
results = []

def save_response(result):
    site_content = result.result()
    results.append(site_content)

async def crawl_site(site):
    async with aiohttp.ClientSession() as session:
        async with session.get(site) as resp:
            resp = await resp.text()
            return resp

tasks = []
for site in sites:
    task = asyncio.ensure_future(crawl_site(site))
    task.add_done_callback(save_response)
    tasks.append(task)

all_tasks = asyncio.gather(*tasks)

loop = asyncio.get_event_loop()
loop.run_until_complete(all_tasks)
loop.close()

print(results)
For more, see the aiohttp documentation.
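When the URL list grows into the hundreds, it is worth capping how many requests are in flight at once; here is a sketch (assuming Python 3.7+, using asyncio.run and a semaphore instead of the manual event loop above):

import asyncio
import aiohttp

async def fetch(session, semaphore, url):
    # The semaphore keeps at most `limit` requests in flight at any moment.
    async with semaphore:
        async with session.get(url) as resp:
            return await resp.text()

async def crawl(urls, limit=20):
    semaphore = asyncio.Semaphore(limit)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, semaphore, u) for u in urls))

results = asyncio.run(crawl(['url_1', 'url_2']))
print(len(results), 'pages fetched')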
