I have a simple Python script that loops through a dictionary, where the keys are url links. I have to extract some info from each link and store the info into another dictionary. The code below is the first part of the function and the code seems to work as expected.
But it opens only one link at time, while I think I might get some improvement in run time if I do this in parallel. Do you have any suggestion how I can achieve this in a simple way in Python?
def updater(local):
links = myItems['links']
for link in links.keys():
page = requests.get(link)
soup = BeautifulSoup(page.content, 'html.parser')
newsoup = soup.find("div", {"id": "overviewQuickstatsBenchmarkDiv"})
rows = newsoup.findAll('tr')[1]
counter = 0
date = ""
for td in rows.findAll('td'):
counter += 1
if td.contents[0] == 'Date':
date = td.text.replace("Date", "")
elif counter == 2:
pass
elif counter == 3:
price = re.findall("\d+\.\d+", td.string)[0]
Here's what I tried using multiprocessing (but I cannot get any result and the code seems to run doing nothing):
def read(url):
result = {'link': url, 'data': requests.get(url)}
print "Reading: " + url
return result
def updater(local):
links = myItems['links']
pool = Pool(processes=5)
results = pool.map(read, links.keys())
for link in links.keys():
# need to read the results and store data into a dictionary
Related
Context: I'm working on pagination of this website: https://skoodos.com/schools-in-uttarakhand. When I inspected, this website has no proper number of pages visible except the next button which is ?page=2 after the url. Also, searching for page-link gave me number 20 at the end. So I assumed that the total number of pages is 20, upon checking manually, I learnt that, there only exist 11 pages.
After many trials and errors, I finally decided to go with just the indexing from 0 until 12 (12 is excluded by python however).
What I want to know is that, how wold you go about figuring out the number of pages on a particular website that doesn't show the actual number of pages other than previous and next button and how can I optimize this in terms of the same?
Here's my solution to pagination. Any way to optimize this other than me manually finding the number of pages?
from myWork.commons import url_parser, write
def data_fetch(url):
school_info = []
for page_number in range(0, 4):
next_web_page = url + f'?page={page_number}'
soup = url_parser(next_web_page)
search_results = soup.find('section', {'id': 'search-results'}).find(class_='container').find(class_='row')
# rest of the code
for page_number in range(4, 12):
next_web_page = url + f'?page={page_number}'
soup = url_parser(next_web_page)
search_results = soup.find('section', {'id': 'search-results'}).find(class_='container').find(class_='row')
# rest of the code
def main():
url = "https://skoodos.com/schools-in-uttarakhand"
data_fetch(url)
if __name__ == "__main__":
main()
Each of your pages (except the last one) will have an element like this:
<a class="page-link"
href="https://skoodos.com/schools-in-uttarakhand?page=2"
rel="next">Next »</a>
E.g. you can extract the link as follows (here for the first page):
link = soup.find('a', class_='page-link', href=True, rel='next')
print(link['href'])
https://skoodos.com/schools-in-uttarakhand?page=2
So, you could make your function recursive. E.g. use something like this:
import requests
from bs4 import BeautifulSoup
def data_fetch(url, results = list()):
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'lxml')
search_results = soup.find('section', {'id': 'search-results'})\
.find(class_='container').find(class_='row')
results.append(search_results)
link = soup.find('a', class_='page-link', href=True, rel='next')
# link will be `None` for last page (i.e. `page=11`)
if link:
# just adding some prints to show progress of iteration
if not 'page' in url:
print('getting page: 1', end=', ')
url = link['href']
# subsequent page nums being retrieved
print(f'{url.rsplit("=", maxsplit=1)[1]}', end=', ')
# recursive call
return data_fetch(url, results)
else:
# `page=11` with no link, we're done
print('done')
return results
url = 'https://skoodos.com/schools-in-uttarakhand'
data = data_fetch(url)
So, a call to this function will print progress as:
getting page: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, done
And you'll end up with data with 11x bs4.element.Tag, one for each page.
print(len(data))
11
print(set([type(d) for d in data]))
{<class 'bs4.element.Tag'>}
Good luck with extracting the required info; the site is very slow, and the HTML is particularly sloppy and inconsistent. (e.g. you're right to note there is a page-link elem, which suggests there are 20 pages. But its visibility is set to hidden, so apparently this is just a piece of deprecated/unused code.)
There's a bit at the top that says "Showing the 217 results as per selected criteria". You can code to extract the number from that, then count the number number of results per page and divide by that to get the expected number of pages (don't forget to round up ).
If you want to double check, add more code to go to the calculated last page and
if there's no such page, keep decrementing the total and checking until you hit a page that exists
if there is such a page, but it has an active/enabled "Next" button, keep going to Next page until reaching the last (basically as you are now)
(Remember that the two listed above are contingencies and wouldn't be executed in an ideal scenario.)
So, just to find the number of pages, you could do something like:
import requests
from bs4 import BeautifulSoup
import math
def soupFromUrl(scrapeUrl):
req = requests.get(scrapeUrl)
if req.status_code == 200:
return BeautifulSoup(req.text, 'html.parser')
else:
raise Exception(f'{req.reason} - failed to scrape {scrapeUrl}')
def getPageTotal(url):
soup = soupFromUrl(url)
#totalResults = int(soup.find('label').get_text().split('(')[-1].split(')')[0])
totalResults = int(soup.p.strong.get_text()) # both searches should work
perPageResults = len(soup.select('.m-show')) #probably always 20
print(f'{perPageResults} of {totalResults} results per page')
if not (perPageResults > 0 and totalResults > 0):
return 0
lastPageNum = math.ceil(totalResults/perPageResults)
# Contingencies - will hopefully never be needed
lpSoup = soupFromUrl(f'{url}?page={lastPageNum}')
if lpSoup.select('.m-show'): #page exists
while lpSoup.select_one('a[rel="next"]'):
nextLink = lpSoup.select_one('a[rel="next"]')['href']
lastPageNum = int(nextLink.split('page=')[-1])
lpSoup = soupFromUrl(nextLink)
else: #page does not exist
while not (lpSoup.select('.m-show') or lastPageNum < 1):
lastPageNum = lastPageNum - 1
lpSoup = soupFromUrl(f'{url}?page={lastPageNum}')
# end Contingencies section
return lastPageNum
However, it looks like you only want the total pages in order to start the for-loop, but it's not even necessary to use a for-loop at all - a while-loop might be better:
def data_fetch(url):
school_info = []
nextUrl = url
while nextUrl:
soup = soupFromUrl(nextUrl)
#GET YOUR DATA FROM PAGE
nextHL = soup.select_one('a[rel="next"]')
nextUrl = nextHL.get('href') if nextHL else None
# code after fetching all pages' data
Although, you could still use for-loop if you had a maximum page number in mind:
def data_fetch(url, maxPages):
school_info = []
for p in range(1, maxPages+1):
soup = soupFromUrl(f'{url}?page={p}')
if not soup.select('.m-show'):
break
#GET YOUR DATA FROM PAGE
# code after fetching all pages' data [upto max]
I am a beginner in web scraping, and I need help with this problem.
The website, allrecipes.com, is a website where you can find recipes based on a search, which in this case is 'pie':
link to the html file:
'view-source:https://www.allrecipes.com/search/results/?wt=pie&sort=re'
(right click-> view page source)
I want to create a program that takes a input, searches it up on allrecipes, and returns a list with tuples of the first five recipes with data such as the time that takes to make, serving yield, ingrediants, and more.
This is my program so far:
import requests
from bs4 import BeautifulSoup
def searchdata():
inp=input('what recipe would you like to search')
url ='http://www.allrecipes.com/search/results/?wt='+str(inp)+'&sort=re'
r=requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
links=[]
#fill in code for finding top 3 or five links
for i in range(3)
a = requests.get(links[i])
soupa = BeautifulSoup(a.text, 'html.parser')
#fill in code to find name, ingrediants, time, and serving size with data from soupa
names=[]
time=[]
servings=[]
ratings=[]
ingrediants=[]
searchdata()
Yes, i know, my code is very messy but What should I fill in in the two code fill-in areas?
Thanks
After searching for the recipe you have to get the links of each recipe and then request again for each of those links, because the information you're looking for is not available on the search page. That would not look clean without OOP so here's the class I wrote that does what you want.
import requests
from time import sleep
from bs4 import BeautifulSoup
class Scraper:
links = []
names = []
def get_url(self, url):
url = requests.get(url)
self.soup = BeautifulSoup(url.content, 'html.parser')
def print_info(self, name):
self.get_url(f'https://www.allrecipes.com/search/results/?wt={name}&sort=re')
if self.soup.find('span', class_='subtext').text.strip()[0] == '0':
print(f'No recipes found for {name}')
return
results = self.soup.find('section', id='fixedGridSection')
articles = results.find_all('article')
texts = []
for article in articles:
txt = article.find('h3', class_='fixed-recipe-card__h3')
if txt:
if len(texts) < 5:
texts.append(txt)
else:
break
self.links = [txt.a['href'] for txt in texts]
self.names = [txt.a.span.text for txt in texts]
self.get_data()
def get_data(self):
for i, link in enumerate(self.links):
self.get_url(link)
print('-' * 4 + self.names[i] + '-' * 4)
info_names = [div.text.strip() for div in self.soup.find_all(
'div', class_='recipe-meta-item-header')]
ingredient_spans = self.soup.find_all('span', class_='ingredients-item-name')
ingredients = [span.text.strip() for span in ingredient_spans]
for i, div in enumerate(self.soup.find_all('div', class_='recipe-meta-item-body')):
print(info_names[i].capitalize(), div.text.strip())
print()
print('Ingredients'.center(len(ingredients[0]), ' '))
print('\n'.join(ingredients))
print()
print('*' * 50, end='\n\n')
chrome = Scraper()
chrome.print_info(input('What recipe would you like to search: '))
I am trying to crawl all news link that has a certain keyword that is looking for.
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import re
key_word = urllib.parse.quote("금리")
url = "https://search.naver.com/search.naver?where=news&query=" + key_word +"%EA%B8%88%EB%A6%AC&sm=tab_opt&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14&docid=&nso=so%3Ar%2Cp%3Afrom20200413to20200414%2Ca%3Aall&mynews=0&refresh_start=0&related=0"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
anchor_set = soup.findAll('a')
news_link = []
for a in anchor_set:
if str(a).find('https://news.naver.com/main/read.nhn?') != -1:
a = a.get('href')
news_link.append(a)'
Untill this section (code above), I parse into the url and retrieve all links that has a certian read.nhn(naver news platform) and append it to news_link.
This is working fine, but the proble is the url used above only shows 10 articles in the page.
count_tag = soup.find("div",{"class","title_desc all_my"})
count_text=count_tag.find("span").get_text().split()
total_num=count_text[-1][0:-1].replace(",","")
print(total_num)'
Using the code above I've found out there are a total of 1297 articles that I need to collect. but since the original link above only has 10 articles in the page.
for val in range(int(total_num)//10+1):
start_val=str(val*10+1)
I was told i needed to insert this into the url to retrieve ALL newslinks.
Thus, I've used the while method
while start_val <= total_num:
url = "https://search.naver.com/search.naver?where=news&query=" + key_word +"%EA%B8%88%EB%A6%AC&sm=tab_opt&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14&docid=&nso=so%3Ar%2Cp%3Afrom20200413to20200414%2Ca%3Aall&mynews=0&refresh_start=" + start_val + "&related=0"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
news_link = []
anchor_set = soup.findAll('a')
for a in anchor_set:
if str(a).find('https://news.naver.com/main/read.nhn?') != -1:
a = a.get('href')
news_link.append(a)
However, when I run the program, it seems the loop does not stop. obviously there is no else or break.. How can i break this loop and successfully collect all the links?
Your current while loop doesn't stop because you haven't incremented the value of start_val. Also, later you have range(int(total_num)//10+1) so if you converted total_num to a string, then the string comparison in while start_val <= total_num is wrong - for strings "21" > "1297", because "2" > "1". Compare them as int's.
And since you're creating the sequence of vals to use, you don't need a separate upper bound check.
So far, this would give you the correct finite loop:
for val in range(int(total_num)//10+1): # no upper bound check needed
start_val=str(val*10+1)
url = "https://search.naver.com/search.naver?where=news&query=" ...
html = urllib.request.urlopen(url).read()
...
For the values needed for the pages/next starting item, instead of doing:
for val in range(int(total_num)//10+1):
start_val = str(val*10+1)
You can get the actual val's from range(). To starting at 1 and going in steps of 10 to get: 1, 11, 21, ... , upto and including the total:
for val in range(1, total_num + 1, 10):
start_val = str(val) # don't need this assignment actually
Next thing: the URL for page 2 onwards is wrong. Currently, your while loop will generate the following URL for page 2:
https://search.naver.com/search.naver?where=news&query=%EA%B8%88%EB%A6%AC%EA%B8%88%EB%A6%AC&sm=tab_opt&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14&docid=&nso=so%3Ar%2Cp%3Afrom20200413to20200414%2Ca%3Aall&mynews=0&refresh_start=11&related=0
But if you click on page "2" of the results, you get the URL:
https://search.naver.com/search.naver?&where=news&query=%EA%B8%88%EB%A6%AC%EA%B8%88%EB%A6%AC&sm=tab_pge&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14&docid=&nso=so:r,p:from20200413to20200414,a:all&mynews=0&cluster_rank=35&start=11&refresh_start=0
The main difference is at the end: &refresh_start=11 in yours vs &start=11&refresh_start=0 actual. Since that format also works for page 1 (just checked), use that instead.
You have some extra characters in the section after the keyword: ...&query=" + key_word +"%EA%B8%88%EB%A6%AC&sm=tab_opt. That %EA%B8%88%EB%A6%AC is from your previous search keyword.
You can also skip several unneeded URL parameters, by testing which are actually not needed.
Putting all that together:
for val in range(1, total_num + 1, 10):
start_val = str(val)
url = ("https://search.naver.com/search.naver?&where=news&query=" +
key_word +
"&sm=tab_pge&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2020.04.13&de=2020.04.14" +
"&docid=&nso=so:r,p:from20200413to20200414,a:all&mynews=0&cluster_rank=51" +
"&refresh_start=0&start=" +
start_val)
html = urllib.request.urlopen(url).read()
... # etc.
I'm having an issue. It loops through the list of URLS, but it's not adding the text content of each page scraped to the presults list.
I haven't gotten to the raw text processing yet. I'll probably make a question for that once I get there if I can't figure out.
What is wrong here? The length of presults remains at 1 even though it seems to be looping through the list of urls for the scrape...
Here's part of the code I'm having an issue with:
counter=0
for xa in range(0,len(qresults)):
pageURL=qresults[xa].format()
pageresp= requests.get(pageURL, headers=headers)
if pageresp.status_code==200:
print(pageURL)
psoup=BeautifulSoup(pageresp.content, 'html.parser')
presults=[]
para=psoup.text
presults.append(para)
print(len(presults))
else: print("Could not reach domain")
print(len(presults))
Your immediate problem is here:
presults=[]
para=psoup.text
presults.append(para)
On every for iteration, you replace your existing presults list with the empty list and add one item. On the next iteration, you again wipe out the previous result.
Your initialization must be done only once and that before the loop:
presults = []
for xa in range(0,len(qresults)):
Ok, I don't even see you looping through any URLs here, but below is a generic example of how this kind of request can be achieved.
import requests
from bs4 import BeautifulSoup
base_url = "http://www.privredni-imenik.com/pretraga?abcd=&keyword=&cities_id=0&category_id=0&sub_category_id=0&page=1"
current_page = 1
while current_page < 200:
print(current_page)
url = base_url + str(current_page)
#current_page += 1
r = requests.get(url)
zute_soup = BeautifulSoup(r.text, 'html.parser')
firme = zute_soup.findAll('div', {'class': 'jobs-item'})
for title in firme:
title1 = title.findAll('h6')[0].text
print(title1)
adresa = title.findAll('div', {'class': 'description'})[0].text
print(adresa)
kontakt = title.findAll('div', {'class': 'description'})[1].text
print(kontakt)
print('\n')
page_line = "{title1}\n{adresa}\n{kontakt}".format(
title1=title1,
adresa=adresa,
kontakt=kontakt
)
current_page += 1
I've created a script in python to parse the website address of different agencies from it's landing page and the location address from it's inner page. What I can't understand is how can i return a string and a list at the same time in order for them to be reused in another function. To be clearer: I wish to return the website address and list of links from collect_links() function and reuse them in get_info() function. My current approach throws an error - ValueError: not enough values to unpack (expected 2, got 1).
Tis is my attempt so far:
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
def collect_links(link):
res = requests.get(link)
soup = BeautifulSoup(res.text, "lxml")
website = [soup.select_one("p.company-profile-website > a").get("href")]
items = [urljoin(url,item.get("href")) for item in soup.select("[id^='company-'] .search-companies-result-info h2 > a")]
return website,items
def get_info(website,link):
res = requests.get(link)
soup = BeautifulSoup(res.text, "lxml")
address = soup.select_one("p.footer-right").get_text(strip=True)
print(website,address)
if __name__ == '__main__':
url = "https://www.cv-library.co.uk/companies/agencies/A"
for item,link in collect_links(url):
get_info(item,link)
How can I return a string and a list from one function to another?
PS I would like to be stick to the design I've already tried.
Your websites is a list with a single element string, not a string as you've enclosed it in [] literal. You need to drop [] to make it a string as no point making that a list.
After doing that, you can get the return value, and iterate over the links like:
if __name__ == '__main__':
url = "https://www.cv-library.co.uk/companies/agencies/A"
website, links = collect_links(url)
for link in links:
get_info(website, link)
Main error in the code is in this link.
website = [soup.select_one("p.company-profile-website > a").get("href")]
This only returns one value:
http://www.autoskills-uk.com
Your function should be:
def collect_links(link):
res = requests.get(link)
soup = BeautifulSoup(res.text, "lxml")
websites = [x.get("href") for x in soup.select("p.company-profile-website > a")] #<============== Changed
items = [urljoin(url,item.get("href")) for item in soup.select("[id^='company-'] .search-companies-result-info h2 > a")]
return zip(websites, items)
Return as zip of websites and items.
Now you can list unpack item and link in the for loop:
if __name__ == '__main__':
url = "https://www.cv-library.co.uk/companies/agencies/A"
for item,link in collect_links(url):
get_info(item,link)
You are returning two lists, one with one element and another, with many elements as a tuple, and try to iterator over this tuple, unpacking each list into two elements item and link.
I don't see, what you really want to do, but you should separate for-loop and return values:
website, links = collect_links(url)
for link in links:
get_info(website[0], link)