Python web scraping (requests.get) can't get target web page - python

I am using Python requests.get to scrape a website. It worked fine yesterday, but it started failing today.
I can no longer get the page content from requests.text. Instead, the response now reads: "This process is automatic. Your browser will redirect to your requested content shortly. Please allow up to 5 seconds."
To work around this, I tried putting a time.sleep() call inside requests.get, but it didn't help; I still get the same "allow up to 5 seconds" response as before.
Here's my code below:
import time
import requests
from bs4 import BeautifulSoup

def web_scraping_corporationwiki(df):
    # Scrape website
    df_scrape = df.reset_index()
    for i in df_scrape.index:
        name = df_scrape.loc[i, 'Name']
        state = df_scrape.loc[i, 'State']
        origin_row = df_scrape.loc[i]
        url = 'https://www.corporationwiki.com/search/withfacets?term=' + name + '&stateFacet=' + state
        for page_num in range(8):
            req = requests.get(url, time.sleep(50), params=dict(query="wiki", page=page_num), timeout=100)
            soup = BeautifulSoup(req.text, "html.parser")
            page = soup.find_all('div', class_='list-group-item')
Can someone help me with this problem?
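For what it's worth, the "Please allow up to 5 seconds" text is the Cloudflare JavaScript challenge page (the same situation as the trackcorona question further down), so sleeping before or inside requests.get cannot help; the challenge itself has to be solved. Below is a minimal sketch using cloudscraper, assuming corporationwiki.com sits behind the standard Cloudflare check; the search term and state are placeholders.

import cloudscraper
from bs4 import BeautifulSoup

# cloudscraper returns a requests.Session-like object that solves the JS challenge first
scraper = cloudscraper.create_scraper()
url = 'https://www.corporationwiki.com/search/withfacets?term=Acme&stateFacet=FL'
req = scraper.get(url, params=dict(query="wiki", page=0), timeout=100)
soup = BeautifulSoup(req.text, "html.parser")
rows = soup.find_all('div', class_='list-group-item')
print(len(rows))

Inside the function above, scraper.get() would simply replace the requests.get() call in the inner loop.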

Related

Web Scraper failing after 3rd page ('NoneType' object has no attribute 'find_all')

I've written a function to try and get the names of authors and their respective links from a sandbox website (https://quotes.toscrape.com/), which should move onto the next page when all have been covered.
It works for the first two pages but fails when moving onto the third with the error 'NoneType' object has no attribute 'find_all'.
Why would it break at the start of a new page when it has already moved between pages successfully?
Here's the function:
def AuthorLink(url):
    a = 0
    url = url
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    divContainer = soup.find("div", class_="container")
    divRow = divContainer.find_all("div", class_="row")
    for result in divRow:
        divQuotes = result.find_all("div", class_="quote")
        for quotes in divQuotes:
            for el in quotes.find_all("small", class_="author"):
                print(el.get_text())
            for link in quotes.find_all("a"):
                if link['href'][1:7] == "author":
                    print(url + link['href'])
    a += 1
    print("Page:", a)
    nav = soup.find("li", class_="next")
    nextPage = nav.find("a")
    AuthorLink(url + nextPage['href'])
Here's the code that it broke on:
5 soup = BeautifulSoup(page.content, "html.parser")
6 divContainer = soup.find("div", class_="container")
----> 7 divRow = divContainer.find_all("div", class_= "row")
I don't see why this is happening if it ran for the first two pages successfully.
I've checked the structure of the website and it seems to change very little from page to page.
I've also tried to change the code so that instead of using the link from "Next" at the bottom of the page, it just adds the number of the next page to the URL but this doesn't work either.
You are facing this error because each new request URL is being appended to the previous one. Over the iterations, the url value becomes:
"https://quotes.toscrape.com/", which works;
"https://quotes.toscrape.com/page/2/", which also works;
"https://quotes.toscrape.com/page/2//page/3/", which the website can't serve, so the third request fails.
The exact solution could be different, but here is a slightly changed version in my answer:
import requests
from bs4 import BeautifulSoup

base_url = "https://quotes.toscrape.com"

def AuthorLink(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    divContainer = soup.find("div", class_="container")
    divRow = divContainer.find_all("div", class_="row")[1]
    divQuotes = divRow.find_all("div", class_="quote")
    for quotes in divQuotes:
        for el in quotes.find_all("small", class_="author"):
            print(el.get_text())
        for link in quotes.find_all("a"):
            if link['href'][1:7] == "author":
                print(base_url + link['href'])

for i in range(1, 5):
    AuthorLink(f"{base_url}/page/{i}")
I have defined a new base_url to store the actual website link. The next page is always "/page/[i]", so we can use a for loop to generate i = 1, 2, 3, ... . The other change is print(base_url + link['href']), where you had used url instead of base_url, which again leads to the same URL-concatenation problem described above.
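As an alternative sketch (my own variation, not part of the original answer): keep following the "Next" link as the original function did, but resolve it with urllib.parse.urljoin so the relative /page/N/ path replaces the previous path instead of being appended to it.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def author_links(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    for quote in soup.find_all("div", class_="quote"):
        author = quote.find("small", class_="author").get_text()
        link = quote.find("a", href=lambda h: h and h.startswith("/author"))
        # urljoin resolves the root-relative href against the current page URL
        print(author, urljoin(url, link["href"]))
    nav = soup.find("li", class_="next")
    if nav:  # the last page has no "Next" button
        author_links(urljoin(url, nav.find("a")["href"]))

author_links("https://quotes.toscrape.com/")

Because urljoin("https://quotes.toscrape.com/page/2/", "/page/3/") yields "https://quotes.toscrape.com/page/3/", the double-path URL from the question can no longer occur.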

How to extract the Coronavirus cases from a website?

I'm trying to extract the Coronavirus case numbers from a website (https://www.trackcorona.live), but I got an error.
This is my code:
response = requests.get('https://www.trackcorona.live')
data = BeautifulSoup(response.text,'html.parser')
li = data.find_all(class_='numbers')
confirmed = int(li[0].get_text())
print('Confirmed Cases:', confirmed)
It gives the following error (though it was working a few days back) because find_all is returning an empty list (li):
IndexError
Traceback (most recent call last)
<ipython-input-15-7a09f39edc9d> in <module>
2 data=BeautifulSoup(response.text,'html.parser')
3 li=data.find_all(class_='numbers')
----> 4 confirmed = int(li[0].get_text())
5 countries = li[1].get_text()
6 dead = int(li[3].get_text())
IndexError: list index out of range
Well, actually the site is generating a redirect behind Cloudflare, and the content is then loaded dynamically via JavaScript once the page loads. We could therefore use several approaches, such as selenium or requests_html, but I'll mention the quickest solution, as we will render the JS on the fly :)
import cloudscraper
from bs4 import BeautifulSoup
scraper = cloudscraper.create_scraper()
html = scraper.get("https://www.trackcorona.live/").text
soup = BeautifulSoup(html, 'html.parser')
confirmed = soup.find("a", id="valueTot").text
print(confirmed)
Output:
110981
A tip for the 503 response code:
Basically, that code means the service is unavailable.
More technically, the GET request you sent couldn't be served: the request got stuck between the receiver of the request, https://www.trackcorona.live/, and another resource on the same host, https://www.trackcorona.live/?__cf_chl_jschl_tk__=
where __cf_chl_jschl_tk__= holds a token to be authenticated.
So your code usually has to follow up and serve the host the required data.
Something like the following shows the end URL:
import requests
from bs4 import BeautifulSoup

def Main():
    with requests.Session() as req:
        url = "https://www.trackcorona.live"
        r = req.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        redirect = f"{url}{soup.find('form', id='challenge-form').get('action')}"
        print(redirect)

Main()
Output:
https://www.trackcorona.live/?__cf_chl_jschl_tk__=575fd56c234f0804bd8c87699cb666f0e7a1a114-1583762269-0-AYhCh90kwsOry_PAJXNLA0j6lDm0RazZpssum94DJw013Z4EvguHAyhBvcbhRvNFWERtJ6uDUC5gOG6r64TOrAcqEIni_-z1fjzj2uhEL5DvkbKwBaqMeIZkB7Ax1V8kV_EgIzBAeD2t6j7jBZ9-bsgBBX9SyQRSALSHT7eXjz8r1RjQT0SCzuSBo1xpAqktNFf-qME8HZ7fEOHAnBIhv8a0eod8mDmIBDCU2-r6NSOw49BAxDTDL57YAnmCibqdwjv8y3Yf8rYzm2bPh74SxVc
Now, to be able to call the end URL, you need to pass the required form data.
Something like this:
def Main():
    with requests.Session() as req:
        url = "https://www.trackcorona.live"
        r = req.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        redirect = f"{url}{soup.find('form', id='challenge-form').get('action')}"
        data = {
            'r': 'none',
            'jschl_vc': 'none',
            'pass': 'none',
            'jschl_answer': 'none'
        }
        r = req.post(redirect, data=data)
        print(r.text)

Main()
Here you will end up with text that still lacks your desired values, because those values are rendered via JS.
That site is covered by Cloudflare DDoS protection, so the HTML returned is a Cloudflare page stating this, not the content you want. You will need to get past that first, presumably by getting and setting some cookies, etc.
As an alternative, I recommend taking a look at Selenium. It drives a real browser, executes any JS on the page, and should get you past this much more easily if you are just starting out.
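If you do go the Selenium route, a minimal sketch might look like this (assuming chromedriver is available on the PATH and reusing the valueTot element id from the answer above; Cloudflare may still challenge an automated browser, so treat it as a starting point):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()            # requires chromedriver on the PATH
driver.get("https://www.trackcorona.live/")
driver.implicitly_wait(10)             # give the challenge/JS time to render
confirmed = driver.find_element(By.ID, "valueTot").text
print("Confirmed Cases:", confirmed)
driver.quit()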
Hope that helps!
The website is now protected by Cloudflare DDoS protection, so it cannot be accessed directly with Python requests.
You can try https://github.com/Anorov/cloudflare-scrape, which bypasses this page. The pip package is named cfscrape.
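A minimal sketch with that package (its scraper object mirrors the requests API, so the original code barely changes; the numbers class is kept from the question and may need updating, as the valueTot selector in the other answer suggests):

import cfscrape
from bs4 import BeautifulSoup

scraper = cfscrape.create_scraper()    # a requests.Session subclass that solves the Cloudflare challenge
response = scraper.get('https://www.trackcorona.live')
data = BeautifulSoup(response.text, 'html.parser')
li = data.find_all(class_='numbers')
if li:
    print('Confirmed Cases:', int(li[0].get_text()))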

scraping h3 from div using python

Using Python 3.6, I'd like to scrape the H3 titles from within the DIVs on this page:
https://player.bfi.org.uk/search/rentals?q=&sort=title&page=1
Note that the page number increments by 1.
I'm struggling to return, or even identify, the titles.
from requests import get
url = 'https://player.bfi.org.uk/search/rentals?q=&sort=title&page=1'
response = get(url)
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'lxml')
type(html_soup)
movie_containers = html_soup.find_all('div', class_ = 'card card--rentals')
print(type(movie_containers))
print(len(movie_containers))
I've tried looping through them also:
for dd in page("div.card__content"):
    print(div.select_one("h3.card__title").text.strip())
Any help would be great.
Thanks,
I'm expecting the title of each film from each page, including a link to the film, e.g. https://player.bfi.org.uk/rentals/film/watch-akenfield-1975-online
The page loads its content via an XHR request to another URL, so you are missing it. You can mimic the XHR POST request the page uses and alter the JSON payload that is sent; if you change size, you get more results.
import requests
data = {"size":1480,"from":0,"sort":"sort_title","aggregations":{"genre":{"terms":{"field":"genre.raw","size":10}},"captions":{"terms":{"field":"captions"}},"decade":{"terms":{"field":"decade.raw","order":{"_term":"asc"},"size":20}},"bbfc":{"terms":{"field":"bbfc_rating","size":10}},"english":{"terms":{"field":"english"}},"audio_desc":{"terms":{"field":"audio_desc"}},"colour":{"terms":{"field":"colour"}},"mono":{"terms":{"field":"mono"}},"fiction":{"terms":{"field":"fiction"}}},"min_score":0.5,"query":{"bool":{"must":{"match_all":{}},"must_not":[],"should":[],"filter":{"term":{"pillar.raw":"rentals"}}}}}
r = requests.post('https://search-es.player.bfi.org.uk/prod-films/_search', json = data).json()
for film in r['hits']['hits']:
    print(film['_source']['title'], 'https://player.bfi.org.uk' + film['_source']['url'])
The actual result count for rentals is in the JSON as r['hits']['total'], so you can make an initial request starting with a number much higher than you expect, check whether another request is needed, and then gather any extras by altering from and size to mop up anything outstanding.
import requests
import pandas as pd

initial_count = 10000
results = []

def add_results(r):
    for film in r['hits']['hits']:
        results.append([film['_source']['title'], 'https://player.bfi.org.uk' + film['_source']['url']])

with requests.Session() as s:
    data = {"size": initial_count,"from":0,"sort":"sort_title","aggregations":{"genre":{"terms":{"field":"genre.raw","size":10}},"captions":{"terms":{"field":"captions"}},"decade":{"terms":{"field":"decade.raw","order":{"_term":"asc"},"size":20}},"bbfc":{"terms":{"field":"bbfc_rating","size":10}},"english":{"terms":{"field":"english"}},"audio_desc":{"terms":{"field":"audio_desc"}},"colour":{"terms":{"field":"colour"}},"mono":{"terms":{"field":"mono"}},"fiction":{"terms":{"field":"fiction"}}},"min_score":0.5,"query":{"bool":{"must":{"match_all":{}},"must_not":[],"should":[],"filter":{"term":{"pillar.raw":"rentals"}}}}}
    r = s.post('https://search-es.player.bfi.org.uk/prod-films/_search', json=data).json()
    total_results = int(r['hits']['total'])
    add_results(r)
    if total_results > initial_count:
        data['size'] = total_results - initial_count
        data['from'] = initial_count
        r = s.post('https://search-es.player.bfi.org.uk/prod-films/_search', json=data).json()
        add_results(r)

df = pd.DataFrame(results, columns=['Title', 'Link'])
print(df.head())
The issue you are having is not actually with finding the div; I think you are doing that correctly. However, when you try to access the website with
from requests import get
url = 'https://player.bfi.org.uk/search/rentals?q=&sort=title&page=1'
response = get(url)
the response doesn't actually include all the content you see in the browser. You can check that this is the case with 'card' in response.text, which comes back False. This is most likely because the cards are loaded via JavaScript after the page loads, so just fetching the basic content with the requests library is not sufficient to get all the information you want to scrape.
I suggest looking at how the website loads the cards; the Network tab in the browser dev tools might help.
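A quick way to run that check yourself (a small sketch; the substring is simply the card class from the original selector, and the expectation of False comes from the answer above):

import requests

url = 'https://player.bfi.org.uk/search/rentals?q=&sort=title&page=1'
response = requests.get(url)
# The server-rendered HTML should not contain the rental cards; they arrive via the XHR call.
print('card--rentals' in response.text)   # expected: False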

My for loop isn't being read in my GameStop scraper

I can't get this for loop to run and pick up the item listings; it just prints nothing at all and skips the whole loop.
import requests
import re
from bs4 import BeautifulSoup

maxPages = 10
keyword = "ps4"
costMax = 0
costMin = 0

def tradeSpiderGS(maxPages):
    page = 1
    while page <= maxPages:
        print(page)
        # creating url for soup
        if page <= 1:
            url = 'https://www.gamestop.com/browse?nav=16k-3-' + keyword + ',28zu0'
        else:
            url = 'https://www.gamestop.com/browse?nav=16k-3-' + keyword + ',2b' + str(page * 12) + ',28zu0'
        # creating soup object
        srcCode = requests.get(url)
        plainTxt = srcCode.text
        soup = BeautifulSoup(plainTxt, "html.parser")
        # this for loop is not being read, supposed to grab links on gs website
        for links in soup.find_all('a', {'class': 'ats-product-title-lnk'}):
            href = links.get('href')
            trueHref = 'https://www.gamestop.com/' + href
            print(trueHref)
        page += 1

tradeSpiderGS(maxPages)
Why Doesn't the Loop Run?
The loop doesn't run because soup.find_all('a', {'class': 'ats-product-title-lnk'}) is [] (there aren't any a elements with that class).
The reason there aren't any a elements with that class is that GameStop doesn't let you access the /browse pages unless you've visited a normal page first. You can confirm this by opening one of the URLs in a web browser in incognito mode.
Workarounds:
You can use a different scraping mechanism, such as Selenium in Python, to work around this. You might also be able to copy the headers from a web browser request into the requests.get call, although I wasn't able to get this to work.
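For the header-copying idea, the sketch below shows the general shape: copy the request headers your browser sends (DevTools → Network → the /browse request → Request Headers) into a dict and pass it to requests.get. The header values here are placeholders, and as noted above this approach did not work for the answer's author, so treat it as an experiment rather than a fix:

import requests
from bs4 import BeautifulSoup

headers = {
    # Placeholder values; replace with the headers copied from your own browser request.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9',
}
url = 'https://www.gamestop.com/browse?nav=16k-3-ps4,28zu0'
srcCode = requests.get(url, headers=headers)
soup = BeautifulSoup(srcCode.text, "html.parser")
print(len(soup.find_all('a', {'class': 'ats-product-title-lnk'})))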

Log in to Amazon Mechanical Turk using Python and Parse HITS

I am trying to use Python (2.7) to automatically log into Amazon Mechanical Turk and grab information about some of the HITs available. If you attempt to go past page 20, it requires a login, which is where I am having difficulty. I have attempted to use many Python packages, including mechanize and urllib2, and most recently I found a closely related solution on Stack Overflow using requests. I added the slight modifications needed for my context (see below), but the code is not working. The response page is again the login page, with an error displayed: "Your password is incorrect." Additionally, the code from the original post no longer works for its own context either; the same error is displayed. So I assume Amazon has changed something, and I cannot figure out what it is or how to fix it. Any help along this line would be very appreciated.
import bs4, requests

headers = {
    'User-Agent': 'Chrome'
}

from bs4 import BeautifulSoup

url = "https://www.mturk.com/mturk/viewhits?searchWords=&pageNumber=21" \
      "&searchSpec=HITGroupSearch%23T%232%23100%23-1%23T%23%21%23%21" \
      "LastUpdatedTime%211%21%23%21&sortType=LastUpdatedTime%3A1" \
      "&selectedSearchType=hitgroups"

with requests.Session() as s:
    s.headers = headers
    r = s.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    signin_data = {s["name"]: s["value"]
                   for s in soup.select("form[name=signIn]")[0].select("input[name]")
                   if s.has_attr("value")}
    signin_data[u'email'] = ''
    signin_data[u'password'] = ''
    for k, v in signin_data.iteritems():
        print k + ": " + v
    action = soup.find('form', id='ap_signin_form').get('action')
    response = s.post(action, data=signin_data)
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    warning = soup.find('div', {'id': 'message_error'})
    if warning:
        print('Failed to login: {0}'.format(warning.text))
