Web crawling error using Python at 'whoscored.com' - python

import requests
from bs4 import BeautifulSoup
user_agent = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
page = requests.get("https://www.whoscored.com/Regions/252/Tournaments/2/England-Premier-League", headers=user_agent)
soup = BeautifulSoup(page.text,'html.parser')
print(soup)
I'm trying to do webcrawling on 'whoscored.com' but I can't get all the HTML Tell me the solution.
Request unsuccessful. Incapsula incident ID: 946001050011236585-61439481461474967
this is result.

from selenium import webdriver
import time
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
url = 'https://www.whoscored.com/Regions/252/Tournaments/2/England-Premier-League'
sada = browser.get(url)
time.sleep(3)
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')
print(soup.prettify())
There is a couple of issues here. The root cause is that the website you are trying to scrape knows you're not a real person and is blocking you. Lots of websites do this simply by checking headers to see if a request is coming from a browser or not (robot). However, this site looks like they use Incapsula, which is designed to provide more sophisticated protection

Related

JavaScript is not available. (Python)

I am trying to scrape data from a Twitter webpage using Python but instead of getting the data back, I keep getting "Javascript is not available". I've enabled Javascript in my browser(chrome) but nothing changes.
Here is the error -->
<h1>JavaScript is not available.</h1>
<p>We’ve detected that JavaScript is disabled in this browser. Please enable JavaScript or switch to a supported browser to continue using twitter.com. You can see a list of supported browsers in our Help Center.</p>
Here is the code -->
from bs4 import BeautifulSoup
import requests
url = "https://twitter.com/search?q=%23developer%20advocate&src=typed_query&f=user"
source_code = requests.get(url).text
soup = BeautifulSoup(source_code, "lxml")
content = soup.find("div")
print(content)
I've tried enabling Javascript in my browser(chrome), I expected to get the required data back instead the error "Javascript is not availble" persists.
I will never advise scraping twitter by violating their policies, you should use an API instead! But for the Javascript part, just pass user agent in headers in your requests.
from bs4 import BeautifulSoup
import requests
user_agent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
headers = {'User-Agent': user_agent}
url = "https://twitter.com/search?q=%23developer%20advocate&src=typed_query&f=user"
source_code = requests.get(url, headers=headers).text
soup = BeautifulSoup(source_code, "lxml")
content = soup.find("div")
print(content)

While webscrapping this error shows Not Acceptable! An appropriate representation of the requested resource could not be found on this server

I am trying to scrape data from a website but it shows this error. I don't know how to fix this.
b'<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>'
This is my code
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
page = requests.get(url).content
page
Output
You need to add user-agent and it works.
If you do not put user-agent of some browser, the site thinks that you are bot and block you.
from bs4 import BeautifulSoup
import requests
url = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"}
page = requests.get(url, headers=headers).content
print(page)

BeautifulSoup will only return None

Im learning Beautiful Soup and I dont know what I could be doing wrong, I'm using soup.find on an id, and Ive tried this on multiple different sites, and I run it and it always returns None.
import requests
from bs4 import BeautifulSoup
site = 'https://www.amazon.com/PlayStation-5-Console/dp/B09DFCB66S'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
def stock_check():
page = requests.get(site, headers = headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('span', id = 'productTitle')
print(title)
stock_check()
There are 3 errors in your code:
1.incorrect locator
2.not invoking text
3.not inject cookies
Now your code is working fine:
import requests
from bs4 import BeautifulSoup
site = 'https://www.amazon.com/PlayStation-5-Console/dp/B09DFCB66S'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
cookies={'session':'141-2320098-4829807'}
def stock_check():
page = requests.get(site, headers = headers,cookies=cookies)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('span', attrs={'id':'productTitle'})
print(title.get_text(strip=True))
stock_check()
Output:
PlayStation 5 Console
The answers of HedgeHog and Fazlul are of course correct, but I want to comment on this.
When you scrape something from the web and try to extract tags from the HTML but get nothing, first check the whole HTML document you recieved to make sure it's what you expected. Personally I just print out soup.prettify() to debug this, as explained in BeautifulSoup's Quick Start:
Another nifty trick if the HTML is impractical to read is to paste it into a HTML previewer like this one, and we get the answer quickly.
BeautifulSoup can be a great tool, but you need to pay attention when using it.

BeautifulSoup Find periodically returns None

I am trying to get a value from a class. From time to time, find returns the value I need, but another time it no longer works.
Code:
import requests
from bs4 import BeautifulSoup
url = 'https://beru.ru/catalog/molotyi-kofe/76321/list'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
item_count = (soup.find('div', class_='_2StYqKhlBr')).text.split()[4]
print(item_count)
The reason why that you get the values sometimes and sometimes not. That's because the website is protected by CAPTCHA
So when the request is blocked by CAPTCHA
It's became like the following:
https://beru.ru/showcaptcha?retpath=https://beru.ru/catalog/molotyi-kofe/76321/list?ncrnd=4561_aa1b86c2ca77ae2b0831c4d95b9d85a4&t=0/1575204790/b39289ef083d539e2a4630548592a778&s=7e77bfda14c97f6fad34a8a654d9cd16
You can verify by parse the response content:
import requests
from bs4 import BeautifulSoup
r = requests.get(
'https://beru.ru/catalog/molotyi-kofe/76321/list')
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.findAll('div', attrs={'class': '_2StYqKhlBr _1wAXjGKtqe'}):
print(item)
for item in soup.findAll('div', attrs={'class': 'captcha__image'}):
for captcha in item.findAll('img'):
print(captcha.get('src'))
And you will get the CAPTCHA image link:
https://beru.ru/captchaimg?aHR0cHM6Ly9leHQuY2FwdGNoYS55YW5kZXgubmV0L2ltYWdlP2tleT0wMEFMQldoTnlaVGh3T21WRmN4NWFJRUdYeWp2TVZrUCZzZXJ2aWNlPW1hcmtldGJsdWU,_0/1575206667/b49556a86deeece9765a88f635c7bef2_df12d7a36f0e2d36bd9c9d94d8d9e3d7

Why can't I scrape Amazon by BeautifulSoup?

Here is my python code:
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup
it works for google.com and many other websites, but it doesn't work for amazon.com.
I can open amazon.com in my browser, but the resulting "soup" is still none.
Besides, I find that it cannot scrape from appannie.com, either. However, rather than give none, the code returns an error:
HTTPError: HTTP Error 503: Service Temporarily Unavailable
So I doubt whether Amazon and App Annie block scraping.
Add a header, then it will work.
from bs4 import BeautifulSoup
import requests
url = "http://www.amazon.com/"
# add header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")
print soup
You can try this:
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup
In python arbitrary text is called a string and it must be enclosed in quotes(" ").
I just ran into this and found that setting any user-agent will work. You don't need to lie about your user agent.
response = HTTParty.get #url, headers: {'User-Agent' => 'Httparty'}
Add a header
import urllib2
from bs4 import BeautifulSoup
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}
page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup

Categories

Resources