I'm trying to scrape word definitions, but I can't get Python to follow the redirect to the correct page. For example, I'm trying to get the definition for the word 'agenesia'. When you open https://www.lexico.com/definition/agenesia in a browser, the page that loads is https://www.lexico.com/definition/agenesis; in Python, however, the page doesn't redirect and gives a 200 status code.
URL = 'https://www.lexico.com/definition/agenesia'
page = requests.head(URL, allow_redirects=True)
This is how I'm currently retrieving the page content. I've also tried using requests.get, but that doesn't work either.
EDIT: Because it isn't clear, I'm aware that I could change the word to 'agenesis' in the URL to get the correct page, but I am scraping a list of words and would rather follow the redirect automatically than look up each redirect in a browser by hand first.
EDIT 2: I realised it might be easier to check solutions against the rest of my code; so far this works with 'agenesis' but not 'agenesia':
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.find("span", {"class": "ind"}).get_text(), '\n')
print(soup.find("span", {"class": "pos"}).get_text())
The other answers mentioned here don't make your request redirect. The cause is that you didn't use the correct request headers. Try the code below:
import requests
from bs4 import BeautifulSoup
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
}
page = requests.get('https://www.lexico.com/definition/agenesia', headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
print(page.url)
print(soup.find("span", {"class": "ind"}).get_text(), '\n')
print(soup.find("span", {"class": "pos"}).get_text())
And it prints:
https://www.lexico.com/definition/agenesis?s=t
Failure of development, or incomplete development, of a part of the body.
noun
You are doing a HEAD request.
The HTTP HEAD method requests the headers that would be returned if the HEAD request's URL was instead requested with the HTTP GET method.
You want to do
URL = 'https://www.lexico.com/definition/agenesia'
page = requests.get(URL, allow_redirects=True)
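If you want to confirm the redirect was actually followed, you can inspect the response afterwards (a small sketch; response.history lists any intermediate redirect responses):
import requests

URL = 'https://www.lexico.com/definition/agenesia'
page = requests.get(URL, allow_redirects=True)
print(page.url)                                # final URL after any redirects
print([r.status_code for r in page.history])   # e.g. [301] if a redirect happened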
If you don't mind a pop-up window, Selenium is really good for scraping at a more user-friendly level. If you know the selector of the page element, you can scrape it with driver.find_element_by_css_selector('the selector').text, where driver = webdriver.Chrome('path to chromedriver').
This is a pretty radical circumvention of the problem, so I understand if it's not applicable to your specific situation, but hopefully you find this answer useful. :)
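A minimal sketch of that approach (the chromedriver path is assumed, and span.ind is simply the selector from the question's code, not verified against the page):
from selenium import webdriver

driver = webdriver.Chrome('/path/to/chromedriver')   # assumed driver location
driver.get('https://www.lexico.com/definition/agenesia')

# The browser follows the redirect itself, so current_url shows the final page.
print(driver.current_url)

# Selector borrowed from the question's BeautifulSoup code.
element = driver.find_element_by_css_selector('span.ind')
print(element.text)

driver.quit()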
This works as expected:
>>> import requests
>>> url = 'https://www.lexico.com/definition/agenesia'
>>> requests.get(url, allow_redirects=True).url
'https://www.lexico.com/search?filter=en_dictionary&query=agenesia'
This is the same URL that other tools find. For example, with curl:
$ curl -v https://www.lexico.com/definition/agenesia 2>&1 | grep location:
< location: https://www.lexico.com/search?filter=en_dictionary&query=agenesia
Hi, I want to get the text (the number 18) from the em tag shown in the picture above.
When I ran my code, it did not work and gave me only an empty list. Can anyone help me? Thank you~
Here is my code.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://blog.naver.com/kwoohyun761/221945923725'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
When you disable JavaScript, you'll see that the like count is loaded dynamically, so you have to use a service that renders the website; then you can parse the content.
You can use an API: https://www.scraperapi.com/
Or run your own for example: https://github.com/scrapinghub/splash
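For example, with a local Splash instance (a rough sketch; it assumes Splash is running on localhost:8050 as described in its README, and reuses the selector from the question):
import requests
from bs4 import BeautifulSoup

# Ask Splash to render the page (including JavaScript) and return the HTML.
splash_url = 'http://localhost:8050/render.html'
params = {'url': 'https://blog.naver.com/kwoohyun761/221945923725', 'wait': 2}

html = requests.get(splash_url, params=params).text
soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)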
EDIT:
First of all, I missed that you were using urlopen incorrectly; the correct way is described here: https://docs.python.org/3/howto/urllib2.html (assuming you are using Python 3, which seems to be the case judging by the print statement).
Furthermore, looking at the issue again, it is a bit more complicated. When you look at the source code of the page, it actually loads an iframe, and that iframe holds the actual content. Hit Ctrl + U to see the source code of the original URL, since the site seems to block the browser context menu.
So in order to achieve your crawling objective, you have to first grab the initial page and then grab the page you are interested in:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# original url
url = "https://blog.naver.com/kwoohyun761/221945923725"

with urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
iframe = soup.find('iframe')

# iframe grabbed, construct real url
print(iframe['src'])
real_url = "https://blog.naver.com" + iframe['src']

# do your crawling
with urlopen(real_url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
You might be able to avoid one round trip by analyzing the original URL and the URL in the iframe. At first glance, it looks like the iframe URL can be constructed from the original URL.
You'll still need a rendered version of the iframe URL to grab your desired value.
I don't know what this site is about, but it seems they do not want to be crawled, so maybe you should respect that.
I am currently going through the 'Automate the Boring Stuff' Udemy course, lesson '40. Parsing HTML with the Beautiful Soup Module'. Partway through the lesson, Al uses requests to fetch the HTML of an Amazon page and uses soup.select with the price's selector in order to print it out. I am currently trying to do that with the exact same code, except for the use of headers, which seems to be necessary, otherwise I get a server error. I have read through some similar questions and the general solution seems to be to find the source for the data using the network panel. Unfortunately I have no clue how to do that :/
import requests
import bs4
headers = {'User-Agent': 'Chrome'}
url = 'https://www.amazon.com/Automate-Boring-Stuff-Python-Programming-ebook/dp/B00WJ049VU/ref=tmm_kin_swatch_0?_encoding=UTF8&qid=&sr='
res = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(res.text, features='html.parser')
print(soup.select('#mediaNoAccordion > div.a-row > div.a-column.a-span4.a-text-right.a-span-last > span.a-size-medium.a-color-price.header-price'))
You need to use a more forgiving parser. You can also use a much shorter and more robust selector.
import requests
import bs4
headers = {'User-Agent': 'Chrome'}
url = 'https://www.amazon.com/Automate-Boring-Stuff-Python-Programming-ebook/dp/B00WJ049VU/ref=tmm_kin_swatch_0?_encoding=UTF8&qid=&sr='
res = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(res.text, features='lxml')
print(soup.select_one('.mediaTab_subtitle').text.strip())
For finding selectors, you can open the browser's inspect-element tool, click the arrow icon in the top-left corner to activate element picking, then hover over the element and select it. After selecting, you can copy its XPath, CSS selector, class, or id.
I'm trying to extract the Coronavirus numbers from a website (https://www.trackcorona.live), but I got an error.
This is my code:
response = requests.get('https://www.trackcorona.live')
data = BeautifulSoup(response.text,'html.parser')
li = data.find_all(class_='numbers')
confirmed = int(li[0].get_text())
print('Confirmed Cases:', confirmed)
It gives the following error (though it was working a few days back) because it is returning an empty list (li):
IndexError
Traceback (most recent call last)
<ipython-input-15-7a09f39edc9d> in <module>
2 data=BeautifulSoup(response.text,'html.parser')
3 li=data.find_all(class_='numbers')
----> 4 confirmed = int(li[0].get_text())
5 countries = li[1].get_text()
6 dead = int(li[3].get_text())
IndexError: list index out of range
Actually, the site is generating a redirect behind Cloudflare, and the content is then loaded dynamically via JavaScript once the page loads. Therefore we could use several approaches, such as selenium or requests_html, but I will mention the quickest solution, since it renders the JS on the fly :)
import cloudscraper
from bs4 import BeautifulSoup
scraper = cloudscraper.create_scraper()
html = scraper.get("https://www.trackcorona.live/").text
soup = BeautifulSoup(html, 'html.parser')
confirmed = soup.find("a", id="valueTot").text
print(confirmed)
Output:
110981
A tip for the 503 response code:
Basically, that code means the service is unavailable.
More technically, the GET request which you sent couldn't be served. The reason is that the request got stuck between the receiver of the request, which is https://www.trackcorona.live/, and another resource on the same host, which is https://www.trackcorona.live/?__cf_chl_jschl_tk__=
where __cf_chl_jschl_tk__= holds a token to be authenticated.
So you usually have to follow up in your code and serve the host the required data.
Something like the following shows the end URL:
import requests
from bs4 import BeautifulSoup


def Main():
    with requests.Session() as req:
        url = "https://www.trackcorona.live"
        r = req.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        redirect = f"{url}{soup.find('form', id='challenge-form').get('action')}"
        print(redirect)


Main()
Output:
https://www.trackcorona.live/?__cf_chl_jschl_tk__=575fd56c234f0804bd8c87699cb666f0e7a1a114-1583762269-0-AYhCh90kwsOry_PAJXNLA0j6lDm0RazZpssum94DJw013Z4EvguHAyhBvcbhRvNFWERtJ6uDUC5gOG6r64TOrAcqEIni_-z1fjzj2uhEL5DvkbKwBaqMeIZkB7Ax1V8kV_EgIzBAeD2t6j7jBZ9-bsgBBX9SyQRSALSHT7eXjz8r1RjQT0SCzuSBo1xpAqktNFf-qME8HZ7fEOHAnBIhv8a0eod8mDmIBDCU2-r6NSOw49BAxDTDL57YAnmCibqdwjv8y3Yf8rYzm2bPh74SxVc
Now, to be able to call the end URL, you need to pass the required form data.
Something like this:
def Main():
    with requests.Session() as req:
        url = "https://www.trackcorona.live"
        r = req.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        redirect = f"{url}{soup.find('form', id='challenge-form').get('action')}"
        data = {
            'r': 'none',
            'jschl_vc': 'none',
            'pass': 'none',
            'jschl_answer': 'none'
        }
        r = req.post(redirect, data=data)
        print(r.text)


Main()
Here you will end up with text that doesn't contain your desired values, because those values are rendered via JS.
That site is covered by Cloudflare DDoS protection, so the HTML returned is a Cloudflare page stating this, not the content you want. You will need to navigate that first, presumably by getting and setting some cookies, etc.
As an alternative, I recommend taking a look at Selenium. It drives a browser and will execute any JS on the page, and it should get you past this much more easily if you are just starting out.
Hope that helps!
The website is now protected with Cloudflare DDoS protection, so it cannot be accessed directly with Python requests.
You can try https://github.com/Anorov/cloudflare-scrape, which bypasses this page. The pip package is named cfscrape.
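A minimal sketch of how that library is typically used (the id="valueTot" selector is borrowed from the other answer above; the rest is standard cfscrape usage):
import cfscrape
from bs4 import BeautifulSoup

# create_scraper() returns a requests.Session-like object that solves the
# Cloudflare challenge before handing back the page.
scraper = cfscrape.create_scraper()
html = scraper.get("https://www.trackcorona.live/").content

soup = BeautifulSoup(html, 'html.parser')
print(soup.find("a", id="valueTot").text)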
I'm trying to extract the sector of a stock for a ML classification project. If I go to the following page:
https://www.six-swiss-exchange.com/shares/security_info_en.html?id=CH0012221716CHF4
I get (on the screen) some information about this stock (it changes with the id code; I just picked the first one from the list). However, none of the information is available via a regular request. (The HTML page contains mostly JavaScript functions.)
What I need is on the "Shares Details" tab (ICB Supersector, at the bottom of the page). Once again, nothing is available with a regular request. I looked into what happens when I click this tab, and the desired request is inside the URL:
http://www.six-swiss-exchange.com/shares/info_details_en.html?id=CH0210483332CHF4&portalSegment=EQ&dojo.preventCache=1520360103852 HTTP/1.1
However, if I use this URL directly, I get a 403 error from requests, although it works from a browser. I usually don't have any problems with this sort of thing, but in this case, do I have to submit cookies or some other information to access that page? No login is required, and it can be easily accessed from any browser.
I am thinking 1) make a first request to the url that works, 2) store the cookie they send you (I don't know how to do that really) and 3) make a second request to the desired url. Would this work?
I tried using requests.Session(), but I'm not sure if this is the solution or if I implemented it properly.
If anyone has dealt with that sort of problem, I would love any pointers in solving this. Thanks.
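For reference, a bare sketch of that three-step idea with requests.Session, which stores and resends any cookies automatically; the detail URL and parameters are copied from the question, and whether this alone satisfies the server is untested:
import requests

with requests.Session() as session:
    # 1) First request to the page that works; any cookies the server sets
    #    are stored on the session automatically.
    session.get('https://www.six-swiss-exchange.com/shares/security_info_en.html?id=CH0012221716CHF4')

    # 2) Those cookies are attached to every later request on this session.
    # 3) Second request to the detail URL from the question.
    detail = session.get(
        'https://www.six-swiss-exchange.com/shares/info_details_en.html',
        params={'id': 'CH0012221716CHF4', 'portalSegment': 'EQ'}
    )
    print(detail.status_code)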
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.six-swiss-exchange.com'


def get_page_html(isin):
    params = {
        'id': isin,
        'portalSegment': 'EQ'
    }
    r = requests.get(
        '{}/shares/info_details_en.html'.format(BASE_URL),
        params=params
    )
    r.raise_for_status()
    return r.text


def get_supersector_info(soup):
    supersector = soup.find('td', text='ICB Supersector').next_sibling.a
    return {
        'link': urljoin(BASE_URL, supersector['href']),
        'text': supersector.text
    }


if __name__ == '__main__':
    page_html = get_page_html('CH0012221716CHF4')
    soup = BeautifulSoup(page_html, 'lxml')
    supersector_info = get_supersector_info(soup)
    print(supersector_info['link'])
    print(supersector_info['text'])
Console:
https://www.six-swiss-exchange.com/search/quotes_en.html?security=C2700T
Industrial Goods & Services
I am trying to get the url and sneaker titles at https://stockx.com/sneakers.
This is my code so far:
in main.py
from bs4 import BeautifulSoup
from utils import generate_request_header
import requests
url = "https://stockx.com/sneakers"
html = requests.get(url, headers=generate_request_header()).content
soup = BeautifulSoup(html, "lxml")
print soup
in utils.py
def generate_request_header():
    header = BASE_REQUEST_HEADER
    header["User-Agent"] = random.choice(USER_AGENT_HEADER_LIST)
    return header
But whenever I print soup, I get the following output: https://pastebin.com/Ua6B6241. There doesn't seem to be any HTML extracted. How would I get it? Should I be using something like Selenium?
requests doesn't seem to be able to verify the SSL certificates. To temporarily bypass this error, you can use verify=False, i.e.:
requests.get(url, headers=generate_request_header(), verify=False)
To fix it permanently, you may want to read:
http://docs.python-requests.org/en/master/user/advanced/#ssl-cert-verification
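One common permanent fix (a sketch, assuming the problem really is an out-of-date CA bundle rather than something site-specific) is to point verify at an up-to-date certificate bundle such as the one shipped with certifi:
import certifi
import requests

# Use certifi's CA bundle instead of disabling verification entirely.
res = requests.get(url, headers=generate_request_header(), verify=certifi.where())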
I'm guessing the data you're looking for is at line 126 in the pastebin. I've never tried to extract the text of a script, but I'm sure it could be done.
In lxml, something like:
source_code.xpath('//script[@type="text/javascript"]') should return a list of all the scripts as objects.
Or, to try and get straight to the "tickers":
[i for i in source_code.xpath('//script[@type="text/javascript"]') if 'tickers' in i.xpath('string()')]
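A slightly fuller sketch of that idea (untested against the live page; the 'tickers' marker and the assumption that the data sits in an inline script both come from the pastebin mentioned above):
import requests
from lxml import html

headers = {'User-Agent': 'Chrome'}
url = 'https://stockx.com/sneakers'

res = requests.get(url, headers=headers)
source_code = html.fromstring(res.text)

# Grab the text of every inline <script type="text/javascript"> and keep the
# ones that mention 'tickers'.
scripts = source_code.xpath('//script[@type="text/javascript"]/text()')
ticker_scripts = [s for s in scripts if 'tickers' in s]
print(ticker_scripts)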