This question already has an answer here:
Python change Accept-Language using requests
(1 answer)
Closed 6 years ago.
I am using requests and bs4 to scrape some data from a Chinese website that also has an English version. I wrote this to see if I get the right data:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://dotamax.com/hero/rate/')
soup = BeautifulSoup(page.content, "lxml")
for i in soup.find_all('span'):
    print(i.text)
And I do; the only problem is that the text is in Chinese, although it is in English when I look at the page source. Why do I get Chinese instead of English, and how can I fix that?
The website appears to check the incoming request for an Accept-Language header. If the request doesn't include one, it serves the Chinese version. This is an easy fix, though: pass custom headers as described in the requests documentation:
import requests
from bs4 import BeautifulSoup
headers = {'Accept-Language': 'en-US,en;q=0.8'}
page = requests.get('http://dotamax.com/hero/rate/', headers=headers)
soup = BeautifulSoup(page.content, "lxml")
for i in soup.find_all('span'):
    print(i.text)
produces:
Anti-Mage
Axe
Bane
Bloodseeker
Crystal Maiden
Drow Ranger
...
etc.
Usually when a page shows up differently in your browser than in the requests content, it has to do with the type of request and the headers you're using. One really useful web-scraping tip that I wish I had learned much earlier: if you hit F12 and go to the "Network" tab in Chrome or Firefox, you can see exactly what your browser sends and receives, which gives you a lot of useful information for debugging.
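For example, once you find the request in the Network tab, you can copy the headers the browser sent into a dict and replicate them in requests (a sketch; the header values below are illustrative rather than the exact ones your browser uses):
import requests
# Headers as copied from the browser's Network tab (illustrative values)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept-Language': 'en-US,en;q=0.8',
}
page = requests.get('http://dotamax.com/hero/rate/', headers=headers)
print(page.status_code)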
You have to tell the server which language you want via the HTTP headers:
import requests
from bs4 import BeautifulSoup
headers = {'Accept-Language': 'en-US'}
page = requests.get('http://dotamax.com/hero/rate/', headers=headers)
soup = BeautifulSoup(page.content, "html5lib")
for i in soup.find_all('span'):
    print(i.text)
Hello, I am trying to use Beautiful Soup and requests to log the data coming from an anemometer that updates live every second. The link to the website is here:
http://88.97.23.70:81/
The piece of data I want to scrape is the value highlighted in purple when I inspect the HTML in my browser.
I have written the code below to try to print out the data; however, when I run it, it prints None. I think this means the soup object doesn't in fact contain the whole HTML page? Upon printing soup.prettify() I cannot find the same id=js-2-text that I see when inspecting the HTML in my browser. If anyone has any ideas why this might be or how to fix it, I would be most grateful.
from bs4 import BeautifulSoup
import requests
wind_url='http://88.97.23.70:81/'
r = requests.get(wind_url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
print(soup.find(id='js-2-text'))
All the best,
Brendan
The data is loaded from an external URL with JavaScript, so it never appears in the HTML that requests receives. You can instead call the API URL the page is connecting to:
import requests
from bs4 import BeautifulSoup
api_url = "http://88.97.23.70:81/cgi-bin/CGI_GetMeasurement.cgi"
data = {"input_id": "1"}
soup = BeautifulSoup(requests.post(api_url, data=data).content, "html.parser")
# The response wraps a comma-separated measurement string in a <csv> tag
_, direction, metres_per_second, *_ = soup.csv.text.split(",")
# Convert metres per second to knots (1 m/s ≈ 1.9438445 knots)
knots = float(metres_per_second) * 1.9438445
print(direction, metres_per_second, knots)
Prints:
210 006.58 12.79049681
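Since the device updates once per second, you could poll that endpoint in a loop to log the readings (a minimal sketch building on the snippet above; the one-second interval and the CSV field order are assumptions based on the question and the sample output):
import time
import requests
from bs4 import BeautifulSoup

api_url = "http://88.97.23.70:81/cgi-bin/CGI_GetMeasurement.cgi"
data = {"input_id": "1"}

while True:
    soup = BeautifulSoup(requests.post(api_url, data=data).content, "html.parser")
    # Fields assumed from the sample output: direction, then wind speed in m/s
    _, direction, metres_per_second, *_ = soup.csv.text.split(",")
    print(direction, metres_per_second)
    time.sleep(1)  # the anemometer reportedly updates every second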
I am using Selenium for web scraping and would like to use Beautiful Soup instead, but I am new to this library. I want to get all the company names and times and then jump to the next page.
Please find my code using Selenium first:
driver.get('http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml')
while True:
    links = [link.get_attribute('href') for link in driver.find_elements_by_xpath('//*[@class="sibian"]/tbody/tr/td/table[2]/tbody/tr/td[2]/a')]
    for link in links:
        driver.get(link)
        driver.implicitly_wait(10)
        windows = driver.window_handles
        driver.switch_to.window(windows[-1])
        time = driver.find_element_by_xpath('//*[@class="con_bj"]/table[3]/tbody/tr/td/publishtime').text
        company = driver.find_element_by_xpath('//*[@class="title_A"]').text
        driver.back()
    if len(links) < 20:
        break
I tried doing the same with BeautifulSoup:
from bs4 import BeautifulSoup
import requests
html='http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml'
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('td'):
    num = link.find('a').get('href')
    print(num)
But I get nothing and am stuck at the first step.
Could you please help with that?
You are not making a request. BeautifulSoup is not an HTTP request library; it is just a parser. Think of driver.get() as the counterpart of requests.get() (yes, I know they are not the same, but the analogy helps understanding). You need to do something like this:
from bs4 import BeautifulSoup
import requests
html_link='http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml'
html = requests.get(html_link).text
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('td'):
    a = link.find('a')
    if a:  # not every <td> contains a link
        print(a.get('href'))
This will allow you to further debug your code. It MAY NOT work as-is, since some sites require specific headers, such as a User-Agent, and automatically reject requests without them. Requests is a very easy library (subjectively, of course) to work with and has a lot of support on this site. To save some head-scratching: if the site requires JavaScript, Selenium or some variant is the best option.
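For example, you could send a browser-like User-Agent along with the request (a sketch; the header value below is illustrative, and this particular site may or may not require it):
from bs4 import BeautifulSoup
import requests

html_link = 'http://www.csisc.cn/zbscbzw/isinbm/index_list_code.shtml'
# Illustrative browser-like User-Agent string; any recent browser's value will do
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
html = requests.get(html_link, headers=headers).text
soup = BeautifulSoup(html, 'html.parser')
print(soup.title)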
This question already has answers here:
Scraping YouTube links from a webpage
(3 answers)
Closed 2 years ago.
I am scraping YouTube search results using the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.youtube.com/results?search_query=python"
response = requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')
for each in soup.find_all("a", class_="yt-simple-endpoint style-scope ytd-video-renderer"):
    print(each.get('href'))
but it returns nothing. What is wrong with this code?
BeautifulSoup is not the right tool for YouTube scraping - YouTube generates much of its content with JavaScript.
You can easily test it:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> url = "https://www.youtube.com/results?search_query=python"
>>> response = requests.get(url)
>>> soup = BeautifulSoup(response.content,'html.parser')
>>> soup.find_all("a")
[About, Press, Copyright, Contact us, Creators, Advertise, Developers, Terms, Privacy, Policy and Safety, Test new features]
(note that the video links you would see in the browser are not present in the list)
You need another solution for that - Selenium might be a good choice. Please have a look at this thread for details: Fetch all href link using selenium in python
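A minimal Selenium sketch along those lines (the CSS selector is an assumption; YouTube's markup changes frequently, so verify it in the inspector first):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.youtube.com/results?search_query=python")
driver.implicitly_wait(10)  # give the JavaScript-rendered results time to load
# "a#video-title" is an assumed selector for the result links
for link in driver.find_elements(By.CSS_SELECTOR, "a#video-title"):
    print(link.get_attribute("href"))
driver.quit()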
I am currently going through the 'Automate the Boring Stuff' Udemy course, lesson '40. Parsing HTML with the Beautiful Soup Module'. Partway through the lesson, Al uses requests to fetch the HTML of an Amazon page and uses soup.select with the price's selector to print it out. I am trying to do the same with the exact same code, except for the use of headers, which seems to be necessary; otherwise I get a server error. I have read through some similar questions, and the general solution seems to be to find the source of the data using the network panel. Unfortunately, I have no clue how to do that :/
import requests
import bs4
headers = {'User-Agent': 'Chrome'}
url = 'https://www.amazon.com/Automate-Boring-Stuff-Python-Programming-ebook/dp/B00WJ049VU/ref=tmm_kin_swatch_0?_encoding=UTF8&qid=&sr='
res = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(res.text, features='html.parser')
print(soup.select('#mediaNoAccordion > div.a-row > div.a-column.a-span4.a-text-right.a-span-last > span.a-size-medium.a-color-price.header-price'))
You need to use a more forgiving parser. You can also use a much shorter and more robust selector.
import requests
import bs4
headers = {'User-Agent': 'Chrome'}
url = 'https://www.amazon.com/Automate-Boring-Stuff-Python-Programming-ebook/dp/B00WJ049VU/ref=tmm_kin_swatch_0?_encoding=UTF8&qid=&sr='
res = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(res.text, features='lxml')
print(soup.select_one('.mediaTab_subtitle').text.strip())
To find selectors yourself, open the browser's inspector (inspect element), click the arrow icon in the top-left corner to activate element picking, then hover over the element on the page and click it. With the element selected, you can copy its XPath, CSS selector, class, or id and use it directly, as in the sketch below.
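For example, a CSS selector copied from the inspector can be passed straight to select_one (a minimal sketch; the selector below is purely illustrative, and Amazon's markup changes often):
import requests
import bs4

headers = {'User-Agent': 'Chrome'}
url = 'https://www.amazon.com/Automate-Boring-Stuff-Python-Programming-ebook/dp/B00WJ049VU/ref=tmm_kin_swatch_0?_encoding=UTF8&qid=&sr='
res = requests.get(url, headers=headers)
soup = bs4.BeautifulSoup(res.text, features='lxml')
# Paste the selector you copied from the inspector here (illustrative value)
element = soup.select_one('#mediaNoAccordion span.a-color-price')
print(element.text.strip() if element else 'selector not found')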
I am trying to scrape some data from a specific website using the requests and Beautiful Soup libraries. Unfortunately, I am not receiving the HTML for that page, but for the parent page https://salesweb.civilview.com. Thank you for your help!
import requests
from bs4 import BeautifulSoup
example="https://salesweb.civilview.com/Sales/SaleDetails?PropertyId=473016965"
exampleGet=requests.get(example)
exampleGetText=exampleGet.text
soup = BeautifulSoup(exampleGetText,"lxml")
soup
You need to feed a cookie to the request:
import requests
from bs4 import BeautifulSoup
cookie = {'ASP.NET_SessionId': 'rk2b0dxast1eyu5jvxezltgh'}
example="https://salesweb.civilview.com/Sales/SaleDetails?PropertyId=473016964"
exampleGet=requests.get(example, cookies=cookie)
exampleGetText=exampleGet.text
soup = BeautifulSoup(exampleGetText,"lxml")
soup.title
<title>Sales Listing Detail</title>
That particular cookie may not work for you, so you'll need to manually navigate to that page once, then open the developer tools (web inspector) in your browser and look up the cookie under "Headers" in the Network tab. My cookie looked like 'ASP.NET_SessionId=rk2b0dxast1eyu5jvxezltgh'.
The cookie should be valid for other property pages as well.
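Alternatively, instead of hard-coding the cookie, a requests.Session can pick up a fresh session cookie automatically by visiting the parent page first (a sketch, assuming the server sets the ASP.NET session cookie on that first response):
import requests
from bs4 import BeautifulSoup

with requests.Session() as session:
    # The first request establishes the ASP.NET_SessionId cookie for the session
    session.get("https://salesweb.civilview.com")
    example = "https://salesweb.civilview.com/Sales/SaleDetails?PropertyId=473016964"
    soup = BeautifulSoup(session.get(example).text, "lxml")
    print(soup.title)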