I am trying to find a word on a website via BeautifulSoup, but I can't seem to get it. This is my code so far:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
s = session.get('https://www.doctolib.de/institut/berlin/ciz-berlin-berlin?pid=practice-158431')
soup = BeautifulSoup(s.text, 'html.parser')
tags = soup.find_all(class_="dl-text dl-text-body dl-text-regular dl-text-s dl-text-color-inherit")
for i in tags:
    print(i.string)
See below for the picture of the specific HTML element. I am trying to search for and find "Keine Verfügbarkeiten".
Can anyone help me? The code I have used returns nothing.
Vaccine check
Although the content you are looking for on that site is generated dynamically, it is still available in a script tag in the page source (Ctrl + U). The following is one way to parse it using the requests module in combination with re and json.
import re
import json
import requests
url = "https://www.doctolib.de/institut/berlin/ciz-berlin-berlin?pid=practice-158431"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36',
}
res = requests.get(url, headers=headers)
script = re.search(r"window\.translation_keys[^{]+(.*?});",res.text).group(1)
items = json.loads(script)
print(items['root']['common']['availabilities']['no_availabilities_vaccination'])
Output:
Keine Verfügbarkeiten
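The non-greedy regex above relies on the first `});` in the page being the end of the object. A minimal alternative sketch, assuming the page still assigns a JSON object literal to window.translation_keys, lets json.JSONDecoder.raw_decode work out where the object ends. The helper name extract_assigned_json and the sample page source are made up for illustration.

```python
import json
import re

def extract_assigned_json(page_source, variable_name):
    """Return the JSON object assigned to `variable_name` in an inline script."""
    # Find "<variable_name> = " and stop right before the opening "{".
    match = re.search(re.escape(variable_name) + r"\s*=\s*(?=\{)", page_source)
    if match is None:
        raise ValueError(f"{variable_name} not found in page source")
    # raw_decode parses one complete JSON value and ignores whatever follows it.
    obj, _end = json.JSONDecoder().raw_decode(page_source, match.end())
    return obj

# Hypothetical, trimmed-down page source used only to exercise the helper.
sample_html = """
<script>
window.translation_keys = {"root": {"common": {"availabilities":
  {"no_availabilities_vaccination": "Keine Verfügbarkeiten"}}}};
window.other = {"x": "});"};
</script>
"""

keys = extract_assigned_json(sample_html, "window.translation_keys")
message = keys["root"]["common"]["availabilities"]["no_availabilities_vaccination"]
print(message)
```

With a live page you would pass res.text instead of sample_html.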
The page you are retrieving generates its content with JavaScript, so your GET request won't find what you are looking for; it only retrieves the raw page source ( view-source:https://www.doctolib.de/institut/berlin/ciz-berlin-berlin?pid=practice-158431 ) without any processing.
What you can do instead is run Selenium WebDriver, which acts like an actual browser: it executes the JavaScript and produces the page you see when opening the website in your browser.
Then, when you open your page using Selenium, you can find the element you are looking for using the find_element_by_css_selector() method (in Selenium 4 and later, find_element(By.CSS_SELECTOR, ...)).
If you don't want to use Selenium, you could instead check where the webpage is getting its data from. With a quick look I can see that it is querying this link to get availabilities data. With this method, you can just make a GET request to that API link and parse the JSON response.
Very useful information. Makes sense! I will try using Selenium to find the element. If I get stuck I'll get back to you. Thanks!
Related
I am trying to scrape the table of exchange-rate (kurs) transactions at https://www.bi.go.id/id/moneter/informasi-kurs/transaksi-bi/Default.aspx
for 2015-2020, but the problem is that the URL stays the same whether I use the default date or the date I chose. So how can I tell Python to scrape data for 2015-2020 (20-Nov-15 to 20-Nov-20)? I'm very new to Python and using Python 3. Thank you in advance.
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers={
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
    "X-Requested-With":"XMLHttpRequest"
}
url = "https://www.bi.go.id/id/moneter/informasi-kurs/transaksi-bi/Default.aspx"
import requests
from lxml import html
response = requests.get(url)
content= response.content
print(content)
A few different approaches:
Use array slicing if you are working with 1 dimensional data
Use filter / groupby methods from the Pandas library after putting your data into a dataframe
The website requires you to enter start and end dates for the query. However, as far as I know, bs4 only parses HTML it has already been given, and is not useful for submitting a query to the web site itself.
From the source code and the POST request, it looks like a complicated request, so you might be better off simulating mouse clicks.
This can be done with selenium, the automated browser testing package: open Google Chrome, enter the dates into the From and To fields, click the Lihat button, wait for the page to load, then scrape the displayed table using bs4 or selenium.
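If you would rather avoid a browser, the usual obstacle with an .aspx page is that the form has to be posted back together with its hidden state fields (such as __VIEWSTATE and __EVENTVALIDATION). Below is a rough sketch, using only the standard library, of collecting those hidden inputs so they can be merged with your date values before a POST. The sample form and the txtFrom / txtTo field names are invented; the real names have to be read from the page's own form.

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""

def build_post_payload(page_html, form_values):
    """Merge the page's hidden state fields with the values you want to submit."""
    collector = HiddenInputCollector()
    collector.feed(page_html)
    payload = dict(collector.fields)
    payload.update(form_values)
    return payload

# Invented sample form; the field names are placeholders, not the site's real ones.
sample_page = """
<form method="post" action="Default.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="txtFrom" />
  <input type="text" name="txtTo" />
</form>
"""

payload = build_post_payload(sample_page, {"txtFrom": "20-Nov-15", "txtTo": "20-Nov-20"})
print(payload)
```

The resulting dictionary could then be sent with requests.post(url, data=payload) from the same requests.Session that fetched the page. Whether this particular site accepts such a request is something you would have to test.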
I'm trying to make a bot that sends me an email once a new product is online on a website.
I tried to do that with requests and BeautifulSoup.
This is my code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.vinted.fr/vetements?search_text=football&size_id[]=207&price_from=0&price_to=15&order=newest_first'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
products = soup.find_all("div", class_="c-box")
print(len(products))
Next, I want to compare the number of products before and after each new request in a loop.
But when I try to see the number of products that I found, I get an empty list: []
I don't know how to fix that.
The div I'm targeting is nested inside other divs; I don't know whether that matters.
Thanks in advance
The problem is with the website you are trying to parse.
The website in your code generates the elements you are looking for (div.c-box) with JavaScript on the client side, after the page has loaded. So it goes like this:
Browser gets HTML source from server --(1)--> JS files are loaded as the browser loads the HTML source --> JS files add elements to the HTML source --(2)--> Those elements are loaded into the browser
You cannot fetch the data you want with requests.get, because requests.get only gets the HTML source at point (1), while the website loads the data at point (2). To fetch such data, you should use an automated browser module such as selenium.
You should always check the data.
Convert your BeautifulSoup object to a string with str(soup) (or soup.prettify()) and write it to a file. Then check what you actually got from the website. In this case, there is no element with the c-box class.
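The same check can be automated. Here is a small sketch, using only the standard library, that counts how many tags in a piece of HTML carry a given class. The two HTML strings are stand-ins I made up for what the server sends and what the browser shows after JavaScript has run.

```python
from html.parser import HTMLParser

class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1

def count_class(html_text, class_name):
    counter = ClassCounter(class_name)
    counter.feed(html_text)
    return counter.count

# Made-up stand-ins: the raw response versus the DOM after scripts have run.
raw_html = '<div id="app"></div><script src="bundle.js"></script>'
rendered_html = '<div id="app"><div class="c-box item">A</div><div class="c-box">B</div></div>'

print(count_class(raw_html, "c-box"))
print(count_class(rendered_html, "c-box"))
```

In practice you would pass page.text from requests.get; a count of zero there tells you the elements are added later by JavaScript.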
You should use selenium instead of requests.
I'm trying to scrape data from Google Translate for educational purposes.
Here is the code:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
#https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello
#tlid-transliteration-content transliteration-content full
class Phonetizer:
    def __init__(self,sentence : str,language_ : str = 'en'):
        self.words=sentence.split()
        self.language=language_
    def get_phoname(self):
        for word in self.words:
            print(word)
            url="https://translate.google.com/#view=home&op=translate&sl="+self.language+"&tl="+self.language+"&text="+word
            print(url)
            req = Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0'})
            webpage = urlopen(req).read()
            f= open("debug.html","w+")
            f.write(webpage.decode("utf-8"))
            f.close()
            #print(webpage)
            bsoup = BeautifulSoup(webpage,'html.parser')
            phonems = bsoup.findAll("div", {"class": "tlid-transliteration-content transliteration-content full"})
            print(phonems)
            #break
The problem is that the HTML I get back contains no tlid-transliteration-content transliteration-content full CSS class.
But using inspect, I found that the phonemes are inside this CSS class; here is a snapshot:
I have saved the HTML, and here it is; take a look. No tlid-transliteration-content transliteration-content full is present, and it is not like other Google Translate pages: it is not complete. I have heard that Google blocks crawlers, bots and spiders, and that they are easily detected by its system, so I added the additional header, but I still can't access the whole page.
How can I access the whole page and read all the data from the Google Translate page?
Want to contribute on this project?
I have tried this code below :
from requests_html import AsyncHTMLSession
asession = AsyncHTMLSession()
lang = "en"
word = "hello"
url="https://translate.google.com/#view=home&op=translate&sl="+lang+"&tl="+lang+"&text="+word
async def get_url():
    r = await asession.get(url)
    print(r)
    return r
results = asession.run(get_url)
for result in results:
    print(result.html.url)
    print(result.html.find('#tlid-transliteration-content'))
    print(result.html.find('#tlid-transliteration-content transliteration-content full'))
It gives me nothing, till now.
Yes, this happens because some content is generated by JavaScript when the page loads; what you see in the browser is the final DOM, after JavaScript has manipulated it (adding content). To solve this you could use selenium, but it has downsides such as speed and memory usage. A more modern and, in my opinion, better way is to use requests-html, which replaces both bs4 and urllib and has a render method, as mentioned in the documentation.
Here is sample code using requests_html. Keep in mind that what you are trying to print contains non-ASCII characters, so you might run into issues printing it in some editors such as Sublime; it ran fine using cmd.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get("https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello")
r.html.render()
css = ".source-input .tlid-transliteration-content"
print(r.html.find(css, first=True).text)
# output: heˈlō,həˈlō
First of all, I would suggest using the Google Translate API instead of scraping the Google page. The API is far easier and hassle-free, and it is the legal and conventional way of doing this.
However, if you want to fix this, here is the solution.
You are not dealing with bot detection here. Google's bot detection is so strong it would just open the Google reCAPTCHA page and not even show your desired web page.
The problem here is that the translation results are not returned by the URL you have used. This URL just displays the basic translator page; the results are fetched later by JavaScript and shown after the page has loaded. python-requests does not process the JavaScript, which is why the class doesn't even exist in the web page you are accessing.
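Part of the reason is visible in the URL itself: everything after # is a fragment, which HTTP clients do not send to the server, so the sl, tl and text values never leave your machine. A small illustration with the standard library:

```python
from urllib.parse import parse_qs, urlsplit

url = "https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello"
parts = urlsplit(url)

# What an HTTP client actually requests: scheme, host, path and query only.
requested = f"{parts.scheme}://{parts.netloc}{parts.path}"
if parts.query:
    requested += "?" + parts.query
print(requested)

# The fragment stays on the client; only in-page JavaScript can read it.
fragment_values = parse_qs(parts.fragment)
print(fragment_values)
```

The server therefore sees a plain request for the translator's start page, whatever word you put after the #.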
The solution is to trace the network requests and detect which URL the JavaScript uses to fetch the results. Fortunately, I have found the desired URL for this purpose.
If you request https://translate.google.com/translate_a/single?client=webapp&sl=en&tl=fr&hl=en&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&dt=gt&source=bh&ssel=0&tsel=0&kc=1&tk=327718.241137&q=goodmorning, you will get the translator's response as JSON. You can parse the JSON to get the desired results.
Here, you may face bot detection, which can straight away throw an HTTP 403 error.
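For readability, that long URL can be assembled from its parts. The following is only a sketch: build_translate_url is a hypothetical helper, the parameter values are copied from the URL above, and I am assuming the tk value is tied to that specific query, so other texts may need a different token or be rejected.

```python
from urllib.parse import urlencode

BASE = "https://translate.google.com/translate_a/single"

def build_translate_url(text, source="en", target="fr", tk="327718.241137"):
    # A list of pairs (not a dict) keeps the repeated "dt" parameter.
    params = [("client", "webapp"), ("sl", source), ("tl", target), ("hl", "en")]
    params += [("dt", value) for value in
               ("at", "bd", "ex", "ld", "md", "qca", "rw", "rm", "ss", "t", "gt")]
    params += [("source", "bh"), ("ssel", "0"), ("tsel", "0"), ("kc", "1"),
               ("tk", tk), ("q", text)]
    # urlencode also escapes spaces and other special characters in the text.
    return f"{BASE}?{urlencode(params)}"

print(build_translate_url("goodmorning"))
```

Calling it with "goodmorning" reproduces the URL quoted above.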
You can also use selenium to process the JavaScript and give you the results. The following changes in your code can fix it using selenium:
from selenium import webdriver
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
#https://translate.google.com/#view=home&op=translate&sl=en&tl=en&text=hello
#tlid-transliteration-content transliteration-content full
class Phonetizer:
    def __init__(self,sentence : str,language_ : str = 'en'):
        self.words=sentence.split()
        self.language=language_
    def get_phoname(self):
        for word in self.words:
            print(word)
            url="https://translate.google.com/#view=home&op=translate&sl="+self.language+"&tl="+self.language+"&text="+word
            print(url)
            #req = Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:71.0) Gecko/20100101 Firefox/71.0'})
            #webpage = urlopen(req).read()
            # Let a real browser load the page so that its JavaScript runs
            driver = webdriver.Chrome()
            driver.get(url)
            webpage = driver.page_source
            driver.close()
            # page_source is already a str, so write it without decoding
            f= open("debug.html","w+",encoding="utf-8")
            f.write(webpage)
            f.close()
            #print(webpage)
            bsoup = BeautifulSoup(webpage,'html.parser')
            phonems = bsoup.findAll("div", {"class": "tlid-transliteration-content transliteration-content full"})
            print(phonems)
            #break
You should scrape this page with JavaScript support, since the content you're looking for is "hiding" inside a <script> tag, which urllib does not render.
I would suggest using Selenium or another equivalent framework.
Take a look here: Web-scraping JavaScript page with Python
I've created a script in Python to log in to a webpage using credentials and then parse a piece of information, SIGN OUT, from another link (the script is supposed to get redirected to that link) to make sure I did log in.
Website address
I've tried with:
import requests
from bs4 import BeautifulSoup
url = "https://member.angieslist.com/gateway/platform/v1/session/login"
link = "https://member.angieslist.com/"
payload = {"identifier":"usename","token":"password"}
with requests.Session() as s:
    s.post(url,json=payload,headers={
        "User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36",
        "Referer":"https://member.angieslist.com/member/login",
        "content-type":"application/json"
    })
    r = s.get(link,headers={"User-Agent":"Mozilla/5.0"},allow_redirects=True)
    soup = BeautifulSoup(r.text,"lxml")
    login_stat = soup.select_one("button[class*='menu-item--account']").text
    print(login_stat)
When I run the above script, I get the error AttributeError: 'NoneType' object has no attribute 'text', which I take to mean something went wrong in my login process, as the information I wish to parse, SIGN OUT, is static content.
How can I parse this SIGN OUT information from that webpage?
This website requires JavaScript to work. You generate the login token correctly from the login API, but when you go to the home page, it makes multiple additional API calls and then updates the page.
So the issue has nothing to do with the login not working. You need to use something like selenium for this:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://member.angieslist.com/member/login")
# Recent Selenium 4 releases no longer have find_element_by_*; use find_element(By..., ...)
driver.find_element(By.NAME, "email").send_keys("none#getnada.com")
driver.find_element(By.NAME, "password").send_keys("NUN#123456")
driver.find_element(By.ID, "login--login-button").click()
# Crude fixed wait so the page can finish its API calls before parsing
time.sleep(3)
soup = BeautifulSoup(driver.page_source,"lxml")
login_stat = soup.select("[id*='menu-item']")
for item in login_stat:
    print(item.text)
print(login_stat)
driver.quit()
I have mixed bs4 and selenium here to keep it easy for you, but you can use just selenium as well if you want.
I'm new to Python (actually this is the second time I've tried to learn the language, so I know a little) and I'm trying to build a script that scrapes the weather forecast.
Now I have a little problem with finding the right HTML classes to import into Python. I have this code now:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(page.content, 'html.parser')
city_name = soup.find(class_="weather-widget__city-name")
print(city_name)
The problem is that this just returns 'None'.
I found the class that the code searches for via Chrome's inspect page. If I export the HTML page through Python with the following code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(page.content, 'html.parser')
city_name = soup.find(class_="weather-widget__city-name")
print(soup.prettify())
Then I see the HTML page in cmd (as expected), but I'm also unable to find 'class_="weather-widget__city-name"', so I'm not surprised that Python is also unable to. My question is: why is the HTML code that Python gives me different from the HTML code Chrome shows on the site? And am I doing something wrong by trying to find the weather widget through BeautifulSoup this way?
Here is a picture from the page, the part that I'm trying to scrape is circled in red.
Screenshot from website
Thanks in advance!
That site is loaded with JS.
Python requests doesn't run those scripts. One of those scripts is responsible for loading the data you are looking for (you can tell it's JS, maybe with a bit of jQuery, I didn't actually check, by the spinning circle while it's loading).
My suggestion here is to use the site's API.
I am not subscribed to the site so I can't show an example here, but the trick is simple. You subscribe to the site's API with the basic (and free) plan, get an API key and start sending GET requests to the API URLs.
This will also simplify things for you further, since you won't need BeautifulSoup for the parsing. The responses are all in JSON.
There is another, nastier way around it, and that is using selenium. This module will simulate the web browser and all of its JS activating, HTML rendering and CSS loading mechanisms.
I have experience with both and I strongly recommend sticking to the API (if that option exists).
For sites that use JS to send further requests, after we request the initial URL, one method that can work is to study the network tab of Chrome's developer tools (or the equivalent in any other browser).
You'll usually find a big list of URLs being requested by the browser. Most of them are unnecessary for our purposes, and a few of them relate to other sites such as Google and Facebook.
In this particular case, after the initial URL is requested, you'll find a few '.js' files being retrieved and after that, three scripts (forecast, weather, daily) that correspond to the data which finally gets presented by the browser.
From those three, the data you ask for comes from the 'weather' script. If you click on it in the network tab, another side pane will open which will contain header information, preview, etc.
In the Headers tab, you'll find the URL that you need to use, which is:
https://openweathermap.org/data/2.5/weather?id=2743477&units=metric&appid=b1b15e88fa797225412429c1c50c122a1
The b1b15e88fa797225412429c1c50c122a1 might be a general API key that is assigned to a browser request. I don't know for sure. But all we need to know is that it doesn't change. I've tried on two different systems and this value doesn't change.
The 2743477 is, of course, the City ID. You can download a reference of various cities and their IDs in their site itself:
http://bulk.openweathermap.org/sample/
As nutmeg64 said, the site actually responds with a JSON file. That's the case with both the API and the request of this URL found in the network tab of a browser.
As for the codes appearing in the JSON, the site gives you a reference to the codes and their meanings:
https://openweathermap.org/weather-conditions
With this information, you can use requests and json to retrieve and manipulate the data. Here's a sample script:
from pprint import pprint
import json
import requests
city_id = 2743477
url = 'https://openweathermap.org/data/2.5/weather?id={}&units=metric&appid=b1b15e88fa797225412429c1c50c122a1'.format(city_id)
req_headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'openweathermap.org',
    'Referer': 'https://openweathermap.org/city/2743477',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
s = requests.Session()
r = s.get(url, headers=req_headers)
d = json.loads(r.text)
pprint(d)
However, as nutmeg64 said, it's better to use the API, and resist the temptation to bombard the site with more requests than you truly need.
You can find all about their API here:
https://openweathermap.org/current
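Once the JSON is decoded, picking out individual values is ordinary dictionary access. Here is a minimal sketch; the payload below is trimmed and made up, shaped like a current-weather response as I understand it (name, main, weather, sys), and summarize is a hypothetical helper. The sunrise value is a Unix timestamp, converted here to UTC.

```python
import json
from datetime import datetime, timezone

# Trimmed, made-up payload in the shape of a current-weather response.
sample_response = """
{
  "name": "Example City",
  "main": {"temp": 9.4, "pressure": 1014, "humidity": 100},
  "weather": [{"id": 803, "main": "Clouds", "description": "broken clouds"}],
  "wind": {"speed": 3.6, "deg": 220},
  "sys": {"sunrise": 1510122600, "sunset": 1510156000}
}
"""

def summarize(payload):
    """Pick a few fields out of a decoded current-weather payload."""
    sunrise = datetime.fromtimestamp(payload["sys"]["sunrise"], tz=timezone.utc)
    return {
        "city": payload["name"],
        "temperature": payload["main"]["temp"],
        "conditions": payload["weather"][0]["description"],
        "humidity_percent": payload["main"]["humidity"],
        "sunrise_utc": sunrise.strftime("%H:%M"),
    }

summary = summarize(json.loads(sample_response))
print(summary)
```

With a real response you would call summarize(r.json()) on the object returned by requests.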
Use selenium in combination with BeautifulSoup to get any of the tables from that page without difficulty. Here is how you can do it:
from selenium import webdriver
from bs4 import BeautifulSoup
driver=webdriver.Chrome()
driver.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
table_tag = soup.select(".weather-widget__items")[0]
tab_data = [[item.text.strip() for item in row_data.select("td")]
            for row_data in table_tag.select("tr")]
for data in tab_data:
    print(data)
Partial result:
['Wind', 'Gentle Breeze,\n 3.6 m/s, Southwest ( 220 )']
['Cloudiness', 'Broken clouds']
['Pressure', '1014 hpa']
['Humidity', '100 %']
['Sunrise', '11:53']