scraping table data based on date - python

I am trying to scrape the table of exchange-rate (kurs) transactions at https://www.bi.go.id/id/moneter/informasi-kurs/transaksi-bi/Default.aspx
for 2015-2020, but the URL stays the same whether the page shows the default date or the dates I choose. How can I tell Python to scrape the data from 20-Nov-15 to 20-Nov-20? I'm very new to Python and am using Python 3. Thank you in advance.
import requests
from bs4 import BeautifulSoup
from lxml import html
import pandas as pd
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}
url = "https://www.bi.go.id/id/moneter/informasi-kurs/transaksi-bi/Default.aspx"
# A plain GET only returns the page rendered for the default date
response = requests.get(url)
content = response.content
print(content)

A few different approaches:
Use array slicing if you are working with 1 dimensional data
Use filter / groupby methods from the Pandas library after putting your data into a dataframe
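Once the rows are in Python, narrowing them to the 20-Nov-15 to 20-Nov-20 window is a plain filter. Here is a minimal sketch using only the standard library, assuming each row starts with a date string written like `20-Nov-15`. The sample rows and the name `filter_by_date` are made up for illustration; with pandas the same idea would be a boolean mask on a datetime column.

```python
from datetime import date, datetime

DATE_FORMAT = "%d-%b-%y"  # assumed date format, e.g. "20-Nov-15"


def filter_by_date(rows, start, end):
    """Keep rows whose date (first field) lies in [start, end], inclusive."""
    kept = []
    for row in rows:
        row_date = datetime.strptime(row[0], DATE_FORMAT).date()
        if start <= row_date <= end:
            kept.append(row)
    return kept


# Made-up sample rows, not real exchange-rate data
sample_rows = [
    ("19-Nov-15", 1.0),
    ("20-Nov-15", 2.0),
    ("20-Nov-20", 3.0),
    ("21-Nov-20", 4.0),
]

selected = filter_by_date(sample_rows, date(2015, 11, 20), date(2020, 11, 20))
print(selected)
```

This only helps after the data has been fetched; it does not change which dates the website returns.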

The website requires you to enter start and end dates for the query. As far as I know, bs4 only parses HTML that has already been retrieved; it cannot submit the query form on the website itself.
From the page source and the POST request, the request looks complicated, so you might be better off simulating mouse clicks.
This can be done with the selenium package for automated browser testing: open the Google Chrome browser, enter the dates into the From and To fields, click the Lihat button, wait for the page to load, then scrape the displayed table using bs4 or selenium.
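The steps above could be sketched roughly as follows with Selenium 4. This is an outline under stated assumptions, not a tested scraper for this site: the three element IDs are hypothetical placeholders (inspect the page to find the real ones), and the date format the form expects is assumed to match the `20-Nov-15` style used in the question.

```python
from datetime import date

# Hypothetical locators: replace with the real IDs found via the browser's inspector.
FROM_FIELD_ID = "from-date-input"
TO_FIELD_ID = "to-date-input"
SUBMIT_BUTTON_ID = "lihat-button"


def form_date(d):
    """Format a date like 20-Nov-15 (assumed to be what the form accepts)."""
    return d.strftime("%d-%b-%y")


def scrape_table(url, start, end, timeout=20):
    """Fill in the date range, submit, and return table rows as lists of strings."""
    # Imported here so form_date() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        driver.find_element(By.ID, FROM_FIELD_ID).send_keys(form_date(start))
        driver.find_element(By.ID, TO_FIELD_ID).send_keys(form_date(end))
        driver.find_element(By.ID, SUBMIT_BUTTON_ID).click()
        # Naive wait: only checks that some table row exists. If a table is
        # already shown for the default date, a stricter condition is needed.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "table tr"))
        )
        rows = driver.find_elements(By.CSS_SELECTOR, "table tr")
        return [
            [cell.text for cell in row.find_elements(By.CSS_SELECTOR, "th, td")]
            for row in rows
        ]
    finally:
        driver.quit()


# Example call (opens a real browser, so it is left commented out):
# rows = scrape_table(
#     "https://www.bi.go.id/id/moneter/informasi-kurs/transaksi-bi/Default.aspx",
#     date(2015, 11, 20),
#     date(2020, 11, 20),
# )
```

The returned list of rows can be passed to `pandas.DataFrame` if a dataframe is wanted.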

Related

Why I cannot scrape all the data from Zillow?

I'm trying to scrape the data from Zillow (prices) as a practice with Python and I'm not getting the data complete.
This is my code
from jobEntryBot import JobEntryBot
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from pprint import pprint
import time
import requests
URL_ZILLOW = r"https://www.zillow.com/homes/for_rent/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22mapBounds%22%3A%7B%22west%22%3A-123.4663871665039%2C%22east%22%3A-121.7744926352539%2C%22south%22%3A37.03952097286371%2C%22north%22%3A38.19687379258651%7D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22price%22%3A%7B%22max%22%3A872627%7D%2C%22beds%22%3A%7B%22min%22%3A1%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22mp%22%3A%7B%22max%22%3A3000%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22fsba%22%3A%7B%22value%22%3Afalse%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A9%7D"
header = {
'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/511.22 (KHTML, like Gecko) Chrome/139.3.3.3 Safari/312.311',
'Accept-Language': 'en-US,en;q=0.9'
}
data = requests.get(headers=header, url=URL_ZILLOW)
soup = BeautifulSoup(data.text, "html.parser")
selector_for_prices = ".gMDnGj span"
prices = soup.select(selector_for_prices)
for price in prices:
    print(price.text)
I tried this, but I **only get 9 prices**, not all 40 or so prices on the webpage.
I've tried using other functions like soup.find_all() but it doesn't work. I've tried even using selenium.
If I inspect the Zillow page and use the selector I use in the code it works but not in my code.
PS: FYI, I changed the user agent in the code shown here.
Since the website has bot-detection capabilities, you will first need to find a way to avoid detection. This post contains a comprehensive list of methods to avoid detection.
It may also be worth looking into the APIs Zillow offers, as there does not seem to be a simple way to scrape their website. But if you're just doing this for fun or as a personal learning experience, then it is definitely worth taking some time to figure out the best approach to scraping Zillow.

Search text on website BeautifulSoup python

I am trying to find a word on a website via BeautifulSoup, but I can't seem to get it. This is my code so far:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
s = session.get('https://www.doctolib.de/institut/berlin/ciz-berlin-berlin?pid=practice-158431')
soup = BeautifulSoup(s.text, 'html.parser')
tags = soup.find_all(class_="dl-text dl-text-body dl-text-regular dl-text-s dl-text-color-inherit")
for i in tags:
    print(i.string)
See below for the picture of the specific HTML element. I am trying to find the text "Keine Verfügbarkeiten".
Can anyone help me? The code I have used returns nothing.
Vaccine check
Although the content you look for in that site generates dynamically, it is still available in some script tag in page source (ctrl + U). Following is one of the ways you can parse the same using requests module in combination with re and json.
import re
import json
import requests
url = "https://www.doctolib.de/institut/berlin/ciz-berlin-berlin?pid=practice-158431"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36',
}
res = requests.get(url,headers=headers)
script = re.search(r"window\.translation_keys[^{]+(.*?});",res.text).group(1)
items = json.loads(script)
print(items['root']['common']['availabilities']['no_availabilities_vaccination'])
Output:
Keine Verfügbarkeiten
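The same idea can be tried offline on a self-contained string. This sketch swaps the lazy regex for `json.JSONDecoder.raw_decode`, which parses one complete JSON value and ignores what follows, so nested braces or a `};` inside a string do not cut the match short. The page fragment and the helper name `extract_js_object` are made up for illustration, and the approach only works when the JavaScript literal is strict JSON.

```python
import json
import re


def extract_js_object(html, var_name):
    """Return the JSON value assigned to `var_name` in `html`, or None if absent."""
    match = re.search(re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one complete JSON value starting at the given index
    # and ignores whatever follows it (here the trailing ";</script>").
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


# Made-up page fragment standing in for res.text
sample_html = (
    '<script>window.translation_keys = '
    '{"root": {"common": {"greeting": "Hallo {name};"}}};</script>'
)

keys = extract_js_object(sample_html, "window.translation_keys")
print(keys["root"]["common"]["greeting"])
```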
The page you are retrieving generates its content with JavaScript, so your GET request won't find what you are looking for; it only retrieves the raw page source ( view-source:https://www.doctolib.de/institut/berlin/ciz-berlin-berlin?pid=practice-158431 ) without any processing.
What you can do instead is run Selenium WebDriver, which acts like an actual browser: it executes the JavaScript and produces the page you see when opening the website in your browser.
Once the page is open in Selenium, you can find the element you are looking for with find_element(By.CSS_SELECTOR, ...); the older find_element_by_css_selector() method was removed in Selenium 4.
If you don't want to use Selenium, you could instead check where the webpage gets its data from. With a quick look I can see that it is querying this link to get availabilities data. With this method, you can just make a GET request to that API link and parse the JSON response.
Very useful information. Makes sense! I will try using selenium to find the element. If I get stuck I'll get back to you. Thanks!

Python : find_all() return an empty list

I'm trying to make a bot that sends me an email once a new product is online on a website.
I tried to do that with requests and beautifulSoup.
This is my code :
import requests
from bs4 import BeautifulSoup
URL = 'https://www.vinted.fr/vetements?search_text=football&size_id[]=207&price_from=0&price_to=15&order=newest_first'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
products = soup.find_all("div", class_="c-box")
print(len(products))
Next, I want to compare the number of products before and after each new request in a loop.
But when I check how many products were found, the result is empty: []
I don't know how to fix that.
The div I am targeting is nested inside other divs; I don't know whether that matters.
Thanks in advance.
The problem lies with the website you are trying to parse.
The website in your code generates the elements you are looking for (div.c-box) with JavaScript on the client side, after the page has loaded. The sequence is:
Browser gets HTML source from server --(1)--> JS files loaded as browser loads html source --> JS files add elements to the HTML source --(2)--> Those elements are loaded to the browser
You cannot fetch the data you want with requests.get, because requests.get only gets the HTML source at point (1), while the website loads the data at point (2). To fetch such data, you should use an automated browser module such as selenium.
You should always check the data.
Convert your BeautifulSoup object to a string with str(soup) (or soup.prettify()) and write it to a file. Then check what you actually got from the website. In this case, there is no element with the c-box class.
You should use selenium instead of requests.
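A quick way to run that check is to count how often the class appears in the HTML that was actually received. The sketch below uses only the standard library's `html.parser`; the two HTML strings are made-up stand-ins for what requests and a real browser would see, and `count_class` and `save_for_inspection` are hypothetical helper names.

```python
from html.parser import HTMLParser
from pathlib import Path


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in `html` carry `class_name`."""
    counter = ClassCounter(class_name)
    counter.feed(html)
    counter.close()
    return counter.count


def save_for_inspection(html, path="page_dump.html"):
    """Write the received HTML to disk so it can be opened in an editor."""
    Path(path).write_text(html, encoding="utf-8")


# Made-up stand-ins: what the server sends vs. what the browser shows after JS ran
raw_html = '<div id="app"></div><script src="bundle.js"></script>'
rendered_html = (
    '<div id="app"><div class="c-box item">A</div>'
    '<div class="c-box">B</div></div>'
)

print(count_class(raw_html, "c-box"))
print(count_class(rendered_html, "c-box"))
```

If the count is zero for the text returned by requests but the element is visible in the browser, the content is being added by JavaScript.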

Unable to scrape a piece of static information from a webpage

I've created a script in Python to log in to a webpage using credentials and then parse a piece of information, SIGN OUT, from another link (the script is supposed to be redirected to that link) to make sure I did log in.
Website address
I've tried with:
import requests
from bs4 import BeautifulSoup
url = "https://member.angieslist.com/gateway/platform/v1/session/login"
link = "https://member.angieslist.com/"
payload = {"identifier":"usename","token":"password"}
with requests.Session() as s:
    s.post(url, json=payload, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36",
        "Referer": "https://member.angieslist.com/member/login",
        "content-type": "application/json"
    })
    r = s.get(link, headers={"User-Agent": "Mozilla/5.0"}, allow_redirects=True)
    soup = BeautifulSoup(r.text, "lxml")
    login_stat = soup.select_one("button[class*='menu-item--account']").text
    print(login_stat)
When I run the above script, I get AttributeError: 'NoneType' object has no attribute 'text'. I take this to mean that something went wrong in my login process, since the information I wish to parse, SIGN OUT, is static content.
How can I parse this SIGN OUT information from that webpage?
This website requires JavaScript to work. You obtain the login token correctly from the login API, but when you go to the home page, it makes multiple additional API calls and then updates the page.
So the issue has nothing to do with the login failing. You need to use something like selenium for this:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://member.angieslist.com/member/login")
# Selenium 4 removed the find_element_by_* helpers; use find_element(By..., ...) instead
driver.find_element(By.NAME, "email").send_keys("none#getnada.com")
driver.find_element(By.NAME, "password").send_keys("NUN#123456")
driver.find_element(By.ID, "login--login-button").click()
time.sleep(3)  # crude wait for the post-login API calls to finish
soup = BeautifulSoup(driver.page_source, "lxml")
login_stat = soup.select("[id*='menu-item']")
for item in login_stat:
    print(item.text)
print(login_stat)
driver.quit()
I have mixed bs4 and selenium here to make it easy for you, but you can use selenium alone if you want.

python web scraping Weatherforecast

I'm new to Python (this is actually my second attempt at learning the language, so I know a little) and I'm trying to build a script that scrapes the weather forecast.
I have a small problem finding the right HTML classes to target from Python. This is my code so far:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(page.content, 'html.parser')
city_name = soup.find(class_="weather-widget__city-name")
print(city_name)
The problem is that this just returns 'None'.
I found the class the code searches for with Chrome's Inspect tool. If I print the HTML page from Python with the following code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(page.content, 'html.parser')
city_name = soup.find(class_="weather-widget__city-name")
print(soup.prettify())
Then I see the HTML page in cmd (as expected), but I can't find 'class_="weather-widget__city-name"' in it either, so I'm not surprised that Python can't find it. My question is: why is the HTML that Python gives me different from the HTML Chrome shows for the site? And am I doing something wrong by trying to find the weather widget through BeautifulSoup this way?
Here is a picture from the page, the part that I'm trying to scrape is circled in red.
Screenshot from website
Thanks in advance!
That site is loaded with JS.
Python requests doesn't run those scripts. One of them is responsible for loading the data you are looking for (the spinning circle while the page loads suggests it's JS, maybe with a bit of jQuery; I didn't actually check).
My suggestion here is to use the site's API.
I am not subscribed to the site, so I can't show an example here, but the trick is simple. You subscribe to the site's API with the basic (and free) plan, get an API key, and start sending GET requests to the API URLs.
This will also simplify things further, since you won't need BeautifulSoup for parsing: the responses are all in JSON.
There is another, nastier way around it, and that is using selenium. This module drives a real web browser, with all of its JS execution, HTML rendering, and CSS loading mechanisms.
I have experience with both and I strongly recommend sticking to the API (if that option exists).
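For illustration, a request to the API might be put together as below. This is a sketch under assumptions, not the answerer's code: the endpoint and the `id`, `units`, and `appid` parameters follow my understanding of OpenWeatherMap's current-weather API, `YOUR_API_KEY` is a placeholder, and the sample payload is made up to show the assumed shape of the response (`name` and `main.temp`).

```python
from urllib.parse import urlencode

API_BASE = "https://api.openweathermap.org/data/2.5/weather"


def build_weather_url(city_id, api_key, units="metric"):
    """Build the request URL for one city's current weather."""
    query = urlencode({"id": city_id, "units": units, "appid": api_key})
    return f"{API_BASE}?{query}"


def summarize(payload):
    """Pick the city name and temperature out of a response dict (assumed fields)."""
    return f"{payload['name']}: {payload['main']['temp']}"


def fetch_weather(city_id, api_key):
    """Send the request and return the decoded JSON response."""
    import requests  # imported here so the helpers above work without it

    response = requests.get(build_weather_url(city_id, api_key), timeout=10)
    response.raise_for_status()
    return response.json()


# Made-up payload, only to show the assumed shape of a response
sample_payload = {"name": "Example City", "main": {"temp": 12.5}}

print(build_weather_url(2743477, "YOUR_API_KEY"))
print(summarize(sample_payload))
```

With a real key, `summarize(fetch_weather(2743477, api_key))` would replace the sample payload.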
For sites that use JS to send further requests, after we request the initial URL, one method that can work is to study the network tab of Chrome's developer tools (or the equivalent in any other browser).
You'll usually find a big list of URLs being requested by the browser. Most of them are unnecessary for our purposes, and a few of them relate to other sites such as Google or Facebook.
In this particular case, after the initial URL is requested, you'll find a few '.js' files being retrieved and after that, three scripts (forecast, weather, daily) that correspond to the data which finally gets presented by the browser.
From those three, the data you ask for comes from the 'weather' script. If you click on it in the network tab, another side pane will open which will contain header information, preview, etc.
In the Headers tab, you'll find the URL that you need to use, which is:
https://openweathermap.org/data/2.5/weather?id=2743477&units=metric&appid=b1b15e88fa797225412429c1c50c122a1
The b1b15e88fa797225412429c1c50c122a1 might be a general API key that is assigned to a browser request. I don't know for sure. But all we need to know is that it doesn't change. I've tried on two different systems and this value doesn't change.
The 2743477 is, of course, the City ID. You can download a reference of various cities and their IDs in their site itself:
http://bulk.openweathermap.org/sample/
As nutmeg64 said, the site actually responds with a JSON file. That's the case with both the API and the request of this URL found in the network tab of a browser.
As for the codes appearing in the JSON, the site gives you a reference to the codes and their meanings:
https://openweathermap.org/weather-conditions
With this information, you can use requests and json to retrieve and manipulate the data. Here's a sample script:
from pprint import pprint
import json
import requests
city_id = 2743477
url = 'https://openweathermap.org/data/2.5/weather?id={}&units=metric&appid=b1b15e88fa797225412429c1c50c122a1'.format(city_id)
req_headers = {
'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive',
'Host': 'openweathermap.org',
'Referer': 'https://openweathermap.org/city/2743477',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
s = requests.Session()
r = s.get(url, headers=req_headers)
d = json.loads(r.text)
pprint(d)
However, as nutmeg64 said, it's better to use the API, and resist the temptation to bombard the site with more requests than you truly need.
You can find all about their API here:
https://openweathermap.org/current
You can use selenium in combination with BeautifulSoup to get any of the tables from that page without much trouble. Here is one way to do it:
from selenium import webdriver
from bs4 import BeautifulSoup
driver=webdriver.Chrome()
driver.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
table_tag = soup.select(".weather-widget__items")[0]
tab_data = [[item.text.strip() for item in row_data.select("td")]
            for row_data in table_tag.select("tr")]
for data in tab_data:
    print(data)
Partial result:
['Wind', 'Gentle Breeze,\n 3.6 m/s, Southwest ( 220 )']
['Cloudiness', 'Broken clouds']
['Pressure', '1014 hpa']
['Humidity', '100 %']
['Sunrise', '11:53']
