Why can't I scrape all the data from Zillow? - python

I'm trying to scrape price data from Zillow as practice with Python, but I'm not getting all of the data.
This is my code:
from jobEntryBot import JobEntryBot
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from pprint import pprint
import time
import requests
URL_ZILLOW = r"https://www.zillow.com/homes/for_rent/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22mapBounds%22%3A%7B%22west%22%3A-123.4663871665039%2C%22east%22%3A-121.7744926352539%2C%22south%22%3A37.03952097286371%2C%22north%22%3A38.19687379258651%7D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22price%22%3A%7B%22max%22%3A872627%7D%2C%22beds%22%3A%7B%22min%22%3A1%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22mp%22%3A%7B%22max%22%3A3000%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22fsba%22%3A%7B%22value%22%3Afalse%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A9%7D"
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/511.22 (KHTML, like Gecko) Chrome/139.3.3.3 Safari/312.311',
    'Accept-Language': 'en-US,en;q=0.9'
}
data = requests.get(headers=header, url=URL_ZILLOW)
soup = BeautifulSoup(data.text, "html.parser")
selector_for_prices = ".gMDnGj span"
prices = soup.select(selector_for_prices)
for price in prices:
    print(price.text)
I tried this, but I only get 9 prices, not the 40-something prices shown on the webpage.
I've tried other functions such as soup.find_all(), but that doesn't work either. I've even tried using Selenium.
If I inspect the Zillow page in the browser and use the same selector as in my code, it matches the prices there, but it does not in my code.
P.S.: FYI, I changed the user agent in the code shown here.

Since the website has bot-detection capabilities, you will first need to find a way to avoid detection. This post contains a comprehensive list of methods to avoid detection.
It may also be worth looking into the APIs Zillow offers, as there does not seem to be a simple way to scrape their website. But if you're just doing this for fun or as a personal learning experience, then it is definitely worth taking some time to figure out the best approach to scraping Zillow.

Related

How to efficiently scrape data from dynamic websites using Selenium?

I want to scrape data from the https://ksanahealth.com/mental-health-blog/ website.
I am trying to open each blog post by clicking its link and then scrape the details from that post's detail page.
I tried to use BeautifulSoup but it returned no data, and I realized the data was loaded dynamically with JavaScript.
Then I tried to use Selenium to scrape it, and this is the code I came up with:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
service = Service('/usr/bin/chromedrivers')
service.start()
driver = webdriver.Remote(service.service_url)
driver.get('https://ksanahealth.com/mental-health-blog/')
driver.quit()
Unfortunately, my code returns no results.
How best can I improve it so that I get the desired results from the blog?
You don't need Selenium for this. When a page is loaded dynamically, you can look in the browser's Network tab to see which URLs are being requested. The following code will get you started: it returns a dataframe with each blog title and URL. You can then request those URLs in turn. Do tell if you need guidance.
The code is below:
import requests
import pandas as pd
from bs4 import BeautifulSoup
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
    'accept': 'application/json'
}
df_list = []
# Request the first four pages of the AJAX endpoint that backs the blog list
for x in range(1, 5):
    r = requests.get(f'https://ksanahealth.com/wp-admin/admin-ajax.php?id=&post_id=107&slug=mental-health-blog&canonical_url=https%3A%2F%2Fksanahealth.com%2Fmental-health-blog%2F&posts_per_page=10&page={x}&offset=0&post_type=post&repeater=default&seo_start_page=1&preloaded=false&preloaded_amount=0&order=DESC&orderby=date&action=alm_get_posts&query_type=standard', headers=headers)
    # The JSON response carries the rendered posts in its 'html' field
    soup = BeautifulSoup(r.json()['html'], 'html.parser')
    for y in soup.select('div.post-item'):
        df_list.append((y.select_one('h4').text.strip(), y.select_one('a.more-link').get('href')))
df = pd.DataFrame(df_list, columns=['Title', 'URL'])
print(df)
This returns:
   Title                                              URL
0  Addressing the Youth Mental Health Crisis Requ...  https://www.hmpgloballearningnetwork.com/site/...
1  Remote work: What does it mean for local offic...  https://www.klcc.org/2022-02-23/remote-work-wh...
2  Second Nature?                                     https://www.oregonbusiness.com/article/tech/it...
3  6 Benefits of Continuous Behavioral Health Mea...  https://ksanahealth.com/post/6-benefits-of-con...
4  A New Level of Measurement-Based Care              https://ksanahealth.com/post/a-new-level-of-me...
5  4 Ways Continuous Behavioral Health Measuremen...  https://ksanahealth.com/post/4-ways-continuous.
[....]
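To try the parsing step without touching the network, here is a minimal sketch that runs on a canned response. The payload below is invented for illustration; it only assumes what the answer's code already relies on, namely that the endpoint returns JSON with an `html` field holding `div.post-item` blocks, each with an `h4` title and an `a.more-link` link. It uses the standard library's `html.parser` in place of BeautifulSoup so that it runs with no extra packages.

```python
import json
from html.parser import HTMLParser

# Invented sample payload, shaped like what the answer's code expects:
# a JSON object whose "html" field holds the rendered post list.
SAMPLE_RESPONSE = json.dumps({
    "html": (
        '<div class="post-item"><h4> First post </h4>'
        '<a class="more-link" href="https://example.com/post/first">Read more</a></div>'
        '<div class="post-item"><h4>Second post</h4>'
        '<a class="more-link" href="https://example.com/post/second">Read more</a></div>'
    )
})


class PostParser(HTMLParser):
    """Collect (title, url) pairs from h4 / a.more-link elements."""

    def __init__(self):
        super().__init__()
        self.posts = []          # finished (title, url) tuples
        self._in_title = False   # True while inside an <h4>
        self._title = ""         # title text collected so far

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h4":
            self._in_title = True
            self._title = ""
        elif tag == "a" and "more-link" in (attrs.get("class") or "").split():
            # Pair the link with the most recently seen title
            self.posts.append((self._title.strip(), attrs.get("href")))

    def handle_endtag(self, tag):
        if tag == "h4":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title += data


def parse_posts(response_text):
    """Decode the JSON body and parse the HTML fragment inside it."""
    fragment = json.loads(response_text)["html"]
    parser = PostParser()
    parser.feed(fragment)
    parser.close()
    return parser.posts


posts = parse_posts(SAMPLE_RESPONSE)
for title, url in posts:
    print(title, url)
```

With a real response, `response_text` would be the body returned by the request in the answer; the markup on the live site may differ from this sample.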

Search text on website BeautifulSoup python

I am trying to find a word on a website via BeautifulSoup, but I can't seem to get it. This is my code so far:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
s = session.get('https://www.doctolib.de/institut/berlin/ciz-berlin-berlin?pid=practice-158431')
soup = BeautifulSoup(s.text, 'html.parser')
tags = soup.find_all(class_="dl-text dl-text-body dl-text-regular dl-text-s dl-text-color-inherit")
for i in tags:
    print(i.string)
See the picture below for the specific HTML element. I am trying to find the text "Keine Verfügbarkeiten".
Can anyone help me? The code I have used returns nothing.
[Image: Vaccine check]
Although the content you are looking for on that site is generated dynamically, it is still available in a script tag in the page source (Ctrl + U). The following is one way to parse it using the requests module in combination with re and json.
import re
import json
import requests
url = "https://www.doctolib.de/institut/berlin/ciz-berlin-berlin?pid=practice-158431"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36',
}
res = requests.get(url,headers=headers)
script = re.search(r"window\.translation_keys[^{]+(.*?});",res.text).group(1)
items = json.loads(script)
print(items['root']['common']['availabilities']['no_availabilities_vaccination'])
Output:
Keine Verfügbarkeiten
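The extraction step can be tried offline on a canned page. This is only a sketch: the HTML below is invented, and it mirrors nothing more than the shape the answer describes (a script that assigns a JSON object to `window.translation_keys`). The regular expression is the one from the answer.

```python
import json
import re

# Invented page source; only the "window.translation_keys = {...};" shape
# is taken from the answer above.
SAMPLE_PAGE = """
<html><body>
<div id="root"></div>
<script>
window.translation_keys = {"root": {"common": {"availabilities": {"no_availabilities_vaccination": "Keine Verfügbarkeiten"}}}};
</script>
</body></html>
"""


def extract_translations(page_source):
    """Return the JSON object assigned to window.translation_keys."""
    # [^{]+ skips " = " up to the first "{"; (.*?}) then lazily captures
    # everything up to the first "}" that is directly followed by ";".
    match = re.search(r"window\.translation_keys[^{]+(.*?});", page_source)
    if match is None:
        raise ValueError("translation_keys assignment not found")
    return json.loads(match.group(1))


items = extract_translations(SAMPLE_PAGE)
message = items["root"]["common"]["availabilities"]["no_availabilities_vaccination"]
print(message)
```

One caveat of the lazy pattern: it stops at the first `};` it meets, so a string value that itself contains `};` would cut the capture short and make `json.loads` fail.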
The page you are retrieving generates its content with JavaScript, so your GET request won't find what you are looking for; it retrieves only the raw page source ( view-source:https://www.doctolib.de/institut/berlin/ciz-berlin-berlin?pid=practice-158431 ) without executing any scripts.
What you can do instead is run Selenium WebDriver, which drives an actual browser, so the JavaScript is executed and you get the page as you see it when opening the website in your browser.
Once the page is open in Selenium, you can find the element you are looking for using find_element(By.CSS_SELECTOR, ...); the older find_element_by_css_selector() method was removed in Selenium 4.3.
If you would rather not use Selenium, you could check where the webpage gets its data from. With a quick look I can see that it is querying this link to get availabilities data. With this method, you can just make a GET request to that API link and parse the JSON response.
Very useful information. Makes sense! I will try using Selenium to find the element. If I get stuck, I'll get back to you. Thanks!

scraping table data based on date

I am trying to scrape the table of exchange-rate (kurs) transactions at https://www.bi.go.id/id/moneter/informasi-kurs/transaksi-bi/Default.aspx
for 2015-2020, but the problem is that the URL stays the same whether the page shows the default date or the date I chose. So how can I tell Python to scrape data from 2015 to 2020 (20-Nov-15 to 20-Nov-20)? I'm very new to Python and am using Python 3. Thank you in advance.
import requests
from bs4 import BeautifulSoup
import pandas as pd
from lxml import html
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}
url = "https://www.bi.go.id/id/moneter/informasi-kurs/transaksi-bi/Default.aspx"
# Note: headers is defined above but is not sent with this request
response = requests.get(url)
content = response.content
print(content)
A few different approaches:
Use array slicing if you are working with one-dimensional data
Use the filter / groupby methods from the pandas library after putting your data into a dataframe
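As a rough illustration of the filtering idea, here is a minimal sketch that keeps only the rows whose date falls in the range from the question (20-Nov-15 to 20-Nov-20). It uses plain lists and the standard `datetime` module rather than pandas, and the rows are invented placeholders that stand in for a table that has already been scraped; the field names are assumptions.

```python
from datetime import date

# Invented rows standing in for an already-scraped table.
rows = [
    {"date": "2015-11-19", "currency": "USD"},
    {"date": "2015-11-20", "currency": "USD"},
    {"date": "2018-06-01", "currency": "USD"},
    {"date": "2020-11-20", "currency": "USD"},
    {"date": "2020-11-21", "currency": "USD"},
]

START = date(2015, 11, 20)
END = date(2020, 11, 20)


def in_range(row, start=START, end=END):
    """True if the row's ISO-formatted date lies in [start, end]."""
    return start <= date.fromisoformat(row["date"]) <= end


selected = [row for row in rows if in_range(row)]
for row in selected:
    print(row["date"], row["currency"])
```

This only helps once the data for the whole period has been retrieved; it does not solve the problem of getting the site to return other dates in the first place.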
The website requires you to enter start and end dates for the query. However, as far as I know, bs4 only parses HTML that has already been retrieved, so it is not much use for submitting a query to the website itself.
From the source code and the POST request, it looks like a complicated request, so you might be better off simulating mouse clicks.
This can be done with the Selenium browser-automation package: open Google Chrome, enter the dates into the From and To fields, click the Lihat button, wait for the page to load, then scrape the displayed table using bs4 or Selenium.

How to achieve page turn in the <nav> class when making a web crawler?

I am trying to scrape the sales and categories of the top products on https://shopee.co.id/top_products, but I got stuck on how to automate visiting every page in the navigation bar. In particular, there is an expanded list, and I can't figure out how to get into it just by looking at the HTML. Here's a picture of the page, and some of my code:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
headers = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
'cookie': '_gcl_au=1.1.961206468.1594951946; _med=refer; _fbp=fb.2.1594951949275.1940955365; SPC_IA=-1; SPC_F=y1evilme0ImdfEmNWEc08bul3d8toc33; REC_T_ID=fab983c8-c7d2-11ea-a977-ccbbfe23657a; SPC_SI=uv1y64sfvhx3w6dir503ixw89ve2ixt4; _gid=GA1.3.413262278.1594951963; SPC_U=286107140; SPC_EC=GwoQmu7TiknULYXKODlEi5vEgjawyqNcpIWQjoxjQEW2yJ3H/jsB1Pw9iCgGRGYFfAkT/Ej00ruDcf7DHjg4eNGWbCG+0uXcKb7bqLDcn+A2hEl1XMtj1FCCIES7k17xoVdYW1tGg0qaXnSz0/Uf3iaEIIk7Q9rqsnT+COWVg8Y=; csrftoken=5MdKKnZH5boQXpaAza1kOVLRFBjx1eij; welcomePkgShown=true; _ga=GA1.1.1693450966.1594951955; _dc_gtm_UA-61904553-8=1; REC_MD_30_2002454304=1595153616; _ga_SW6D8G0HXK=GS1.1.1595152099.14.1.1595153019.0; REC_MD_41_1000044=1595153318_0_50_0_49; SPC_R_T_ID="Am9bCo3cc3Jno2mV5RDkLJIVsbIWEDTC6ezJknXdVVRfxlQRoGDcya57fIQsioFKZWhP8/9PAGhldR0L/efzcrKONe62GAzvsztkZHfAl0I="; SPC_T_IV="IETR5YkWloW3OcKf80c6RQ=="; SPC_R_T_IV="IETR5YkWloW3OcKf80c6RQ=="; SPC_T_ID="Am9bCo3cc3Jno2mV5RDkLJIVsbIWEDTC6ezJknXdVVRfxlQRoGDcya57fIQsioFKZWhP8/9PAGhldR0L/efzcrKONe62GAzvsztkZHfAl0I="'
}
driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver')
shopee_url = 'https://shopee.co.id/top_products'
driver.get(shopee_url)
driver.implicitly_wait(15)
response = driver.page_source
driver.close()
soup = bs(response, "html.parser")
url_list = []
for tags in soup.find_all('li', attrs={'class': 'stardust-tabs-header__tab stardust-tabs-header__tab--active'}):
    for a_tag in tags.find_all('a'):
        # Collect the link target of each anchor in the tab
        url_list.append(a_tag.get('href'))
Look at the Network tab; there are several calls made there, for example this one:
https://shopee.co.id/api/v4/recommend/recommend?bundle=top_sold_product_microsite&limit=20&offset=0
which will give you all the nav bar links as nicely formatted JSON.
Sometimes you can get more information by looking at the different requests made in the Network tab than by parsing the HTML body.
The first item in the nav bar is Kuota Data Internet; if you click on it, you're redirected to https://shopee.co.id/top_products?catId=ID_V2L0_65
That means each URL in the nav bar is of the form https://shopee.co.id/top_products?catId=CAT_ID
You can find the CAT_ID for each one by looking at https://shopee.co.id/api/v4/recommend/recommend?bundle=top_sold_product_microsite&limit=20&offset=0
and perhaps changing the limit to something other than 20 and the offset to something other than 0.
For Kuota Data Internet, the CAT_ID is ID_V2L0_65.
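The two URL patterns described above can be generated rather than typed by hand. This is a minimal sketch that only builds the URLs; it makes no requests, and it does not assume anything about the structure of the JSON the API returns. The helper names `api_url` and `category_url` are made up for this example, and stepping `offset` by `limit` is an assumption about how the endpoint pages its results.

```python
from urllib.parse import urlencode

API_BASE = "https://shopee.co.id/api/v4/recommend/recommend"
PAGE_BASE = "https://shopee.co.id/top_products"


def api_url(limit=20, offset=0):
    """Build the recommend API URL for a given page size and offset."""
    params = {
        "bundle": "top_sold_product_microsite",
        "limit": limit,
        "offset": offset,
    }
    return f"{API_BASE}?{urlencode(params)}"


def category_url(cat_id):
    """Build the top-products page URL for one category id."""
    return f"{PAGE_BASE}?{urlencode({'catId': cat_id})}"


# Three consecutive API pages of 20 items each (offsets 0, 20, 40)
api_pages = [api_url(limit=20, offset=off) for off in range(0, 60, 20)]
for page in api_pages:
    print(page)

# The one category id that is known from the answer
print(category_url("ID_V2L0_65"))
```

Once the category ids have been read from the API response, passing each one to `category_url` gives the page to visit for that nav bar entry.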

python web scraping Weatherforecast

I'm new to Python (actually this is the second time I've tried to learn the language, so I know a little) and I'm trying to build a script that scrapes the weather forecast.
Now I have a little problem with finding the right HTML classes to use in Python. I have this code now:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(page.content, 'html.parser')
city_name = soup.find(class_="weather-widget__city-name")
print(city_name)
The problem is that this just returns 'None'.
I found the class that the code searches for via Chrome's Inspect tool. If I print the HTML page from Python with the following code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(page.content, 'html.parser')
city_name = soup.find(class_="weather-widget__city-name")
print(soup.prettify())
Then I see the HTML page in cmd (as expected), but I'm also unable to find 'class_="weather-widget__city-name"' in it, so I'm not surprised that Python can't find it either. My question is: why is the HTML that Python gives me different from the HTML Chrome shows on the site? And am I doing something wrong by trying to find the weather widget through BeautifulSoup this way?
Here is a picture of the page; the part that I'm trying to scrape is circled in red.
[Image: Screenshot from website]
Thanks in advance!
That site is loaded with JS.
Python requests doesn't run those scripts. One of them is responsible for loading the data you are looking for (the spinning circle while the page loads shows that it's JS, maybe with a bit of jQuery; I didn't actually check).
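The difference between what the server sends and what the browser shows can be reproduced on a tiny invented page. In this sketch the markup is made up: the widget element exists only as text inside a script, the way it might before any JavaScript has run. A parser that does not execute scripts therefore never sees an element with that class. The standard library's `html.parser` is used here so the example runs without extra packages; the helper names are hypothetical.

```python
from html.parser import HTMLParser

# Invented static page: the city-name element is only created by the script.
STATIC_HTML = """
<html><body>
<div id="weather-widget"></div>
<script>
document.getElementById("weather-widget").innerHTML =
    '<h2 class="weather-widget__city-name">Example City</h2>';
</script>
</body></html>
"""


class ClassFinder(HTMLParser):
    """Record whether any real element carries the wanted class."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.found = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted in classes:
            self.found = True


def has_class(html, class_name):
    finder = ClassFinder(class_name)
    finder.feed(html)
    finder.close()
    return finder.found


# Script contents are treated as plain text, not as markup
in_markup = has_class(STATIC_HTML, "weather-widget__city-name")
# The class name is still present as a raw substring of the source
in_text = "weather-widget__city-name" in STATIC_HTML
print(in_markup, in_text)
```

A browser runs the script and then shows the element in its inspector, which is why the class is visible there but missing from what requests downloads.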
My suggestion here is to use the site's API.
I am not subscribed to the site so I can't show an example here, but the trick is simple: you subscribe to the site's API with the basic (and free) plan, get an API key, and start sending GET requests to the API URLs.
This will also simplify things for you, since you won't need BeautifulSoup for the parsing. The responses are all in JSON.
There is another, nastier way around it, and that is using Selenium. This module drives a real web browser, with all of its JS execution, HTML rendering, and CSS loading mechanisms.
I have experience with both and I strongly recommend sticking to the API (if that option exists).
For sites that use JS to send further requests, after we request the initial URL, one method that can work is to study the network tab of Chrome's developer tools (or the equivalent in any other browser).
You'll usually find a big list of URLs being requested by the browser. Most of them are unnecessary for our purposes, and a few of them relate to other sites such as Google and Facebook.
In this particular case, after the initial URL is requested, you'll find a few '.js' files being retrieved and after that, three scripts (forecast, weather, daily) that correspond to the data which finally gets presented by the browser.
From those three, the data you ask for comes from the 'weather' script. If you click on it in the network tab, another side pane will open which will contain header information, preview, etc.
In the Headers tab, you'll find the URL that you need to use, which is:
https://openweathermap.org/data/2.5/weather?id=2743477&units=metric&appid=b1b15e88fa797225412429c1c50c122a1
The b1b15e88fa797225412429c1c50c122a1 might be a general API key that is assigned to a browser request. I don't know for sure. But all we need to know is that it doesn't change. I've tried on two different systems and this value doesn't change.
The 2743477 is, of course, the City ID. You can download a reference of various cities and their IDs in their site itself:
http://bulk.openweathermap.org/sample/
As nutmeg64 said, the site actually responds with a JSON file. That's the case with both the API and the request of this URL found in the network tab of a browser.
As for the codes appearing in the JSON, the site gives you a reference to the codes and their meanings:
https://openweathermap.org/weather-conditions
With this information, you can use requests and json to retrieve and manipulate the data. Here's a sample script:
from pprint import pprint
import json
import requests
city_id = 2743477
url = 'https://openweathermap.org/data/2.5/weather?id={}&units=metric&appid=b1b15e88fa797225412429c1c50c122a1'.format(city_id)
req_headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'openweathermap.org',
    'Referer': 'https://openweathermap.org/city/2743477',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
s = requests.Session()
r = s.get(url, headers=req_headers)
d = json.loads(r.text)
pprint(d)
However, as nutmeg64 said, it's better to use the API, and resist the temptation to bombard the site with more requests than you truly need.
You can find all about their API here:
https://openweathermap.org/current
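Once the JSON has been retrieved, picking out individual values is plain dictionary access. The sketch below runs on a made-up sample laid out like a current-weather response (a `weather` list plus `main` and `wind` objects); the numbers and the city name are placeholders, not real observations, and the `summarize` helper is hypothetical.

```python
import json

# Made-up sample in the shape of a current-weather response;
# the values are placeholders, not real observations.
SAMPLE_BODY = """
{
  "weather": [{"id": 803, "main": "Clouds", "description": "broken clouds", "icon": "04d"}],
  "main": {"temp": 12.5, "pressure": 1014, "humidity": 100},
  "wind": {"speed": 3.6, "deg": 220},
  "id": 2743477,
  "name": "Example City"
}
"""


def summarize(payload):
    """Pull a few commonly used fields out of a decoded response dict."""
    conditions = payload.get("weather") or [{}]
    main = payload.get("main", {})
    wind = payload.get("wind", {})
    return {
        "city": payload.get("name"),
        "description": conditions[0].get("description"),
        "temp": main.get("temp"),  # unit depends on the units= parameter
        "pressure_hpa": main.get("pressure"),
        "humidity_pct": main.get("humidity"),
        "wind_speed": wind.get("speed"),
        "wind_deg": wind.get("deg"),
    }


summary = summarize(json.loads(SAMPLE_BODY))
print(summary)
```

Using `.get` keeps the helper from raising if a field is missing from a particular response; the missing entries simply come back as `None`.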
Use Selenium in combination with BeautifulSoup to get any of the tables from that page without difficulty. Here is how you can do it:
from selenium import webdriver
from bs4 import BeautifulSoup
driver=webdriver.Chrome()
driver.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
table_tag = soup.select(".weather-widget__items")[0]
tab_data = [[item.text.strip() for item in row_data.select("td")]
            for row_data in table_tag.select("tr")]
for data in tab_data:
    print(data)
Partial result:
['Wind', 'Gentle Breeze,\n 3.6 m/s, Southwest ( 220 )']
['Cloudiness', 'Broken clouds']
['Pressure', '1014 hpa']
['Humidity', '100 %']
['Sunrise', '11:53']
