I am exploring Zillow housing data for analysis, but the data I scrape from Zillow contains far fewer records than the listing page shows.
Here is one example:
I try to pull the houses-for-sale listing for ZIP code 35216:
https://www.zillow.com/birmingham-al-35216/?searchQueryState=%7B%22usersSearchTerm%22%3A%2235216%22%2C%22mapBounds%22%3A%7B%22west%22%3A-86.93997505787829%2C%22east%22%3A-86.62926796559313%2C%22south%22%3A33.33562772711966%2C%22north%22%3A33.51819716059094%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A73386%2C%22regionType%22%3A7%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22ah%22%3A%7B%22value%22%3Atrue%7D%2C%22sort%22%3A%7B%22value%22%3A%22globalrelevanceex%22%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A13%2C%22pagination%22%3A%7B%7D%7D
The page shows 76 records, and the Google Chrome extension Zillow-to-excel scrapes all 76 houses in the listing:
https://chrome.google.com/webstore/detail/zillow-to-excel/aecdekdgjlncaadbdiciepplaobhcjgi/related
But when I use Python with requests to scrape the Zillow data, only 18-20 records are returned.
Here is my code:
import requests
import json
from bs4 import BeautifulSoup as soup
import pandas as pd
import numpy as np
cnt=0
stop_check=0
ele=[]
url='https://www.zillow.com/birmingham-al-35216/'
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,zh-TW;q=0.6',
'upgrade-insecure-requests': '1',
'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
for i in range(1,2):
    params = {
        'searchQueryState':'{"pagination":{"currentPage":'+str(i)+'},"usersSearchTerm":"35216","mapBounds":{"west":-86.83314614582643,"east":-86.73781685417354,"south":33.32843303639682,"north":33.511017584543204},"regionSelection":[{"regionId":73386,"regionType":7}],"isMapVisible":true,"filterState":{"sort":{"value":"globalrelevanceex"},"ah":{"value":true}},"isListVisible":true,"mapZoom":13}'
    }
    page=requests.get(url, headers=headers,params=params,timeout=2)
    sp=soup(page.content, 'lxml')
    lst=sp.find_all('address',{'class':'list-card-addr'})
    ele.extend(lst)
    print(i, len(lst))
    if len(lst)==0:
        stop_check+=1
    if stop_check>=3:
        print('stop on three empty')
The headers and params come from the Chrome developer tools. I also tried other searches and found that I can only scrape the first 9-11 records on each page.
I know there is a Zillow API, but it cannot be used for a general search such as all houses in a ZIP code, so I want to try web scraping.
Could I get some suggestions on how to fix my code?
Thanks a lot!
You can try this:
import requests
import json
url = 'https://www.zillow.com/search/GetSearchPageState.htm'
headers = {
'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'upgrade-insecure-requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}
houses = []
for page in range(1, 3):
    params = {
        "searchQueryState": json.dumps({
            "pagination": {"currentPage": page},
            "usersSearchTerm": "35216",
            "mapBounds": {
                "west": -86.97413567189196,
                "east": -86.57244804982165,
                "south": 33.346263857015515,
                "north": 33.48754107532057
            },
            "mapZoom": 12,
            "regionSelection": [
                {
                    "regionId": 73386, "regionType": 7
                }
            ],
            "isMapVisible": True,
            "filterState": {
                "isAllHomes": {
                    "value": True
                },
                "sortSelection": {
                    "value": "globalrelevanceex"
                }
            },
            "isListVisible": True
        }),
        "wants": json.dumps(
            {
                "cat1": ["listResults", "mapResults"],
                "cat2": ["total"]
            }
        ),
        "requestId": 3
    }
    # send request (named response so it does not shadow the loop variable page)
    response = requests.get(url, headers=headers, params=params)
    # get json data
    json_data = response.json()
    # collect the listings from this page
    for house in json_data['cat1']['searchResults']['listResults']:
        houses.append(house)
# show data
print('Total houses - {}'.format(len(houses)))
# show info in houses
for house in houses:
    if 'brokerName' in house.keys():
        print('{}: {}'.format(house['brokerName'], house['price']))
    else:
        print('No broker: {}'.format(house['price']))
Total houses - 76
RealtySouth-MB-Crestline: $424,900
eXp Realty, LLC Central: $259,900
ARC Realty Mountain Brook: $849,000
Ray & Poynor Properties: $499,900
Hinge Realty: $1,550,000
...
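The loop above hard-codes two pages with range(1, 3). A minimal sketch of a more general loop follows, assuming (not verified against Zillow) that the endpoint returns an empty listResults list once you go past the last page. The helper name collect_all_pages, the fake_fetch stand-in, and the page size of 40 are all illustrative assumptions; in real use fetch_page would wrap the requests.get call shown above.

```python
def collect_all_pages(fetch_page, max_pages=50):
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty."""
    results = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break  # an empty page is treated as the end of the listing
        results.extend(batch)
    return results


# Stand-in for the real request: 76 fake listings served 40 per page
# (the page size is an assumption made only for this demo).
fake_listings = [{"price": "${}".format(i)} for i in range(76)]


def fake_fetch(page, per_page=40):
    start = (page - 1) * per_page
    return fake_listings[start:start + per_page]


houses = collect_all_pages(fake_fetch)
print("Total houses - {}".format(len(houses)))
```

The max_pages cap is there so a misbehaving endpoint cannot keep the loop running forever.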
I'm trying to get a JSON response by issuing a POST request with the appropriate parameters to a webpage. When I run the script, it gets stuck and doesn't return any result; it doesn't throw an error either. On the site, I chose an option from each of the three dropdowns in the search form before hitting the Get times & tickets button.
I've tried with:
import requests
from bs4 import BeautifulSoup
url = 'https://www.thetrainline.com/'
link = 'https://www.thetrainline.com/api/journey-search/'
payload = {"passengers":[{"dateOfBirth":"1991-01-31"}],"isEurope":False,"cards":[],"transitDefinitions":[{"direction":"outward","origin":"1f06fc66ccd7ea92ae4b0a550e4ddfd1","destination":"7c25e933fd14386745a7f49423969308","journeyDate":{"type":"departAfter","time":"2021-02-11T22:45:00"}}],"type":"single","maximumJourneys":4,"includeRealtime":True,"applyFareDiscounts":True}
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    s.headers['content-type'] = 'application/json'
    s.headers['accept'] = 'application/json'
    r = s.post(link,json=payload)
    print(r.status_code)
    print(r.json())
How can I get a JSON response by issuing a POST request with parameters to that site?
You are missing the required headers x-version and referer. The referer header refers to the search form, and you can build it yourself. Before calling journey-search, you have to post an availability request.
import requests
from requests.models import PreparedRequest
headers = {
'authority': 'www.thetrainline.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'x-version': '2.0.18186',
'dnt': '1',
'accept-language': 'en-GB',
'sec-ch-ua-mobile': '?0',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/88.0.4324.96 Safari/537.36',
'content-type': 'application/json',
'accept': 'application/json',
'origin': 'https://www.thetrainline.com',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
}
with requests.Session() as s:
    origin = "6e2242b3f38bbbd8d8124e1d84d319e1"
    destination = "15bcf02bc44ea754837c8cf14569f608"
    localDateTime = "2021-02-03T19:30:00"
    dateOfBirth = "1991-02-03"
    passenger_type = "single"
    # build the referer URL from the same values the search form would send
    req = PreparedRequest()
    params = {
        "origin": origin,
        "destination": destination,
        "outwardDate": localDateTime,
        "outwardDateType": "departAfter",
        "journeySearchType": passenger_type,
        "passengers[]": dateOfBirth
    }
    req.prepare_url("https://www.thetrainline.com/book/results", params)
    headers.update({"referer": req.url})
    s.headers = headers
    # first request: availability
    payload_availability = {
        "origin": origin,
        "destination": destination,
        "outwardDefinition": {
            "localDateTime": localDateTime,
            "searchMethod": "DEPARTAFTER"
        },
        "passengerBirthDates": [{
            "id": "PASSENGER-0",
            "dateOfBirth": dateOfBirth
        }],
        "maximumNumberOfJourneys": 4,
        "discountCards": []
    }
    r = s.post('https://www.thetrainline.com/api/coaches/availability', json=payload_availability)
    r.raise_for_status()
    # second request: the journey search itself
    payload_search = {
        "passengers": [{"dateOfBirth": dateOfBirth}],
        "isEurope": False,
        "cards": [],
        "transitDefinitions": [{
            "direction": "outward",
            "origin": origin,
            "destination": destination,
            "journeyDate": {
                "type": "departAfter",
                "time": localDateTime}
        }],
        "type": passenger_type,
        "maximumJourneys": 4,
        "includeRealtime": True,
        "applyFareDiscounts": True
    }
    r = s.post('https://www.thetrainline.com/api/journey-search/', json=payload_search)
    r.raise_for_status()
    print(r.json())
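If you would rather not go through PreparedRequest just to build the referer, the same URL can be sketched with the standard library. This is only an illustration: build_referer is a hypothetical helper name, the parameter names are copied from the answer above, and the result should be equivalent to what prepare_url produces, though I have not compared the two byte for byte.

```python
from urllib.parse import urlencode


def build_referer(origin, destination, outward_date, date_of_birth,
                  journey_type="single"):
    """Build the results-page URL that is sent as the referer header."""
    params = {
        "origin": origin,
        "destination": destination,
        "outwardDate": outward_date,
        "outwardDateType": "departAfter",
        "journeySearchType": journey_type,
        "passengers[]": date_of_birth,
    }
    # urlencode percent-encodes ':' and the '[]' in the passengers key
    return "https://www.thetrainline.com/book/results?" + urlencode(params)


referer = build_referer(
    "6e2242b3f38bbbd8d8124e1d84d319e1",
    "15bcf02bc44ea754837c8cf14569f608",
    "2021-02-03T19:30:00",
    "1991-02-03",
)
print(referer)
```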
As Sers's reply says, headers are missing.
When crawling websites, keep anti-crawling mechanisms in mind. A website may block your requests based on your IP address, request headers, cookies, and various other factors.
What is wrong with my code? I am trying to get the same content as shown at https://koleo.pl/rozklad-pkp/krakow-glowny/radom/19-03-2019_10:00/all/EIP-IC--EIC-EIP-IC-KM-REG but the result is different from what I want.
import requests
from bs4 import BeautifulSoup
s = requests.Session()
s.headers.update({"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'})
response=s.get('https://koleo.pl/rozklad-pkp/krakow-glowny/radom/19-03-2019_10:00/all/EIP-IC--EIC-EIP-IC-KM-REG')
soup=BeautifulSoup(response.text,'lxml')
print(soup.prettify())
You can use requests and pass params to get JSON for the train info and prices. I haven't parsed out all the info, as this is just to show that it is possible. I parse out the train ids so that I can make the subsequent requests for price info, which is linked to the train info by id.
import requests
from bs4 import BeautifulSoup as bs
url = 'https://koleo.pl/pl/connections/?'
headers = {
'Accept' : 'application/json, text/javascript, */*; q=0.01',
'Accept-Encoding' : 'gzip, deflate, br',
'Accept-Language' : 'en-US,en;q=0.9',
'Connection' : 'keep-alive',
'Cookie' : '_ga=GA1.2.2048035736.1553000429; _gid=GA1.2.600745193.1553000429; _gat=1; _koleo_session=bkN4dWRrZGx0UnkyZ3hjMWpFNGhiS1I3TzhQMGNyWitvZlZ0QVRUVVVtWUFPMUwxL0hJYWJyYnlGTUdHYXNuL1N6QlhHMHlRZFM3eFZFcjRuK3ZubllmMjdSaU5CMWRBSTFOc1JRc2lDUGV0Y2NtTjRzbzZEd0laZWI1bjJoK1UrYnc5NWNzZzNJdXVtUlpnVE15QnRnPT0tLTc1YzV1Q2xoRHF4VFpWWTdWZDJXUnc9PQ%3D%3D--3b5fe9bb7b0ce5960bc5bd6a00bf405df87f8bd4',
'Host' : 'koleo.pl',
'Referer' : 'https://koleo.pl/rozklad-pkp/krakow-glowny/radom/19-03-2019_10:00/all/EIP-IC--EIC-EIP-IC-KM-REG',
'User-Agent' : 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
'X-CSRF-Token' : 'heag3Y5/fh0hyOfgdmSGJBmdJR3Perle2vJI0VjB81KClATLsJxFAO4SO9bY6Ag8h6IkpFieW1mtZbD4mga7ZQ==',
'X-Requested-With' : 'XMLHttpRequest'
}
params = {
    'v' : 'a0dec240d8d016fbfca9b552898aba9c38fc19d5',
    'query[date]' : '19-03-2019 10:00:00',
    'query[start_station]' : 'krakow-glowny',
    'query[end_station]': 'radom',
    # a dict keeps only the last value for a repeated key, so both brand ids go in one list;
    # requests sends a list value as repeated query[brand_ids][] parameters
    'query[brand_ids][]' : ['29', '28'],
    'query[only_direct]' : 'false',
    'query[only_purchasable]': 'false'
}
with requests.Session() as s:
    data= s.get(url, params = params, headers = headers).json()
    print(data)
    priceUrl = 'https://koleo.pl/pl/prices/{}?v=a0dec240d8d016fbfca9b552898aba9c38fc19d5'
    # one price request per connection id
    for item in data['connections']:
        r = s.get(priceUrl.format(item['id'])).json()
        print(r)
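A note on the query[brand_ids][] parameter: a Python dict literal keeps only the last value when a key is repeated, so sending two brand ids requires a single key with a list value. The sketch below checks this with the standard library only; requests encodes list values in params as repeated keys in the same way. The trimmed-down params dict is just for the demo.

```python
from urllib.parse import urlencode

# Repeating a key in a dict literal silently drops the earlier value.
overwritten = {"query[brand_ids][]": "29", "query[brand_ids][]": "28"}
print(overwritten)

# A list value is expanded into one key=value pair per element.
params = {
    "query[start_station]": "krakow-glowny",
    "query[end_station]": "radom",
    "query[brand_ids][]": ["29", "28"],
}
query_string = urlencode(params, doseq=True)
print(query_string)
```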
You can use Selenium to get that dynamically generated content and then parse the HTML with BeautifulSoup. For example, I've parsed the dates:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://koleo.pl/rozklad-pkp/krakow-glowny/radom/19-03-2019_10:00/all/EIP-IC--EIC-EIP-IC-KM-REG')
soup = BeautifulSoup(driver.page_source, 'lxml')
for div in soup.findAll("div", {"class": 'date custom-panel'}):
    date = div.findAll("div", {"class": 'row'})[0].string.strip()
    print(date)
Output:
wtorek, 19 marca
środa, 20 marca
import requests
from lxml import html
from bs4 import BeautifulSoup
session_requests = requests.session()
sw_url = "https://www.southwest.com"
sw_url2 = "https://www.southwest.com/flight/select-flight.html?displayOnly=&int=HOMEQBOMAIR"
#result = session_requests.get(sw_url)
#tree = html.fromstring(result.text)
payload = {"name":"AirFormModel","origin":"MCI","destination":"DAL","departDate":"2018-02-28T06:00:00.000Z","returnDate":"2018-03-03T06:00:00.000Z","tripType":"true","priceType":"DOLLARS","adult":1,"senior":0,"promoCode":""}
#{
# 'origin': 'MCI',
# 'destination': 'DAL',
# 'departDate':'2018-02-28T06:00:00.000Z',
# 'returnDate':'2018-03-01T06:00:00.000Z',
# 'adult':'1'
#}
p = requests.post(sw_url,params=payload)
#print(p.text)
print(p.content)
p1 = requests.get(sw_url2)
soup = BeautifulSoup(p.text,'html.parser')
print(soup.find("div",{"class":"productPricing"}))
pr = soup.find_all("span",{"class":"currency_symbol"})
for tag in pr:
    print(tag)
    print('++++')
    print(tag.next_sibling)
print(soup.find("div",{"class":"twoSegments"}))
soup = BeautifulSoup(p1.text,'html.parser')
print(soup.find("div",{"class":"productPricing"}))
pr = soup.find_all("span",{"class":"currency_symbol"})
for tag in pr:
    print(tag)
    print('++++')
    print(tag.next_sibling)
print(soup.find("div",{"class":"twoSegments"}))
I need to retrieve prices for flights between 2 locations on specific dates. I identified the parameters by looking at the session info from inspector of the browser and included them in the post request.
I am not sure what I'm doing wrong here, but I am unable to read the data from the tags correctly; it prints None.
Edit : 4/25/2018
I'm using the following code now, but it doesn't seem to help. Please advise.
import threading
from lxml import html
from bs4 import BeautifulSoup
import time
import datetime
import requests
def worker(oa,da,ods):
    """thread worker function"""
    print (oa + ' ' + da + ' ' + ods + ' ' + str(datetime.datetime.now()))
    url = "https://www.southwest.com/api/air-booking/v1/air-booking/page/air/booking/shopping"
    rh = {
        'accept': 'application/json,text/javascript,*/*;q=0.01',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en;q=0.5',
        'cache-control': 'max-age=0',
        'content-length': '454',
        'content-type': 'application/json',
        'referer': 'https://www.southwest.com/air/booking/select.html?originationAirportCode=MCI&destinationAirportCode=LAS&returnAirportCode=&departureDate=2018-05-29&departureTimeOfDay=ALL_DAY&returnDate=&returnTimeOfDay=ALL_DAY&adultPassengersCount=1&seniorPassengersCount=0&fareType=USD&passengerType=ADULT&tripType=oneway&promoCode=&reset=true&redirectToVision=true&int=HOMEQBOMAIR&leapfrogRequest=true',
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    }
    fd = {
        'returnAirport':'',
        'twoWayTrip':'false',
        'fareType':'DOLLARS',
        'originAirport':oa,
        'destinationAirport':da,
        'outboundDateString':ods,
        'returnDateString':'',
        'adultPassengerCount':'1',
        'seniorPassengerCount':'0',
        'promoCode':'',
        'submitButton':'true'
    }
    with requests.Session() as s:
        r = s.post(url,headers = rh )
        # soup = BeautifulSoup(r.content,'html.parser')
        # soup = BeautifulSoup(r.content,'lxml')
        print(r)
        print(r.content)
    print (oa + ' ' + da + ' ' + ods + ' ' + str(datetime.datetime.now()))
    return
#db = MySQLdb.connect(host="localhost",user="root",passwd="vikram",db="garmin")
rcount = 0
tdelta = 55
#print(strt_date)
threads = []
count = 1
thr_max = 2
r = ["MCI","DEN","MCI","MDW","MCI","DAL"]
strt_date = (datetime.date.today() + datetime.timedelta(days=tdelta)).strftime("%m/%d/%Y")
while count < 2:
    t = threading.Thread(name=r[count-1]+r[count],target=worker,args=(r[count-1],r[count],strt_date))
    threads.append(t)
    t.start()
    count = count + 2
When you say you looked at the session info from the browser's inspector, I'm assuming you mean the Network tab. If that's the case, are you sure you noted the data being sent correctly?
Here's the URL that gets sent by the browser, following which the page you required is fetched:
url = 'https://www.southwest.com/flight/search-flight.html'
You didn't use headers in your request; in my experience, some sites require them. Here are the headers that the browser sends:
:authority:www.southwest.com
:method:POST
:path:/flight/search-flight.html
:scheme:https
accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
accept-encoding:gzip, deflate, br
accept-language:en-US,en;q=0.9
cache-control:max-age=0
content-length:564
content-type:application/x-www-form-urlencoded
origin:https://www.southwest.com
referer:https://www.southwest.com/flight/search-flight.html?int=HOMEQBOMAIR
upgrade-insecure-requests:1
user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36
Note:
I removed the cookie header, because that would be taken care of by requests if you're using session.
The first four headers (those that begin with a colon, ':') are HTTP/2 pseudo-headers and cannot be passed through Python's requests, so I skipped them.
Here's the dict that I used to pass the headers:
rh = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9',
'cache-control': 'max-age=0',
'content-length': '564',
'content-type': 'application/x-www-form-urlencoded',
'origin': 'https://www.southwest.com',
'referer': 'https://www.southwest.com/flight/search-flight.html?int=HOMEQBOMAIR',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'
}
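Rather than retyping the headers by hand, a dict like this can be derived from the text copied out of the Network tab. The following is a minimal sketch under a few assumptions: parse_devtools_headers is a hypothetical helper, the input uses the name:value layout shown in the header list above, and cookie and content-length are dropped on the assumption that the session manages cookies and the HTTP client computes the body length itself.

```python
SKIPPED = {"cookie", "content-length"}


def parse_devtools_headers(raw):
    """Turn copied 'name:value' lines into a dict usable as request headers."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        if not line or line.startswith(":"):
            continue  # blank line or HTTP/2 pseudo-header such as :method
        name, _, value = line.partition(":")
        name = name.strip().lower()
        if name in SKIPPED:
            continue
        headers[name] = value.strip()
    return headers


raw = """
:authority:www.southwest.com
:method:POST
accept-language:en-US,en;q=0.9
content-length:564
content-type:application/x-www-form-urlencoded
origin:https://www.southwest.com
"""
rh = parse_devtools_headers(raw)
print(rh)
```

Splitting on the first colon only keeps values such as https://www.southwest.com intact.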
And here is the form data sent by browser:
fd = {
'toggle_selfltnew': '',
'toggle_AggressiveDrawers': '',
'transitionalAwardSelected': 'false',
'twoWayTrip': 'true',
'originAirport': 'MCI',
# 'originAirport_displayed': 'Kansas City, MO - MCI',
'destinationAirport': 'DAL',
# 'destinationAirport_displayed': 'Dallas (Love Field), TX - DAL',
'airTranRedirect': '',
'returnAirport': 'RoundTrip',
'returnAirport_displayed': '',
'outboundDateString': '02/28/2018',
'outboundTimeOfDay': 'ANYTIME',
'returnDateString': '03/01/2018',
'returnTimeOfDay': 'ANYTIME',
'adultPassengerCount': '1',
'seniorPassengerCount': '0',
'promoCode': '',
'fareType': 'DOLLARS',
'awardCertificateToggleSelected': 'false',
'awardCertificateProductId': ''
}
Note that I commented out two of the items above, but it didn't make any difference. I assumed you'd be having only the location codes and not the full name. If you do have them or if you can extract them from the page, you can send those as well along with other data.
I used data instead of params, so the fields are sent as a form-encoded request body rather than in the URL query string:
with requests.Session() as s:
    r = s.post(url, headers = rh, data = fd)
    soup = BeautifulSoup(r.content, 'lxml')
Finally, here is the result:
>>> soup.find('span', {'class': 'currency_symbol'}).text
'$'
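On data versus params: in requests, params is appended to the URL as a query string, while data is sent in the request body, which is what a content-type of application/x-www-form-urlencoded implies. The sketch below shows the same distinction using only the standard library, so it runs without any network access. The two-field form is a shortened stand-in for the full form data, and the Request objects are built but never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "https://www.southwest.com/flight/search-flight.html"
form = {"originAirport": "MCI", "destinationAirport": "DAL"}
encoded = urlencode(form)

# Equivalent of params=form: the fields end up in the URL, the body is empty.
as_query = Request(url + "?" + encoded, method="POST")

# Equivalent of data=form: the URL is unchanged, the fields travel in the body.
as_body = Request(url, data=encoded.encode("ascii"), method="POST")

print(as_query.full_url, as_query.data)
print(as_body.full_url, as_body.data)
```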
For fun, I'm trying to use Python requests to log on to my school's student portal. This is what I've come up with so far. I'm trying to be very explicit on the headers, because I'm getting a 200 status code (the code you also get when failing to login) instead of a 302 (successful login).
import sys
import os
import requests
def login(username, password):
    url = '(link)/home.html#sign-in-content'
    values = {
        'translator_username' : '',
        'translator_password' : '',
        'translator_ldappassword' : '',
        'returnUrl' : '',
        'serviceName' : 'PS Parent Portal',
        'serviceTicket' : '',
        'pcasServerUrl' : '\/',
        'credentialType' : 'User Id and Password Credential',
        'account' : username,
        'pw' : password,
        'translatorpw' : password
    }
    headers = {
        'accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'accept-encoding' : 'gzip, deflate, br',
        'accept-language' : 'en-US,en;q=0.9',
        'cache-control' : 'max-age=0',
        'connection' : 'keep-alive',
        'content-type' : 'application/x-www-form-urlencoded',
        'host' : '(link)',
        'origin' : '(link)',
        'referer' : '(link)guardian/home.html',
        'upgrade-insecure-requests' : '1'
    }
    with requests.Session() as s:
        p = s.post(url, data=values)
        if p.status_code == 302:
            print(p.text)
        print('Authentication error', p.status_code)
        r = s.get('(link)guardian/home.html')
        print(r.text)
def main():
    login('myname', 'mypass')
if __name__ == '__main__':
    main()
Using Chrome to examine the network requests, all of these headers are under 'Request Headers' in addition to a long cookie number, content-length, and user-agent.
The form data is as follows:
pstoken:(token)
contextData:(text)
translator_username:
translator_password:
translator_ldappassword:
returnUrl:(url)guardian/home.html
serviceName:PS Parent Portal
serviceTicket:
pcasServerUrl:\/
credentialType:User Id and Password Credential
account:f
pw:(id)
translatorpw:
Am I missing something with the headers/form names? Is it a problem with cookies?
If I look at p.request.headers, this is what is sent:
{'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36', 'accept-encoding': 'gzip, deflate, br', 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8', 'connection': 'keep-alive', 'accept-language': 'en-US,en;q=0.9', 'cache-control': 'max-age=0', 'content-type': 'application/x-www-form-urlencoded', 'host': '(url)', 'origin': '(url)', 'referer': '(url)guardian/home.html', 'upgrade-insecure-requests': '1', 'Content-Length': '263'}
p.text gives me the HTML of the login page
Tested with PowerAPI, requests, Mechanize, and RoboBrowser. All fail.
What response do you expect? The way you are checking the response is wrong.
with requests.Session() as s:
    p = s.post(url, data=values)
    if p.status_code == 302:
        print(p.text)
    print('Authentication error', p.status_code)
    r = s.get('(link)guardian/home.html')
    print(r.text)
In your code, you print Authentication error regardless of status_code. I think it should at least look like this:
with requests.Session() as s:
    p = s.post(url, data=values)
    if p.status_code == 302:
        print(p.text)
        r = s.get('(link)guardian/home.html')
        print(r.text)
    else:
        print('Authentication error', p.status_code)
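One caveat when testing for 302: by default requests follows redirects, so after a successful login p.status_code is normally the status of the final page (often 200), and the 302 appears in p.history instead. A minimal sketch of a check that looks at both follows. saw_redirect is a hypothetical helper, and SimpleNamespace objects stand in for real responses so the snippet runs without a network connection; only the status_code and history attributes are modelled.

```python
from types import SimpleNamespace


def saw_redirect(response, code=302):
    """True if the final response or any earlier hop has the given status code."""
    hops = list(response.history) + [response]
    return any(hop.status_code == code for hop in hops)


# Stand-ins shaped like requests.Response objects.
redirect_hop = SimpleNamespace(status_code=302, history=[])
followed = SimpleNamespace(status_code=200, history=[redirect_hop])
login_page = SimpleNamespace(status_code=200, history=[])

print(saw_redirect(followed))
print(saw_redirect(login_page))
```

Alternatively, passing allow_redirects=False to s.post leaves the 302 as the response's own status code, at the cost of having to follow the Location header yourself.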