Can someone tell me how to read this file in Python?

This is the result of using scrapy-splash in Python after browsing a LinkedIn page. Here is the beginning of the response body.
b'<html><head></head><body>\x1f\xef\xbf\xbd\x08\x03\xef\xbf\xbd\xef\xbf\xbdko+I\xef\xbf\xbd \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\x0f1\x1cT]]\xef\xbf[...]
I have no clue how to read this. Thanks.

So it looks like it was coming from the line I commented out below ... no idea why.
lua_script = """
function main(splash)
    splash.private_mode_enabled = false
    assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
    })
    assert(splash:wait(5))
    return {html=splash:html()}
end
"""
yield SplashRequest(url=self.url, callback=self.parse,
                    endpoint='render.html',
                    args={'lua_source': lua_script,
                          'wait': 5,
                          'private_mode_enabled': 'false',
                          },
                    headers={
                        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
                        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
                        'accept-language': 'en-US,en;q=0.9,fr;q=0.8',
                        # 'accept-encoding': 'gzip, deflate, br',  # if this header is sent, the response comes back gzip-compressed and unreadable
                        'referer': 'https://www.google.com/',
                        'upgrade-insecure-requests': 1,
                    },
                    )
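For context: the \x1f followed by \xef\xbf\xbd (the UTF-8 replacement character) at the start of the dump is the gzip magic number 0x1f 0x8b with its second byte mangled by a lossy text decode. In other words, the server honoured the accept-encoding header and sent gzip-compressed HTML, which Splash passed through as if it were text. If you do keep that header, you have to grab the raw bytes before any decoding and decompress them yourself. A minimal sketch, assuming response.body still holds the intact compressed payload:
import gzip

raw = response.body                       # assumption: untouched bytes, not an already-decoded string
if raw[:2] == b'\x1f\x8b':                # gzip magic number
    html = gzip.decompress(raw).decode('utf-8')
else:
    html = raw.decode('utf-8')
Note that once the bytes contain \xef\xbf\xbd replacement characters, as in the dump above, the stream is already corrupted and cannot be decompressed; the simpler fix is the one shown, dropping the accept-encoding header.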

Related

Can't fetch json content from a stubborn webpage using scrapy

I'm trying to create a script using scrapy to grab JSON content from this webpage. I've set the headers within the script accordingly, but when I run it I always end up with a JSONDecodeError. The site sometimes throws a captcha, but not always; however, I've never had any success with the script below, even when using a VPN. How can I fix it?
This is how I've tried:
import scrapy
import urllib.parse

class ImmobilienScoutSpider(scrapy.Spider):
    name = "immobilienscout"
    start_url = "https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen"
    headers = {
        'accept': 'application/json; charset=utf-8',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en;q=0.9',
        'x-requested-with': 'XMLHttpRequest',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
    }
    params = {
        'price': '1000.0-',
        'constructionyear': '-2000',
        'pagenumber': '1'
    }

    def start_requests(self):
        req_url = f'{self.start_url}?{urllib.parse.urlencode(self.params)}'
        yield scrapy.Request(
            url=req_url,
            headers=self.headers,
            callback=self.parse,
        )

    def parse(self, response):
        yield {"response": response.json()}
This is what the output should look like (truncated):
{"searchResponseModel":{"additional":{"lastSearchApiUrl":"/region?realestatetype=apartmentbuy&price=1000.0-&constructionyear=-2000&pagesize=20&geocodes=1276010&pagenumber=1","title":"Eigentumswohnung in Nordrhein-Westfalen - ImmoScout24","sortingOptions":[{"description":"Standardsortierung","code":0},{"description":"Kaufpreis (höchste zuerst)","code":3},{"description":"Kaufpreis (niedrigste zuerst)","code":4},{"description":"Zimmeranzahl (höchste zuerst)","code":5},{"description":"Zimmeranzahl (niedrigste zuerst)","code":6},{"description":"Wohnfläche (größte zuerst)","code":7},{"description":"Wohnfläche (kleinste zuerst)","code":8},{"description":"Neubau-Projekte (Projekte zuerst)","code":31},{"description":"Aktualität (neueste zuerst)","code":2}],"pagerTemplate":"|Suche|de|nordrhein-westfalen|wohnung-kaufen?price=1000.0-&constructionyear=-2000&pagenumber=%page%","sortingTemplate":"|Suche|de|nordrhein-westfalen|wohnung-kaufen?price=1000.0-&constructionyear=-2000&sorting=%sorting%","world":"LIVING","international":false,"device":{"deviceType":"NORMAL","devicePlatform":"UNKNOWN","tablet":false,"mobile":false,"normal":true}
EDIT:
This is what the equivalent script built on the requests module looks like:
import requests

link = 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen'
headers = {
    'accept': 'application/json; charset=utf-8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'x-requested-with': 'XMLHttpRequest',
    'content-type': 'application/json; charset=utf-8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
    'referer': 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?price=1000.0-&constructionyear=-2000&pagenumber=1',
    # 'cookie': 'hardcoded cookies'
}
params = {
    'price': '1000.0-',
    'constructionyear': '-2000',
    'pagenumber': '2'
}
sess = requests.Session()
sess.headers.update(headers)
resp = sess.get(link, params=params)
print(resp.json())
Scrapy's CookiesMiddleware disregards 'cookie' passed in headers.
Reference: scrapy/scrapy#1992
Pass cookies explicitly:
yield scrapy.Request(
    url=req_url,
    headers=self.headers,
    callback=self.parse,
    # Add the following line (it needs `import http.cookies` at the top of the file):
    cookies={k: v.value for k, v in http.cookies.SimpleCookie(self.headers.get('cookie', '')).items()},
)
Note: That site uses GeeTest CAPTCHA, which cannot be solved by simply rendering the page or using Selenium, so you still need to periodically update the hardcoded cookie (cookie name: reese84) taken from the browser, or use a service like 2Captcha.
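For reference, http.cookies.SimpleCookie from the standard library is what turns a raw cookie header string into the dict Scrapy expects. A minimal standalone sketch (the cookie values below are placeholders, not real ones):
import http.cookies

raw = 'reese84=abc123; other=xyz789'  # placeholder string copied from the browser's cookie header
jar = http.cookies.SimpleCookie(raw)
cookies = {name: morsel.value for name, morsel in jar.items()}
print(cookies)  # {'reese84': 'abc123', 'other': 'xyz789'}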

How to deal with changing JSESSIONID in Python?

This is in continuation of my post here.
I have figured out that I need a JSESSIONID in my headers to get the data with Python. The problem is that the JSESSIONID changes every time. How do I deal with this? Below is my code:
import requests

sym_1 = 'NIFTY'
exp_date = '26MAY2022'
headers_1 = {
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36',
    'cookie': 'JSESSIONID=8EB69BB64441BB6906DD7A241B1AAC82;'
}
url_1 = "https://opstra.definedge.com/api/openinterest/optionchain/free/" + sym_1 + "&" + exp_date
text_data = requests.get(url_1, headers=headers_1)
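A common way around a rotating JSESSIONID is to let requests.Session manage the cookie instead of hardcoding it: visit a regular page first so the server issues a fresh session cookie, then reuse the same session for the API call. A minimal sketch, assuming that loading https://opstra.definedge.com/ sets the JSESSIONID (that URL and behaviour are assumptions, not verified against the site):
import requests

sess = requests.Session()
sess.headers.update({
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36',
})
# Assumption: the landing page sets a fresh JSESSIONID cookie on the session.
sess.get('https://opstra.definedge.com/')
# The session's cookie jar now carries the JSESSIONID automatically.
resp = sess.get('https://opstra.definedge.com/api/openinterest/optionchain/free/NIFTY&26MAY2022')
print(resp.status_code)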

Loop through Dataframe column of URLs and parse out html tag

This shouldn't be too hard, although I can't figure it out; I'm betting I'm making a dumb mistake.
Here's the code that works on an individual link and returns the Zestimate (the req_headers variable prevents the site from throwing a captcha):
import requests
from bs4 import BeautifulSoup

req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
link = 'https://www.zillow.com/homedetails/1404-Clearwing-Cir-Georgetown-TX-78626/121721750_zpid/'
test_soup = BeautifulSoup(requests.get(link, headers=req_headers).content, 'html.parser')
results = test_soup.select_one('h4:contains("Home value")').find_next('p').get_text(strip=True)
print(results)
Here's the code I'm trying to get to work; it should return the Zestimate for each link and add it to a new dataframe column, but I get AttributeError: 'NoneType' object has no attribute 'find_next'. (Also, imagine I have a dataframe column of different Zillow house links.)
req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
for link in df['links']:
    test_soup = BeautifulSoup(requests.get(link, headers=req_headers).content, 'html.parser')
    results = test_soup.select_one('h4:contains("Home value")').find_next('p').get_text(strip=True)
    df['zestimate'] = results
Any help is appreciated.
It turned out I had a space before and after the links in my dataframe column. That was it; the code works fine. Just an oversight on my part. Thanks, all.
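For anyone hitting the same error: stripping the whitespace first, and collecting one result per row rather than overwriting the whole column on every iteration, looks roughly like this (a sketch reusing df, req_headers, and the selector from the question):
df['links'] = df['links'].str.strip()  # remove the stray leading/trailing spaces

zestimates = []
for link in df['links']:
    soup = BeautifulSoup(requests.get(link, headers=req_headers).content, 'html.parser')
    node = soup.select_one('h4:contains("Home value")')
    # Guard against pages where the heading is missing, which is what raised the AttributeError.
    zestimates.append(node.find_next('p').get_text(strip=True) if node else None)
df['zestimate'] = zestimates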

Python Requests post on a site not working

I am trying to scrape property information from https://www.ura.gov.sg/realEstateIIWeb/resiRental/search.action using Python Requests. Using Chrome I inspected the POST request and emulated it with requests, using a session to maintain cookies. When I try my code, the website returns "missing parameters in search query", so obviously something is wrong with my request (though it is not obvious what).
Doing some digging, there was one cookie that I did not get when doing requests.get on the search side, so I added it manually. Still no go. I also tried emulating the request headers exactly; it still does not return the correct results.
The only time I have gotten it to work is when I manually copy the cookies from my browser to the Python request object.
import requests

url = 'https://www.ura.gov.sg/realEstateIIWeb/resiRental/submitSearch.action;jsessionid={}'
values = {
    'submissionType': 'pn',
    'from_Date_Prj': 'JAN-2014',
    'to_Date_Prj': 'JAN-2016',
    '__multiselect_projectNameList': '',
    'selectedProjects': '10 SHELFORD',
    '__multiselect_selectedProjects': '',
    'propertyType': 'lp',
    'from_Date': 'JAN-2016',
    'to_Date': 'JAN-2016',
    '__multiselect_postalDistrictList': '',
    '__multiselect_selectedPostalDistricts': ''
}
header1 = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8,nb;q=0.6,no;q=0.4',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'www.ura.gov.sg',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
}
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8,nb;q=0.6,no;q=0.4',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': 'www.ura.gov.sg',
    'Origin': 'https://www.ura.gov.sg',
    'Referer': 'https://www.ura.gov.sg/realEstateIIWeb/resiRental/search.action',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
}
with requests.Session() as r:
    page1 = r.get('https://www.ura.gov.sg/realEstateIIWeb/resiRental/search.action', headers=header1)
    requests.utils.add_dict_to_cookiejar(r.cookies, {'BIGipServerpl-prod_iis_web_v4': '3334383808.20480.0000'})
    page2 = r.post(url.format(r.cookies.get_dict()['JSESSIONID']), data=values, headers=headers)
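Since the only approach that worked was copying the browser's cookies over, here is a minimal sketch of that workaround on the same session (the values are placeholders; paste the real ones from the browser's developer tools, reusing url, values, and headers from above):
# Placeholder values: copy the real cookie values from the browser's dev tools.
browser_cookies = {
    'JSESSIONID': 'PASTE-VALUE-FROM-BROWSER',
    'BIGipServerpl-prod_iis_web_v4': 'PASTE-VALUE-FROM-BROWSER',
}
with requests.Session() as r:
    requests.utils.add_dict_to_cookiejar(r.cookies, browser_cookies)
    page2 = r.post(url.format(browser_cookies['JSESSIONID']), data=values, headers=headers)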

Changing the referer URL in python requests

How do I change the referer when using the requests library to make a GET request to a web page? I went through the entire manual but couldn't find it.
According to http://docs.python-requests.org/en/latest/user/advanced/#session-objects , you should be able to do:
s = requests.Session()
s.headers.update({'referer': my_referer})
s.get(url)
Or just:
requests.get(url, headers={'referer': my_referer})
Your headers dict will be merged with the default/session headers. From the docs:
Any dictionaries that you pass to a request method will be merged with the session-level values that are set. The method-level parameters override session parameters.
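A quick illustration of that merge behaviour (httpbin.org is used here as a convenient echo service; any header-echoing endpoint works):
import requests

s = requests.Session()
s.headers.update({'referer': 'https://session-default.example/'})
# The method-level referer overrides the session-level one for this request only.
r = s.get('https://httpbin.org/headers', headers={'referer': 'https://per-request.example/'})
print(r.json()['headers'].get('Referer'))  # https://per-request.example/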
Here we rotate the User-Agent together with the Referer:
import random

user_agent_list = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
    "Mozilla/5.0 (iPad; CPU OS 15_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/104.0.5112.99 Mobile/15E148 Safari/604.1"
]
reffer_list = [
    'https://stackoverflow.com/',
    'https://twitter.com/',
    'https://www.google.co.in/',
    'https://gem.gov.in/'
]
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': random.choice(user_agent_list),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9,fr;q=0.8',
    'referer': random.choice(reffer_list)
}
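To actually send the rotated headers, pass the dict on each request; a minimal usage sketch (the target URL is a placeholder):
import requests

url = 'https://example.com/'  # placeholder target
response = requests.get(url, headers=headers)
print(response.status_code, response.request.headers.get('referer'))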
