Web Scraping Python - Immoscout24 - Robot Rejection - python

So I'm trying to build a data science project using information from this site. Sadly, when I try to scrape it, it blocks me because it thinks I am a bot. I saw a couple of posts here: Python webscraping blocked
but it seems that Immoscout has already found a countermeasure to this workaround. Does anybody know how I can get around this? Thanks!
My Code:
import requests
from bs4 import BeautifulSoup
import random
headers = {"User-Agent": "Mozilla/5.0 (Linux; U; Android 4.2.2; he-il; NEO-X5-116A Build/JDQ39) AppleWebKit/534.30 ("
"KHTML, like Gecko) Version/4.0 Safari/534.30 , 'Accept-Language': 'en-US,en;q=0.5'"}
url = "https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-kaufen?enteredFrom=one_step_search"
response = requests.get(url, cookies={'required_cookie': 'reese84=xxx'} ,headers=headers)
webpage = response.content
print(response.status_code)
soup = BeautifulSoup(webpage, "html.parser")
print(soup.prettify())
thanks :)

The data is generated dynamically by an API call that returns a JSON response to a POST request, so you can extract it using only the requests module. You can follow the next example.
import requests

headers = {
    'content-type': 'application/json',
    'x-requested-with': 'XMLHttpRequest'
}

api_url = "https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-kaufen?pagenumber=1"

# The search endpoint answers a POST request with the full result list as JSON.
jsonData = requests.post(api_url, headers=headers).json()

for item in jsonData['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']:
    # The first attribute of each listing entry holds the purchase price.
    value = item['attributes'][0]['attribute'][0]['value'].replace('€', '').replace('.', ',')
    print(value)
Output:
4,350,000
285,000
620,000
590,000
535,000
972,500
579,000
1,399,900
325,000
749,000
290,000
189,900
361,825
199,900
299,000
195,000
1,225,000
199,000
825,000
315,000
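If you need more than the first page, the same endpoint already takes a pagenumber query parameter, so you can loop over it. A minimal sketch along those lines, assuming later pages return the same JSON structure (the page range here is just an example):
import requests

headers = {
    'content-type': 'application/json',
    'x-requested-with': 'XMLHttpRequest'
}
base_url = "https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-kaufen?pagenumber={}"

prices = []
for page in range(1, 4):  # example: first three result pages
    jsonData = requests.post(base_url.format(page), headers=headers).json()
    entries = jsonData['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']
    for item in entries:
        # Same attribute path as above; strip the currency symbol.
        prices.append(item['attributes'][0]['attribute'][0]['value'].replace('€', '').strip())

print(len(prices), "prices collected")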

Related

python requests & beautifulsoup bot detection

I'm trying to scrape all the HTML elements of a page using requests & beautifulsoup. I'm using ASIN (Amazon Standard Identification Number) to get the product details of a page. My code is as follows:
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = urlopen(url)
soup = BeautifulSoup(response, "html.parser")
print(soup)
But the output doesn't show the entire HTML of the page, so I can't do my further work with product details.
Any help on this?
EDIT 1:
From the given answer, it shows the markup of the bot detection page. I researched a bit and found two ways to get past it:
I might need to add a header in the requests, but I couldn't understand what the value of the header should be.
Use Selenium.
Now my question is, do both of these ways provide equal support?
It is better to use fake_useragent here to make things easy. It picks a random user agent from real-world browser usage statistics and sends the request with it. If you don't need dynamic content, you're almost always better off just requesting the page content over HTTP and parsing it programmatically.
import requests
from fake_useragent import UserAgent

# Instead of hard-coding a single User-Agent like this...
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

# ...let fake_useragent pick a random, realistic one.
ua = UserAgent()
hdr = {'User-Agent': ua.random,
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = requests.get(url, headers=hdr)
print(response.content)
Selenium is used for browser automation and high-level web scraping of dynamic content.
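For completeness, a minimal Selenium sketch of the same request; it assumes a local Firefox/geckodriver (or Chrome/chromedriver) install and simply hands the rendered page to BeautifulSoup:
from selenium import webdriver
from bs4 import BeautifulSoup

# Assumes geckodriver is installed and on PATH; use webdriver.Chrome() for Chrome.
driver = webdriver.Firefox()
driver.get("http://www.amazon.com/dp/" + 'B004CNH98C')

# The browser executes the page's JavaScript, so page_source holds the rendered HTML.
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

print(soup.title)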
As some of the comments already suggested, if you need to somehow interact with JavaScript on a page, it is better to use Selenium. However, regarding your first approach of using a header:
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text,"html.parser")
This User-Agent is a bit old, but it should still work. By sending it you are pretending that your request comes from a normal web browser. If you use requests without such a header, your code basically tells the server that the request comes from Python, which many servers reject right away.
Another alternative for you could be fake-useragent; maybe you can also give that a try.
try this:
import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.com/dp/" + 'B004CNH98C'
r = requests.get(url)
html = r.text

# Option 1: print the raw HTML
# print(html)

# Option 2: parse it and print the soup
soup = BeautifulSoup(html.encode("utf-8"), "html.parser")
print(soup)

Unable to scrape ASPX webpage using python?

Statement: I am trying to scrape the URL. I am trying to get the data after the POST request with the parameters as set in the code.
Problem: I am actually getting the HTML of the original .aspx page, and the parameters that I have set in 'formFields' are not applied. Can anyone explain where I am going wrong?
import urllib
import urllib2
uri = 'http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Content-Type': 'text/html; charset=utf-8'
}
formFields = (
(r'__VIEWSTATE',r'yE41BXTcCz0TfHmpBjw1lXLQe3PFLLTPmxSVU0hSeOt6wKGfqvehmJiLNmLj2VdBWrbXtmUIr0bh8lgw8X8f/yn9T1D4zmny5nUAc5HpgouURKeWzvZasUpmJUJ8pgh4jTUo62EVRQCYkXKayQVbCwaSP81BxDO9gxrERvzCUlw8i76A4jzlleSSunjr844drlsOw/PxjgYeRZCLm/h8WAc5HZrJ+w7vLMyLxlY/mDQaYdkVAF/s4lAJAxGfnX1rlshirdphBhI1tZIuoJa+ZTNzizgrXi70PVnAR3cw0QhCWr2rrTkrvoJ+rI5pme0pYPAX+CZfmSH3Cg1BKEbY/+G+p1AsLRqsobC8EBQXHPicqnhgOR7/kx+Z54XyCzxDwXCZBFKl3npcSq4xJ2Ebi3PFS6FtT0K+wZTZ8XGCQUVudzKyqhfDxQ4UTpDWn4vR7LIF765qhXpRmNG6HCzbvgrLqNrBzt+PZ0mbpeLsIWEia5L/AIbN+zcLiNsBxTBA9zOsljZDPkbL1rWo+WDUwBfDRiu6X4ru+RKypczFFzoUUciCaWD2ciOq+//NYa7NEwZ9d7YRY/LfEhszYUJO72Xpzaotxu7+7RdGVvsxzrh1Ro8OHqoskesX6QlEdqjakgrk3yqaroXCfnnBJu1ExKYXrU6JuVFbOGz66CZfkqiJJSVHs2Ozfos/yPWWiJVyETKtPQF003o9GGKKIFIaQ6YRHoco3jVOKB+NIE/VxMJHiSzPuuExZ27MY/pX9U9rx5/4az6i/LIqYoRzQilJT7lN5eTIdVn/W5Uep0tvKtOKhR7JjX7WhO7QpROlOh7H/wbCdr5u/jXB5FpJLB1Y8oZSdVyKJK8Xx+tODFvxYxMve5NT+/fBwlMSfkh7+h7+5Zo5hHKwLF01mrS52KHehix4Ja5bQ3hy6Q2r6pz+ZVHvUsev9OpsZlO1+PPjVCbdjzeUX23At6R4BRm6Zq/o0knl2Xf/kvM6WtKtDi+AbpIa7Eyj+brqWMdDKW4AiOLGS45H4ERP2z1CeVwiXlfa22JhkPbz8GJf9J9iey+JefnmsgD5ScdcvPU1mb1i7NLv9QOYXZXZwJXhGHZcw65iQm7vrZB5sJlBp7agBhhwX2KNaKZguNGVGhlxiS84zoKrkvdBv7e52n6H9p3okMvqHR+yEe+UCuDPanO+9tTvNvOqBzEAVvPYIK80YWsuDY3R66OBPjQEKpbPrDpz5hoMKk59r1FiIq6Jd71c6QeE57Au3ei72GZEVzaLXAva0RJP/tSROnO7ZKHkvxuP0oayVlzjLuIHnO0o4zUsoHpTJqPa20Bxv9De3JwOOi8HJgYj+dZjdRIDT9zHhgbLV9UO4z0HHyK54RIS67OAS8WqMYyfdC5I5GGwy8rosVKNjCfHymMEUzbs5iHCPhrM/X0UMJTxQ7yq113/6U43p6BP4PqP/OAgRYxejrVtT9goPKWxHTwu0kDROXCVvqHo5SiQ+/X3DdTxLF+13p0k5xlXBk0qkeLJkNlSYBeTOgPyvjHxnSMUdjhjHtiA0oFCSSCYCpVU9Pe3PLQyyUjv+KhI/jWS94D3KxYqXjyHUC/nMxEwot65kzFE/scAoOsdk/MJS/uZw4PbfbGEVKWTcJLtOV8s3wHKPzmB/AexZ//iEmDv'),
(r'__VIEWSTATEGENERATOR','AC872EDD'),
(r'__EVENTVALIDATION',r'35e8G73FpRBmb1p/I2BD+iKRsX5vU8sS0eAo0dx/dcp4JGg0Jp+2ZLyg8GHhCmrVcGjQVBABi8oOAgXt+2OghWX0FyVvk0fo8XPTIL/uMJjSBFrRvgsQNnh/Ue1VCOGhBgwhmrBt+VbjmmkA3hafVR2lAVjy/RYswgFcmyloWPyxQDu9UuktSiCNfyfJcF/xZrJDTeHBkNmB9P8QGNHLVnoaZjsmks/yB+Nll5ZLiq0WvDRhsbq1I/rrRWytnleHAMNutsEqdU9OvycQ/insfM871Jww+TyIvmxEVRYcAH6MnYl0nqUaderHN4R37aH8CH8B/NUxGDYLSdlsvJGJxXEZh9EVzzDprJuz7sJUxolBhjv6YNfJDAThrLCip2QEY20SztPZ/j8KnWgnX7Xs6sxjZofKnQxNlB44dZG0okIPitb9zjWB6kC2xDmN69vfDayDOCkcPJG3q/HMP6ewQvV/YheqUbuBwC77WPIKXrc='),
(r'ctl00$ContentPlaceHolder1$optlist', 'State+Wise'),
(r'ctl00$ContentPlaceHolder1$ddlitem' , '22'),
(r'ctl00$ContentPlaceHolder1$optlistChoice', 'List+of+All+Schools'),
(r'ctl00$ContentPlaceHolder1$search', 'Search'),
(r'__EVENTTARGET','ctl00$ContentPlaceHolder1$search')
)
encodedFields = urllib.urlencode(formFields)
req = urllib2.Request(uri, encodedFields, headers)
f= urllib2.urlopen(req)
try:
    fout = open('temp.htm', 'w')
except:
    print('Could not open output file\n')

fout.writelines(f.readlines())
fout.close()
The easiest way is to switch to the requests library and try something like below. Go for this approach only when you are in a hurry and have no time to build the payload properly, since it hard-codes the form data. You can see the same result on the website when you select ASSAM from the dropdown and press the search button.
import requests
from bs4 import BeautifulSoup
URL = "http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx"
payload="__EVENTTARGET=&__EVENTARGUMENT=&__LASTFOCUS=&__VIEWSTATE=sR0epHdr4qQ9SeEuqmUUBWzZHpFAyMj5Xr%2BhGagE8Be7zDbBqQgdx1%2BqrCMY8JfcRJ%2BBhmH8eEKMS%2F3VeboGFTjN0LZz1p2mx3FfY2OVkzs353bmDLWhtvBlOVnyFgcCXGyI9Li3Wp22e5txxKQwtdrKTWheuZLatRvI9wztvyeGueD9ZmEl8gIQHT77fIyt3N%2Bi3dn%2FhUdvvi%2FRqHR%2FaE1YfW3RmmSECDepwYfmBlSH3e4zTDQskW6XoOD7jykryp8L845QWclBE37ttOc5zXndfqkE%2FKCwOxiCxzRVPWCIFD9mBfELERlDOzq0cKwg6hHcE%2B4ZgyxJKBwNbW%2FDBtl%2BzpATY8arhjAg0CCf5CGmYMG1w2aaYz%2FSQ6bKQYWAaiH%2B0t9lrV291vPNj5Gfy1rWwrUhmw%2F6LcvX7uof41OUHtfGt%2Bn1IlxODsPu8i89hjiu5rqprmltTKSzoOZXJuQ1sfyRneShA7WeSIO0M4UmzUVdnjsNlimBmcc7vwAgvF8E6SJ8DaGCbPaLLH6tsI%2BISTBWPWw97o7KQyS%2FPicpZQA2Q5kzCHkAeroJWWrVILlCxEAEuOmzZffW7qIiiUwI3Urbixw40YDkjMPgVZMbb74EiEvmMGr1vNOMcjoICgM54sY5cViGP945D86fRlbS%2B6yYxNH951EZNrmRWnASiT2D%2FEbGMAkOwcYKrOWBrtMXEp8GPsswxryCZp2eANe7ajSioa0utT2cmGG5T7uiypveyfM0LBzgVTMqr5q7oMyUI9ML%2Bx8LHUFSh6SrjP1Nj7xs3%2BFPr2WaXoIs57hhUKfpE7u7Du9EGd1iXapf2WYKbvsl38Ov5u3FcPJ8L%2BLv%2FtksFE7Y7K2ZQ59p0Jqr%2BF%2FbMaKZeRRPp%2BMpW1TblAI8RVP1u%2BTDAHz19jrdpsralLWYoweP3wS5kdDcx68IukuaeGl0ESXcbNYWdFF0pEolfWLBSWcbOpR3YPpxVQD5Eqq3z%2FsYmk%2B%2B%2ByRXOH%2FizUrCqDqQjwbTw4UqRJY5HewB0q0Xp6jrPoT0%2FztPwteGCb6xonx1rzVkr84aOUw9IBUkCcvEtyi0S8VcUyVyii6KK9CThtyTHjbgut%2BdjuIOBlXPL9bD6u2frJiGGacVziPZInp6%2FtKcmnYn89Kcr0Ec1o3VZ6PD6xe8SUyMkupf%2FruNhbK6r6ZAXKZ3zMhkiDsMZYVGUihU43gNj3c8X%2BFD8ONb1827AfaDydgS1JxMtA9z4eOTqmoiPQ61vSr2j8IXokWJSDNg%2FhIn1uChG8BoEst2qZyVoQify0g%2FDMN%2B%2BPMMK7KPqnAfT4P17QI4Gn3mg2kFzfuQdnsQgs7aB0zAt0jrMgCoTDxuwbNvZ1w5BMF0bFbbfxi9QvbCXdpifAWbgAyutlo3wCbD1lIv3NmzLFQ61Mpih5zIOU2z9bpBeH4nClXcAN7QQVQIq8w%3D&__VIEWSTATEGENERATOR=AC872EDD&__VIEWSTATEENCRYPTED=&__EVENTVALIDATION=B7%2FH7uAZc%2FaW7NlReiB%2BrW3ac6DmQz82BOha1KMrmjcJR7%2F3qZByA7RNvmBBI36ivFrYkmYQ8SptltVOT45oBSWn4HG%2FRBbAWPHtcWY7qtl4okgXZ831Q1MTlxdIOkn2uPcoQOsostEzjJ8LVZHkcx%2FVjr6Fb1zImxNbSPDRDVJ1QLmYSCaf2RbJkzmP7ZiqR3w9MXn7GliipkRdhVHlJaDrh7eFy9zOjEcG2Ed2v0NxA5lnpnrXFcE42f9W%2BnLwNfUPR%2BiB95TtvY52ucsD5CgjWqlm9uMrDzHL1kl3WGzg6eU%2BIA9J744%2FRM2TD5JfhPykP6DB9E3E9%2BWzSSJowqwSzOwLNCjcbC%2BvBUb3GPXPwadz%2Fg3pEGTiBWtqdBCeOUiKnkDeDOrno8fS1RPu%2BVx%2F6M1LGWddW2CBUa8m3CizqMfLTGP7HVj4VpnSU0fttCuY26UTZzzMmplPmCjZziEJHd%2F5jc%2Byyf517tk%2BfHA%3D&ctl00%24ContentPlaceHolder1%24optlist=State+Wise&ctl00%24ContentPlaceHolder1%24ddlitem=2+&ctl00%24ContentPlaceHolder1%24optlistChoice=List+of+All+Schools&ctl00%24ContentPlaceHolder1%24search=Search"
with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    s.headers.update({'Content-Type': 'application/x-www-form-urlencoded'})
    res = s.post(URL, data=payload)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("#hyp,#Hyperlink1"):
        print(item.text)
Partial results:
Abanti Kumar Mahanta Educational Institution
adarsh vidyalaya
adarsh vidyalaya
adarsh vidyalaya dahalapara
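A more robust variant is to fetch the page first and read the hidden ASP.NET fields (__VIEWSTATE, __VIEWSTATEGENERATOR, __EVENTVALIDATION) from the HTML instead of hard-coding them, since those tokens change between sessions. A minimal sketch along those lines; the form field names come from the payload above, while the dropdown value for the chosen state is an assumption you would need to confirm in the page source:
import requests
from bs4 import BeautifulSoup

URL = "http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx"

with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}

    # First GET: collect the hidden ASP.NET state fields from the form.
    soup = BeautifulSoup(s.get(URL).text, "lxml")
    data = {
        inp["name"]: inp.get("value", "")
        for inp in soup.select("input[type=hidden]")
        if inp.get("name")
    }

    # Fill in the visible form fields (values assumed from the hard-coded payload above).
    data.update({
        "ctl00$ContentPlaceHolder1$optlist": "State Wise",
        "ctl00$ContentPlaceHolder1$ddlitem": "2 ",          # assumed dropdown code for ASSAM
        "ctl00$ContentPlaceHolder1$optlistChoice": "List of All Schools",
        "ctl00$ContentPlaceHolder1$search": "Search",
    })

    res = s.post(URL, data=data)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("#hyp,#Hyperlink1"):
        print(item.text)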

How to use urllib and re to retrieve live price data with Python

I am attempting to request the price data from dukascopy.com but I am running into a similar problem to this user, where the price data itself is not a part of the html. Therefore, when I run my basic urllib code to extract the data:
import urllib.request
url = 'https://www.dukascopy.com'
headers = {'User-Agent':'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
req = urllib.request.Request(url, headers = headers)
resp = urllib.request.urlopen(req)
respData = resp.read()
print(str(respData))
the price data cannot be found. Referring back to this post, the user Mark found another URL that the data was called from. Can this be applied to collect the data here as well?
Try dryscrape. You can scrape JavaScript-rendered pages with it. Don't parse web pages with the re module; it's not a good idea. Read this on why you should not parse HTML pages with regex: HTML with regex. Use BeautifulSoup for parsing.
import dryscrape
from bs4 import BeautifulSoup

url = 'https://www.dukascopy.com'
session = dryscrape.Session()
session.visit(url)
response = session.body()
soup = BeautifulSoup(response, "html.parser")
print(soup)
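If dryscrape is not an option on your system, Selenium is the other common way to get JavaScript-rendered content, as mentioned in the answers above. A rough sketch, assuming a local Chrome/chromedriver setup; the CSS selector for the price elements is a placeholder you would need to find in the rendered page:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

options = Options()
options.add_argument("--headless")          # run without opening a browser window
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH

driver.get("https://www.dukascopy.com")
time.sleep(5)                               # crude wait for the JavaScript to render the quotes

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# '.price-cell' is a hypothetical selector; inspect the rendered page for the real one.
for cell in soup.select(".price-cell"):
    print(cell.get_text(strip=True))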

Difficulty in Making HTTP Login Request with Requests Library(Python)

This is my first time posting on Stack Overflow, so I'll do my best to get right to the point regarding my issue. I've just begun delving into the formulation of HTTP requests for web scraping purposes, and I decided to choose one site to practice logging in using the requests library in Python. I already took the liberty of extracting the csrfKey from the HTML on the first GET, but after the POST I still end up on the login page with the fields filled out and do not successfully log in. Any help would be much appreciated, as I'm completely stumped on what I should try next. Thank you all!
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36'}
payload = {
    'auth': 'username',
    'password': 'pass',
    'login_standard_submitted:': '1',
    'remember_me': '0',
    'remember_me_checkbox': '1',
    'signin_anonymous': '0'
}
s = requests.Session()
r = s.get("http://www.gaming-asylum.com/forums/index.php?/",headers=headers)
soup = BeautifulSoup(r.content)
payload['csrfKey'] = str(soup.find('input',{'name':'csrfKey'}).get('value'))
headers['Content-Type'] = 'application/x-www-form-urlencoded'
headers['Referer'] = 'http://www.gaming-asylum.com/forums/?_fromLogout=1&_fromLogin=1'
headers['Upgrade-Insecure-Requests']='1'
headers['Origin']='http://www.gaming-asylum.com'
r= s.post("http://www.gaming-asylum.com/forums/index.php?/login/",headers=headers,data=payload)

Python's Requests or Mechanize to login to a site

I want to start off by apologizing. I know this has more than likely been done more than enough times, and I'm just beating a dead horse, but I'd really like to know how to get this to work. I am trying to use the Requests module for Python in order to log in to a website and verify that it works. I'm also using BeautifulSoup in the code in order to find some strings that I have to use to process the request.
I'm getting hung up on how to properly form the header. What exactly is necessary in the header information?
import requests
from bs4 import BeautifulSoup
session = requests.session()
requester = session.get('http://iweb.craven.k12.nc.us/index.php')
soup = BeautifulSoup(requester.text)
ps = soup.find_all('input')
def getCookieInfo():
    result = []
    for item in ps:
        if (item.attrs['name'] == 'return' and item.attrs['type'] == 'hidden'):
            strcom = item.attrs['value']
            sibling = item.next_sibling.next_sibling.attrs['name']
            result.append(strcom)
            result.append(sibling)
    return result

cookiedInfo = getCookieInfo()

payload = [('username', 'myUsername'),
           ('password', 'myPassword'),
           ('Submit', 'Log in'),
           ('option', 'com_users'),
           ('task', 'user.login'),
           ('return', cookiedInfo[0]),
           (cookiedInfo[1], '1')
           ]

headers = {
    'Connection': 'keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Origin': 'http://iweb.craven.k12.nc.us',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'
}
r = session.post('http://iweb.craven.k12.nc.us/index.php', data=payload, headers=headers)
r = session.get('http://iweb.craven.k12.nc.us')
soup = BeautifulSoup(r.text)
Also, if it would be better/more Pythonic to utilize the mechanize module, I would be open to suggestions.
