Scraping phone number from Olx ad - python

I'm trying to create a scraper for the OLX website (www.olx.pl) using requests and BeautifulSoup. I don't have any problems with most of the data, but the phone number is hidden (one has to click it first). I've already used Chrome's inspector to see what happens in the "Network" tab when I click it manually.
There is an AJAX request with this payload: "?pt=5d1480fbad0a1f2006e865bfdf7a6fb07f244b82e17ab0ea4c5eaddc43f9da391b098e1926642564ffb781655d55be270c6913f7526a08298f43b24c0169636b"
This is the phoneToken, which can be found in the page source (it changes on each page load).
I tried to send the same kind of request using the requests library, but I got "000 000 000" in response.
I can get the phone number using Selenium, but it is very slow.
The question is:
Is there a way to get around those security phone tokens?
or
How can I speed up Selenium so it scrapes the phone number in, let's say, 1-2 seconds?
Ad example:
https://www.olx.pl/561666735
EDIT:
Actually, now the response tells me that my IP address is blocked (but only when using requests; the IP is not blocked when I load the page manually).
Unfortunately I made some changes and can no longer reproduce the code that returned '000 000 000'. This is part of my code right now:
def scrape_phone(id):
    s = requests.Session()
    url = "https://www.olx.pl/{}".format(id)
    response = s.get(url, headers=headers)
    page_text = response.text
    # getting short id
    index_of_short_id = page_text.index("'id':'")
    short_id = page_text[index_of_short_id:index_of_short_id+11].split("'")[-1]
    # getting phone token
    index_of_token = page_text.index("phoneToken")
    phone_token = page_text[index_of_token+10:index_of_token+150].split("'")[1]
    url = "https://www.olx.pl/ajax/misc/contact/phone/{}".format(short_id)
    data = {
        'pt': phone_token
    }
    response = s.post(url, data=data, headers=headers)
    print(response.text)

scrape_phone(540006276)
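As a side note, the index-and-slice extraction above is fragile against small markup changes. A regex makes the phoneToken lookup easier to keep correct. A sketch (the sample string is hypothetical; the exact quoting in OLX's page source may differ, so the pattern accepts either quote style):

```python
import re

def extract_phone_token(page_text):
    """Pull the phoneToken value out of the ad page source.

    Accepts ':' or '=' as the separator and either quote character,
    since the exact markup may change between page versions.
    Returns None if no token is found.
    """
    match = re.search(r"phoneToken['\"]?\s*[:=]\s*['\"]([0-9a-f]+)['\"]",
                      page_text)
    return match.group(1) if match else None

sample = "var phoneToken = '5d1480fbad0a1f';"  # hypothetical page snippet
print(extract_phone_token(sample))  # 5d1480fbad0a1f
```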

Related

Unable to get HTML code when web scraping Instagram Error 429

I am trying to scrape Instagram accounts to find out how many posts they have (a number visible on all public and private profiles). I am writing a script to do so.
I tried using the requests library to obtain the HTML code, but when I run the code below:
import requests
url = 'https://www.instagram.com/instagram/'
r = requests.get(url)
print(r)
print(r.text)
I keep getting the 429 Too Many Requests error even though I have sent fewer than 10 requests in total (I am running the script manually). I also tried following https://stackoverflow.com/questions/22786068/how-to-avoid-http-error-429-too-many-requests-python and using requests.get(link, headers={'User-agent': 'your bot 0.1'}), but that does not work either.
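A 429 means the server is rate-limiting you, and the standard response is to honor its Retry-After header and back off between attempts. A sketch (the delay helper is the reusable part; Instagram may still require a logged-in session or its official API regardless, so this only addresses the rate-limiting side):

```python
import time
import requests

def backoff_delay(attempt, retry_after=None, base=2.0):
    """Seconds to wait before retry number `attempt` (0-based): honor
    the server's Retry-After header when present, otherwise back off
    exponentially (base**attempt)."""
    if retry_after is not None:
        return float(retry_after)
    return base ** attempt

def get_with_backoff(url, max_attempts=4):
    # A browser-like User-Agent; whether Instagram accepts it is not
    # guaranteed -- treat this as a sketch of the retry pattern.
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}
    response = None
    for attempt in range(max_attempts):
        response = requests.get(url, headers=headers)
        if response.status_code != 429:
            break
        time.sleep(backoff_delay(attempt, response.headers.get('Retry-After')))
    return response

print(backoff_delay(0), backoff_delay(2))   # 1.0 4.0
print(backoff_delay(1, retry_after='7'))    # 7.0
```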

BeautifulSoup not returning the title of page

I tried to get the title of a web page by scraping with the Beautifulsoup4 Python module, and it returns the string "Not Acceptable!" as the title, but when I open the page in a browser the title is different. I tried looping through a list of links and extracting the titles of all the pages, and it returns the same "Not Acceptable!" string for every link.
Here is the Python code:
from bs4 import BeautifulSoup
import requests
URL = 'https://insights.blackcoffer.com/how-is-login-logout-time-tracking-for-employees-in-office-done-by-ai/'
result = requests.get(URL)
doc = BeautifulSoup(result.text, 'html.parser')
tag = doc.title
print(tag.get_text())
(The corresponding web page is the URL used in the code above.)
I don't know whether it is a problem with Beautifulsoup4 or with the requests library. Is it because the site has bot protection enabled and doesn't return the HTML when I send the request?
The server expects the User-Agent header. Interestingly, it is happy with any User-Agent, even a fictitious one:
result = requests.get(URL, headers = {'User-Agent': 'My User Agent 1.0'})
An easy way to debug this kind of issue is to print (or write to a file) response.text. Some servers don't allow scraping, and some websites generate their HTML with JavaScript at runtime (e.g. YouTube); these are scenarios where response.text can differ from the page source you see in the browser. The text below is what the server returned.
<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>
Edit:
As pointed out by DYZ, this is a 406 error, and the User-Agent header was missing from the request.
https://www.exai.com/blog/406-not-acceptable
The 406 Not Acceptable status code is a client-side error. It's part of the HTTP response status codes in the 4xx category, which are considered client error responses.
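The symptom is easy to reproduce offline: the "Not Acceptable!" string really is the title of the error page the server sends back, which is why doc.title picks it up. A sketch parsing the returned error page (with the User-Agent header added as in the answer above, the same helper returns the real page title instead):

```python
from bs4 import BeautifulSoup

def page_title(html):
    """Extract the <title> text from an HTML document."""
    return BeautifulSoup(html, 'html.parser').title.get_text()

# The error page the server returned when no User-Agent was sent:
error_html = ('<head><title>Not Acceptable!</title></head><body>'
              '<h1>Not Acceptable!</h1></body></html>')
print(page_title(error_html))  # Not Acceptable!
```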

Webscraping with python using request issue with geolocation

I am currently scraping the amazon.com website with requests and proxies.
I'm using the requests module with Python 3.
The issue I am facing is that Amazon returns results based on the geolocation of the proxy rather than the results amazon.com shows for the US.
How can I make sure that I only get the US results from amazon.com?
Here is my code:
import requests
url = 'https://www.amazon.com/s?k=a%20circuit'
proxy_dict = # Proxy details in the dictionary form
response = requests.get(url, proxies=proxy_dict, timeout=(3, 3))
htmlText = response.text
# I am then saving the details in a text file.
The code works fine, but it returns the results for the location of the proxy.
How can I make sure that the results will be only for the US?
Should I perhaps add something to the URL or to the headers?
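One way to approach this is to verify where each proxy actually exits before using it, since Amazon localizes on the exit IP. A sketch; ip-api.com is just one example of a GeoIP lookup service (an assumption on my part, any equivalent service works), and the JSON-checking helper is the reusable part:

```python
import json
import requests

def is_us_exit(geo_json_text):
    """True if a GeoIP lookup response reports a US exit IP.

    Assumes the service returns JSON with a 'countryCode' field,
    as e.g. ip-api.com does.
    """
    return json.loads(geo_json_text).get('countryCode') == 'US'

def check_proxy(proxy_dict):
    # Ask the lookup service *through* the proxy, so we learn the
    # country of the proxy's exit IP rather than our own.
    r = requests.get('http://ip-api.com/json', proxies=proxy_dict,
                     timeout=(3, 3))
    return is_us_exit(r.text)

print(is_us_exit('{"countryCode": "US"}'))  # True
print(is_us_exit('{"countryCode": "DE"}'))  # False
```

Sending an `Accept-Language: en-US` header can also help, but the decisive factor is the exit IP.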

How can I set the cookie by using requests in python?

Hello, I'm trying to get information from a website that requires logging in.
I already get a 200 response from the request URL where I POST the ID, password, and other fields.
The headers dict contains the request headers that can be seen in the Chrome developer tools Network tab; the form data dict contains the ID and password.
login_site = requests.post(requestUrl, headers=headers, data=form_data)
status_code = login_site.status_code
print(status_code)
I got 200
The code below shows what I've tried.
1. Session.
When I tried to set cookies with a session, it failed. I've heard that a session can carry the cookies when I scrape other pages that require login.
session = requests.Session()
session.post(requestUrl, headers=headers, data=form_data)
test = session.get('~~') #the website that I want to scrape
print(test.status_code)
I got 403
2. Manually set cookies
I manually built the cookies dict from what I could get:
cookies = {'wcs_bt':'...','_production_session_id':'...'}
r = requests.post('http://engoo.co.kr/dashboard', cookies = cookies)
print(r.status_code)
I also got 403
Actually, I don't know what I should write in the cookies dict. When I get 'wcs_bt=AAA; _production_session_id=BBB; _ga=CCC;', should I change it to the dict {'wcs_bt':'AAA', ...}?
When I print the cookies:
login_site = requests.post(requestUrl, headers=headers, data=form_data)
print(login_site.cookies)
all I can get is
<RequestsCookieJar[<Cookie _production_session_id=BBB>]>
Somehow that failed as well.
How can I scrape it with the cookie?
Scraping a modern (circa 2017 or later) Web site that requires a login can be very tricky, because it's likely that some important portion of the login process is implemented in Javascript.
Unless you execute that Javascript exactly as a browser would, you won't be able to complete the login. Unfortunately, the basic Python libraries won't help.
Consider Selenium with Python, which is used for testing Web sites but can be used to automate any interaction with a Web site.
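If plain requests is still worth a try first, two details from the question can be pinned down: converting a raw cookie header string into a dict, and letting a Session carry cookies automatically. A sketch, assuming the login really is a plain form POST and not JavaScript-driven (the URLs and field names are placeholders):

```python
import requests

def cookie_string_to_dict(cookie_string):
    """Turn 'wcs_bt=AAA; _production_session_id=BBB' into
    {'wcs_bt': 'AAA', '_production_session_id': 'BBB'}."""
    cookies = {}
    for part in cookie_string.split(';'):
        name, sep, value = part.strip().partition('=')
        if sep:  # skip empty fragments such as a trailing ';'
            cookies[name] = value
    return cookies

def fetch_after_login(login_url, form_data, protected_url, headers):
    # A Session stores every Set-Cookie it receives and replays the
    # cookies on later requests, so nothing needs copying by hand.
    session = requests.Session()
    session.get(login_url, headers=headers)   # pick up pre-login cookies
    session.post(login_url, headers=headers, data=form_data)
    return session.get(protected_url, headers=headers)

print(cookie_string_to_dict('wcs_bt=AAA; _ga=CCC;'))
# {'wcs_bt': 'AAA', '_ga': 'CCC'}
```

If the 403 persists even with this flow, the login step likely depends on JavaScript, and the Selenium route above is the practical answer.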

Why BeautifulSoup and lxml don't work?

I'm using the mechanize library to log in to a website. I checked, and it works well. But the problem is that I can't use response.read() with BeautifulSoup or 'lxml'.
#BeautifulSoup
response = browser.open(url)
source = response.read()
soup = BeautifulSoup(source) #source.txt doesn't work either
for link in soup.findAll('a', {'class':'someClass'}):
    some_list.add(link)
This doesn't work; it doesn't actually find any tags. It works well when I use requests.get(url).
#lxml->html
response = browser.open(url)
source = response.read()
tree = html.fromstring(source) #souce.txt doesn't work either
print tree.text
like_pages = buyers = tree.xpath('//a[@class="UFINoWrap"]') #/text() doesn't work either
print like_pages
Doesn't print anything. I know there is a problem with the return type of response, since it works well with requests.get(url). What can I do? Could you please provide sample code where response.read() is used in HTML parsing?
By the way, what is difference between response and requests objects?
Thank you!
I found the solution. It is because mechanize.Browser is an emulated browser and only gets the raw HTML. The page I wanted to scrape adds the class to the tag with JavaScript, so those classes were not in the raw HTML. The best option is to use a webdriver. I used Selenium for Python. Here is the code:
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.set_preference('network.http.phishy-userpass-length', 255)
driver = webdriver.Firefox(firefox_profile=profile)
driver.get(url)
list = driver.find_elements_by_xpath('//a[@class="someClass"]')
Note: You need to have Firefox installed. Or you can choose another profile according to browser you want to use.
A request is what a web client sends to a server, with details about what URL the client wants, what http verb to use (get / post, etc), and if you are submitting a form the request typically contains the data you put in the form.
A response is what a web server sends back in reply to a request from a client. The response has a status code which indicates if the request was successful (code 200 usually if there were no problems, or an error code like 404 or 500). The response usually contains data, like the html in a page, or the binary data in a jpeg. The response also has headers that give more information about what data is in the response (e.g. the "Content-Type" header which says what format the data is in).
Quote from @davidbuxton's answer on this link.
Good luck!
