Web scraping with Python using requests: issue with proxy geolocation

I am currently scraping amazon.com using the requests module (Python 3) together with proxies.
The issue I am facing is that Amazon returns results based on the geolocation of the proxy rather than the results amazon.com would show for the US.
How can I make sure that I only get the US results from amazon.com?
Here is my code:
import requests

url = 'https://www.amazon.com/s?k=a%20circuit'
proxy_dict = {}  # proxy details in dictionary form, e.g. {'https': 'http://host:port'}
response = requests.get(url, proxies=proxy_dict, timeout=(3, 3))
htmlText = response.text
# I then save the details to a text file.
The code works fine, but it returns the results for the proxy's location.
How can I make sure the results are only for the US?
Should I perhaps add something to the URL, or something to the headers?
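One common approach (hedged: the cookie names below and whether Amazon honors them over the proxy's geolocation are assumptions, not documented behavior) is to request the US locale explicitly through headers and cookies while still routing through the proxy:

import requests

url = 'https://www.amazon.com/s?k=a%20circuit'
proxy_dict = {}  # proxy details in dictionary form

# Ask for the US locale explicitly; whether Amazon respects these
# hints instead of the proxy's geolocation is an assumption to verify.
headers = {
    'Accept-Language': 'en-US,en;q=0.9',
    'User-Agent': 'Mozilla/5.0',
}
cookies = {'lc-main': 'en_US', 'i18n-prefs': 'USD'}  # hypothetical locale cookies

response = requests.get(url, headers=headers, cookies=cookies,
                        proxies=proxy_dict, timeout=(3, 3))

If that is not enough, the simplest reliable fix is to use US-based proxies, since the results are keyed to the exit IP's location.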

Related

Log into a website using Requests module in Python

I am new to Python and web scraping, but I keep learning. I have managed to get some exciting results using the BeautifulSoup and Requests libraries, and my next goal is to log into a website that allows remote access to my heating system, to do some web scraping and maybe extend its capabilities further.
Unfortunately, I got stuck. I used Mozilla's web dev tools to see the URL that the form posts to and the name attributes of the username and password fields. The web page URL is https://emodul.pl/login and the request payload looks as follows:
{"username":"my_username","password":"my_password","rememberMe":false,"languageId":"en","remote":false}
I am using a requests.Session() instance to make a POST request to the login URL with the above-mentioned payload:
import requests
url = 'https://emodul.pl/login'
payload = {'username':'my_username','password':'my_password','rememberMe':False,'languageId':'en','remote':False}
with requests.Session() as s:
    p = s.post(url, data=payload)
    print(p.text)
Apparently I'm doing something wrong because I'm getting the "error":"Sorry, something went wrong. Try again." response.
Any advice will be much appreciated.
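One thing worth checking (an educated guess, not a confirmed fix): the payload captured in the dev tools is JSON, but data=payload sends it form-encoded. requests can send the dict as a JSON body with the json= parameter instead:

import requests

url = 'https://emodul.pl/login'
payload = {'username': 'my_username', 'password': 'my_password',
           'rememberMe': False, 'languageId': 'en', 'remote': False}

with requests.Session() as s:
    # json= serializes the dict and sets Content-Type: application/json,
    # matching the request payload observed in the browser.
    p = s.post(url, json=payload)
    print(p.status_code, p.text)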

Why does using Amazon API Gateway give the wrong HTML page when using requests.get(URL)?

I'm currently building a web scraper and have run into the issue of being IP blocked. To get around this issue I'm trying to use requests_ip_rotator, which uses AWS API Gateway's large IP pool as a proxy to generate pseudo-infinite IPs for web scraping. Following this answer, I've implemented it in my code, which is below:
import requests
from bs4 import BeautifulSoup
from requests_ip_rotator import ApiGateway, EXTRA_REGIONS

url = "https://secure.runescape.com/m=hiscore_oldschool_ironman/a=13/group-ironman/?groupSize=5&page=1"

page1 = requests.get(url)
soup1 = BeautifulSoup(page1.content, "html.parser")

gateway = ApiGateway("https://secure.runescape.com/", access_key_id="****", access_key_secret="****")
gateway.start()
session = requests.Session()
session.mount("https://secure.runescape.com/", gateway)
page2 = session.get(url)
gateway.shutdown()

soup2 = BeautifulSoup(page2.content, "html.parser")
print("\n" + page1.url)
print(page2.url)
print(soup1.head.title == soup2.head.title)
input()
output:
Starting API gateways in 10 regions.
Using 10 endpoints with name 'https://secure.runescape.com/ - IP Rotate API' (10 new).
Deleting gateways for site 'https://secure.runescape.com'.
Deleted 10 endpoints for site 'https://secure.runescape.com'.
https://secure.runescape.com/m=hiscore_oldschool_ironman/a=13/group-ironman/?groupSize=5&page=1
https://6kesqk9t6d.execute-api.eu-central-1.amazonaws.com/ProxyStage/m=hiscore_oldschool_ironman/a=13/overall
False
So both times I use the .get(url) method with the same URL, but I receive different pages. requests.get(url) gives me the page I want, but when I use the Amazon gateway with session.get(url), it gives me a different page from the same site. I'm stumped as to what the issue could be, so any help would be greatly appreciated!
When making GET requests to the "https://secure.runescape.com" domain through the AWS gateway, I noticed that if the URL path is "a=13/group-ironman/?groupSize=5&page=x" for any x, I get a 302 (redirect) response that sends me to the URL path "/a=13/overall".
This leads me to believe that the runescape server is redirecting AWS IPs for some URLs, but fortunately it is not redirecting my own IP.
So my workaround is to use requests.get() without the AWS gateway for URLs that are being redirected; for the other URLs of the same site, the AWS gateway is not redirected, so I still use it to avoid being IP blocked.
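A minimal sketch of that fallback (the 302 check mirrors the behavior observed above; treating every 302 as a blocked gateway IP is an assumption):

import requests

def fetch(url, session):
    # Try the API-gateway session first, without following redirects,
    # so the server's 302 is visible instead of silently followed.
    resp = session.get(url, allow_redirects=False)
    if resp.status_code == 302:
        # The gateway IP was redirected; retry directly from our own IP.
        resp = requests.get(url)
    return resp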

How can I set a cookie using requests in Python?

Hello, I'm trying to get information from a website that requires logging in.
I already get a 200 response from the request URL where I POST the ID, password, and other fields.
The headers dict contains the request headers that can be seen in the Chrome developer tools' Network tab; the form data dict contains the ID and password.
login_site = requests.post(requestUrl, headers=headers, data=form_data)
status_code = login_site.status_code
print(status_code)
I got 200.
Below is what I've tried.
1. Session
When I tried to set cookies with a session, I failed. I've heard that a session keeps the cookies when I scrape other pages that require being logged in.
session = requests.Session()
session.post(requestUrl, headers=headers, data=form_data)
test = session.get('~~') #the website that I want to scrape
print(test.status_code)
I got 403
2. Manually set cookies
I manually built the cookie dict from what I can see:
cookies = {'wcs_bt': '...', '_production_session_id': '...'}
r = requests.post('http://engoo.co.kr/dashboard', cookies=cookies)
print(r.status_code)
I also got 403.
Actually, I don't know what I should put in the cookies dict. When I get 'wcs_bt=AAA; _production_session_id=BBB; _ga=CCC;', should I change it to a dict like {'wcs_bt': 'AAA', ...}?
When I inspect the cookies returned by
login_site = requests.post(requestUrl, headers=headers, data=form_data)
print(login_site.cookies)
all I get is
RequestsCookieJar[Cookie _production_session_id=BBB]
So that attempt failed as well.
How can I scrape the site with the cookie?
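On the narrow formatting question: yes, the raw Cookie header string maps onto a dict exactly as guessed. A quick sketch of the conversion:

cookie_str = 'wcs_bt=AAA; _production_session_id=BBB; _ga=CCC'
cookies = dict(pair.split('=', 1) for pair in cookie_str.split('; '))
# -> {'wcs_bt': 'AAA', '_production_session_id': 'BBB', '_ga': 'CCC'}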
Scraping a modern (circa 2017 or later) website that requires a login can be very tricky, because it's likely that some important portion of the login process is implemented in JavaScript.
Unless you execute that JavaScript exactly as a browser would, you won't be able to complete the login. Unfortunately, the basic Python libraries won't help.
Consider Selenium with Python, which is used for testing Web sites but can be used to automate any interaction with a Web site.
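A minimal Selenium sketch of such a login (the URL, field names, and submit selector are placeholders, not taken from any particular site):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # or webdriver.Chrome()
driver.get('https://example.com/login')  # hypothetical login page

# Fill in the form; 'username' and 'password' are assumed field names.
driver.find_element(By.NAME, 'username').send_keys('my_username')
driver.find_element(By.NAME, 'password').send_keys('my_password')
driver.find_element(By.CSS_SELECTOR, 'button[type=submit]').click()

html = driver.page_source  # fully rendered HTML, JavaScript included
driver.quit()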

Using Python 3.5 to Login, Navigate, and Scrape Without Using a Browser

I'm trying to scrape multiple financial websites (Wells Fargo, etc.) to pull my transaction history for data analysis purposes. I can do the scraping part once I get to the page I need; the problem I'm having is getting there. I don't know how to pass my username and password and then navigate from there. I would like to do this without actually opening a browser.
I found Michael Foord's article "HOWTO Fetch Internet Resources Using The urllib Package" and tried to adapt one of the examples to meet my needs, but I can't get it to work (I've tried adapting several other search results as well). Here's my code:
import bs4
import urllib.request
import urllib.parse

## Navigate to the website.
url = 'https://www.wellsfargo.com/'
values = {'j_username': 'USERNAME', 'j_password': 'PASSWORD'}
data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as response:
    the_page = response.read()
soup = bs4.BeautifulSoup(the_page, "html.parser")
The 'j_username' and 'j_password' both come from inspecting the text boxes on the login page.
I just don't think I'm pointing to the right place or passing my credentials correctly. The URL I'm using is just the login page, so is it actually logging me in? When I print the URL from the response, it returns https://wellsfargo.com/. If I'm ever able to successfully log in, it should take me to a summary page of my accounts. I would then need to follow another link to my checking, savings, etc.
I really appreciate any help you can offer.
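For what it's worth, the usual requests pattern looks like the sketch below. The form action URL and the account page URL are placeholders you would have to read out of the login form's HTML, and a heavily scripted login flow like a bank's may defeat this approach entirely:

import requests
import bs4

login_url = 'https://www.wellsfargo.com/login'  # hypothetical form action
values = {'j_username': 'USERNAME', 'j_password': 'PASSWORD'}

with requests.Session() as s:
    # POST to the form's action URL, not the page that merely shows the form.
    # The session object keeps the cookies for the follow-up navigation.
    resp = s.post(login_url, data=values)
    print(resp.url)  # on success this should no longer be the login page
    summary = s.get('https://www.wellsfargo.com/accounts')  # hypothetical
    soup = bs4.BeautifulSoup(summary.text, 'html.parser')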

Why don't BeautifulSoup and lxml work?

I'm using the mechanize library to log in to a website. I checked, and it works well. But the problem is that I can't use response.read() with BeautifulSoup or lxml.
# BeautifulSoup
response = browser.open(url)
source = response.read()
soup = BeautifulSoup(source)  # source.txt doesn't work either
for link in soup.findAll('a', {'class': 'someClass'}):
    some_list.add(link)
This doesn't work; it doesn't actually find any tags. It works well when I use requests.get(url).
# lxml -> html
response = browser.open(url)
source = response.read()
tree = html.fromstring(source)  # source.txt doesn't work either
print tree.text
like_pages = buyers = tree.xpath('//a[@class="UFINoWrap"]')  # /text() doesn't work either
print like_pages
It doesn't print anything. I suspect the problem is the return type of response, since it works well with requests.get(). What could I do? Could you please provide sample code where response.read() is used for HTML parsing?
By the way, what is the difference between response and request objects?
Thank you!
I found the solution. The cause is that mechanize.Browser is an emulated browser, and it only gets the raw HTML. The page I wanted to scrape adds the class to the tag with JavaScript, so those classes were not in the raw HTML. The best option is to use a webdriver. I used Selenium for Python. Here is the code:
from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference('network.http.phishy-userpass-length', 255)
driver = webdriver.Firefox(firefox_profile=profile)
driver.get(url)
links = driver.find_elements_by_xpath('//a[@class="someClass"]')
Note: you need to have Firefox installed; otherwise, choose another driver according to the browser you want to use.
A request is what a web client sends to a server, with details about what URL the client wants, what HTTP verb to use (GET, POST, etc.), and, if you are submitting a form, the request typically contains the data you put in the form.
A response is what a web server sends back in reply to a request from a client. The response has a status code which indicates whether the request was successful (usually code 200 if there were no problems, or an error code like 404 or 500). The response usually contains data, like the HTML in a page or the binary data in a JPEG. The response also has headers that give more information about what data is in the response (e.g. the "Content-Type" header, which says what format the data is in).
Quoted from @davidbuxton's answer on this link.
Good luck!
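To make the distinction concrete, a small illustration with requests (the URL is a placeholder):

import requests

# The request carries the URL, the verb, and any submitted data.
resp = requests.get('https://example.com/page')

# The response carries the status code, headers, and body.
print(resp.status_code)              # e.g. 200 on success
print(resp.headers['Content-Type'])  # what format the data is in
print(resp.text[:200])               # first part of the HTML body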
