Want to send a GET request in Python from a different country - python

So I want to scrape details from https://bookdepository.com
The problem is that it detects the country and changes the prices.
I want it to be a different country.
This is my code. I run it on repl.it, and I need the Book Depository website to think I'm from Israel.
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"}
bookdepo_url = 'https://www.bookdepository.com/search?search=Find+book&searchTerm=' + "0671646788".replace(' ', "+")
search_result = requests.get(bookdepo_url, headers=headers)
soup = BeautifulSoup(search_result.text, 'html.parser')
result_divs = soup.find_all("div", class_="book-item")

You would either need to route your requests through a proxy server or a VPN, or you would need to execute your code on a machine based in Israel.
That being said, the following works (as of the time of this writing):
import pprint

from bs4 import BeautifulSoup
import requests


def make_proxy_entry(proxy_ip_port):
    val = f"http://{proxy_ip_port}"
    return dict(http=val, https=val)


headers = {
    "User-Agent": (
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
}

bookdepo_url = (
    'https://www.bookdepository.com/search?search=Find+book&searchTerm='
    '0671646788'
)

ip_opts = ['82.166.105.66:44081', '82.81.32.165:3128', '82.81.169.142:80',
           '81.218.45.159:8080', '82.166.105.66:43926', '82.166.105.66:58774',
           '31.154.189.206:8080', '31.154.189.224:8080', '31.154.189.211:8080',
           '213.8.208.233:8080', '81.218.45.231:8888', '192.116.48.186:3128',
           '185.138.170.204:8080', '213.151.40.43:8080', '81.218.45.141:8080']

search_result = None
for ip_port in ip_opts:
    proxy_entry = make_proxy_entry(ip_port)
    try:
        search_result = requests.get(bookdepo_url, headers=headers,
                                     proxies=proxy_entry)
        pprint.pprint('Successfully gathered results')
        break
    except Exception as e:
        pprint.pprint(f'Failed to connect to endpoint, with proxy {ip_port}.\n'
                      f'Details: {pprint.saferepr(e)}')
else:
    pprint.pprint('Never made successful connection to end-point!')
    search_result = None

if search_result:
    soup = BeautifulSoup(search_result.text, 'html.parser')
    result_divs = soup.find_all("div", class_="book-item")
    pprint.pprint(result_divs)
This solution makes use of the requests library's proxies parameter. I scraped a list of proxies from one of the many free proxy-list sites: http://spys.one/free-proxy-list/IL/
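For reference, the proxies argument is simply a mapping from URL scheme to proxy URL. A minimal sketch, reusing one address from the ip_opts list above (free proxies like this one are often short-lived, so treat the address as a placeholder):
import requests

# Minimal proxies usage: route both HTTP and HTTPS traffic through one proxy.
proxy = "http://82.81.32.165:3128"  # one entry from the list above; may already be dead
proxies = {"http": proxy, "https": proxy}

response = requests.get("https://www.bookdepository.com", proxies=proxies, timeout=10)
print(response.status_code)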
The list of proxy IP addresses and ports was created using the following JavaScript snippet to scrape data off the page via my browser's Dev Tools:
console.log(
    "['" +
    Array.from(document.querySelectorAll('td>font.spy14'))
        .map(e=>e.parentElement)
        .filter(e=>e.offsetParent !== null)
        .filter(e=>window.getComputedStyle(e).display !== 'none')
        .filter(e=>e.innerText.match(/\s*(\d{1,3}\.){3}\d{1,3}\s*:\s*\d+\s*/))
        .map(e=>e.innerText)
        .join("', '") +
    "']"
)
Note: Yes, that JavaScript is ugly and gross, but it got the job done.
At the end of the Python script's execution, I do see that the final currency resolves, as desired, to Israeli New Shekel (ILS), based on elements like the following in the resultant HTML:
<a ... data-currency="ILS" data-isbn="9780671646783" data-price="57.26" ...>
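If you want to read those attributes programmatically rather than eyeballing the HTML, here is a small sketch of my own; it assumes the result_divs produced by the script above and that each book-item contains an anchor carrying those data-* attributes:
# Sketch: pull ISBN, price and currency out of each book-item div.
for div in result_divs:
    anchor = div.find("a", attrs={"data-price": True})
    if anchor:
        print(anchor.get("data-isbn"),
              anchor.get("data-price"),
              anchor.get("data-currency"))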

Related

Beautiful soup returning empty in PythonAnywhere

I have a bs4 app that, in this context, prints the most recent post on igg-games.com.
Code:
from bs4 import BeautifulSoup
import requests


def get_new():
    new = {}
    for i in BeautifulSoup(requests.get('https://igg-games.com/').text, features="html.parser").find_all('article'):
        elem = i.find('a', class_='uk-link-reset')
        new[elem.get_text()] = (elem.get('href'),
                                ", ".join([x.get_text() for x in i.find_all('a', rel='category tag')]),
                                i.find('time').get_text())
    return new


current = get_new()
new_item = list(current.items())[0]
print(f"Title: {new_item[0]}\nLink: {new_item[1][0]}\nCategories: {new_item[1][1]}\nAdded: {new_item[1][2]}")
Output on my machine:
Title: Beholder's Lair Free Download
Link: https://igg-games.com/beholders-lair-free-download.html
Categories: Action, Adventure
Added: January 7, 2021
I know it works. My end goal is to turn this into RSS feed entries, so I plugged it all into a premium PythonAnywhere container. However, there my function get_new() returns {}. Is there something I need to do that I'm missing?
Solved thanks to the help of Dmytro O.
Since PythonAnywhere's default client was likely being blocked, setting the user agent allowed me to receive a response from my intended site.
#the fix
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
When placed in my code:
def get_new():
    new = {}
    for i in BeautifulSoup(requests.get('https://igg-games.com/', headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}).text, features="html.parser").find_all('article'):
        elem = i.find('a', class_='uk-link-reset')
        new[elem.get_text()] = (elem.get('href'),
                                ", ".join([x.get_text() for x in i.find_all('a', rel='category tag')]),
                                i.find('time').get_text())
    return new
This method was pointed out to me by this Stack Overflow post: How to use Python requests to fake a browser visit a.k.a and generate User Agent?
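A small optional tidy-up of my own: rather than repeating the header dict in every call, you can attach it once to a requests.Session and reuse that session inside get_new(). A sketch under that assumption:
import requests
from bs4 import BeautifulSoup

# Attach the User-Agent once; every request made through this session reuses it.
session = requests.Session()
session.headers['User-Agent'] = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                                 'AppleWebKit/537.36 (KHTML, like Gecko) '
                                 'Chrome/39.0.2171.95 Safari/537.36')

def get_new():
    new = {}
    soup = BeautifulSoup(session.get('https://igg-games.com/').text, features="html.parser")
    for i in soup.find_all('article'):
        elem = i.find('a', class_='uk-link-reset')
        new[elem.get_text()] = (elem.get('href'),
                                ", ".join(x.get_text() for x in i.find_all('a', rel='category tag')),
                                i.find('time').get_text())
    return new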

Use Post to change page

I've been using Selenium for some time to scrape a website, but for some reason it doesn't work anymore. I was using Selenium because you need to interact with the site to flip through pages (i.e., click on a next button).
As a solution, I was thinking of using the POST method from requests. I'm not sure if it's doable since I've never used the POST method and I'm not familiar with what it does (though I kind of understand the general idea).
My code looks something like this:
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent":
           "Mozilla/5.0 (Macintosh; Intel Mac OS X 10 11 5) "
           "AppleWebKit/537.36 (KHTML, like Gecko) "
           "Chrome/50.0.2661.102 Safari/537.36"}
url = "https://www.centris.ca/fr/propriete~a-vendre?view=Thumbnail"

def infinity():
    while True:
        yield

c = 0
urls = []
for i in infinity():
    c += 1
    page = list(str(soup.find("li", {"class": "pager-current"}).text).split())
    pageTot = int("".join(page[-2:]))  # Check the total number of pages
    if c <= pageTot:
        if c <= 1:  # Scrape the first page
            req = requests.get(url, headers=headers)
        else:
            pass
            # This is where I'm stuck, but ideally I'd be using the POST method in some way
        soup = BeautifulSoup(req.content, "lxml")
        for link in soup.find_all("a", {"class": "a-more-detail"}):
            try:  # For each page, scrape the ad URLs
                urls.append("https://www.centris.ca" + link["href"])
            except KeyError:
                pass
    else:  # When all pages are scraped, exit the loop
        break

for url in list(dict.fromkeys(urls)):
    pass  # do stuff
This is what is going on when you click next on the webpage:
This is the request (the startPosition begins at 0 on page 1 and increases in steps of 12).
And this is part of the response:
{"d":{"Message":"","Result":{"html": [...], "count":34302,"inscNumberPerPage":12,"title":""},"Succeeded":true}}
With that information, is it possible to use the POST method to scrape every page? And how could I do that?
The following should do the trick. I've added duplicate filtering logic to avoid printing duplicate links. The script should break once there are no more results left to scrape.
import requests
from bs4 import BeautifulSoup

base = 'https://www.centris.ca{}'
post_link = 'https://www.centris.ca/Property/GetInscriptions'
url = 'https://www.centris.ca/fr/propriete~a-vendre?view=Thumbnail'

unique_links = set()
payload = {"startPosition": 0}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    s.headers['content-type'] = 'application/json; charset=UTF-8'
    s.get(url)  # Send this request first to pick up the cookies
    while True:
        r = s.post(post_link, json=payload)
        if not len(r.json()['d']['Result']['html']):
            break
        soup = BeautifulSoup(r.json()['d']['Result']['html'], "html.parser")
        for item in soup.select(".thumbnailItem a.a-more-detail"):
            unique_link = base.format(item.get("href"))
            if unique_link not in unique_links:
                print(unique_link)
                unique_links.add(unique_link)
        payload['startPosition'] += 12

How do I add multithreading to this?

I don't know much about web scraping. I wrote this code, but it runs really slowly; it is used to get the search results from a Google query. I want to try to add multithreading, but I don't really know how. Can somebody tell me how to multithread? Also, which function am I supposed to multithread?
import urllib
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool

# desktop user-agent
def get_listing(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    html = None
    links = None
    r = requests.get(url, headers=headers, timeout=10)
    if r.status_code == 200:
        html = r.text
        soup = BeautifulSoup(html, 'lxml')
        listing_section = soup.select('#offers_table table > tbody > tr > td > h3 > a')
        links = [link['href'].strip() for link in listing_section]
    return links

def scrapeLinks(query_string):
    USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
    query = query_string
    query = query.replace(' ', '+')
    URL = f"https://google.com/search?q={query}"
    headers = {"user-agent": USER_AGENT}
    resp = requests.get(URL, headers=headers)
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, "html.parser")
        results = []
        for g in soup.find_all('div', class_='r'):
            anchors = g.find_all('a')
            if anchors:
                link = anchors[0]['href']
                title = g.find('h3').text
                item = {
                    "title": title,
                    "link": link
                }
                results.append(item)
        return results

def getFirst5Results(query_string):
    list = scrapeLinks(query_string)
    return [list[0]["link"], list[1]["link"], list[2]["link"], list[3]["link"], list[4]["link"]]
A few things about multithreading:
- You can use it for code that requires network calls, for instance invoking an API.
- You can use it when the code runs for a long time and you want the work to happen in the background.
In the case you've described, web scraping is a long-running task, as it involves a network call to Google and parsing of the results once they come back. Assuming that you're using the scrapeLinks function for scraping, here's some code:
import threading

t1 = threading.Thread(target=scrapeLinks, args=(query_string,))
t1.start()
To wait for the spawned thread to finish, use:
t1.join()
Note that join() only blocks until the thread completes; threading.Thread does not hand back the function's return value, so to collect results you need to store them somewhere the main thread can see them, or use a thread pool, as sketched below.
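As a concrete illustration (my addition, not part of the original answer), concurrent.futures provides a thread pool that both runs calls concurrently and returns their results. The sketch below assumes you want to fetch the pages returned by getFirst5Results with get_listing, several at a time:
from concurrent.futures import ThreadPoolExecutor

# Sketch: fetch several listing pages concurrently.
# get_listing and getFirst5Results are the functions defined in the question above.
def scrape_first_5_concurrently(query_string):
    urls = getFirst5Results(query_string)
    with ThreadPoolExecutor(max_workers=5) as executor:
        # executor.map runs get_listing on each URL in a worker thread
        # and yields the results in the same order as the input URLs.
        return list(executor.map(get_listing, urls))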

I am not able to scrape the web data from the given website using python

Hi, I am trying to scrape data from the site https://health.usnews.com/doctors/city-index/new-jersey . I want all the city names, and then from each city's link I want to scrape more data. But when using the requests library in Python, something goes wrong. Some session or cookies or something are stopping me from crawling the data. Please help me out.
>>> import requests
>>> url = 'https://health.usnews.com/doctors/city-index/new-jersey'
>>> html_content = requests.get(url)
>>> html_content.status_code
403
>>> html_content.content
'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http://health.usnews.com/doctors/city-index/new-jersey" on this server.<P>\nReference #18.7d70b17.1528874823.3fac5589\n</BODY>\n</HTML>\n'
>>>
Here is the error I am getting.
You need to add a header to your request so that the site thinks you are a genuine user.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
html_content = requests.get(url, headers=headers)
First of all, as the previous answer suggested, I would recommend adding a header to your code, so it should look something like this:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Firefox/60.0'}
url = 'https://health.usnews.com/doctors/city-index/new-jersey'
html_content = requests.get(url, headers=headers)
print(html_content.status_code)
print(html_content.text)
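One small extra safeguard of my own (not part of either answer): you can make the script fail loudly if the site still rejects the request, instead of silently parsing an "Access Denied" page.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Firefox/60.0'}
url = 'https://health.usnews.com/doctors/city-index/new-jersey'

html_content = requests.get(url, headers=headers)
html_content.raise_for_status()  # raises requests.HTTPError on 403/404/etc.
print(html_content.text)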

Session not transferring over to next requests

I'm finally learning how to use classes and __init__; however, I'm having an issue with sessions. It seems like the session is not carrying over to the next request. I made a simple script for testing: it adds an item, then I make another request to see if the Bag contains any value (e.g. Bag(1)). The problem is that the item is added, but I'm getting Bag(0) when I make the second request. All I can think of is that there might be an issue with the session on my part, but I can't figure it out. Here's the script:
import requests, re
from bs4 import BeautifulSoup


class Test():
    def __init__(self):
        self.s = requests.Session()
        self.userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'

    def cart(self):
        headers = {'User-Agent': self.userAgent}
        r = self.s.get('http://undefeated.com/store/index.php?api=1&rowid=130007&qty=1', headers=headers)
        print(r.text)
        if re.findall('Added', r.text):
            r = self.s.get('http://undefeated.com/store/cart/pg', headers=headers).text
            soup = BeautifulSoup(r, 'lxml')
            bag = soup.find('li', {'class': 'leaf cart'}).text
            print(bag)


start = Test().cart()
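For what it's worth, requests.Session does persist cookies between calls made through the same session object. A minimal, site-independent sketch you can use to confirm that on your side (httpbin.org is just a convenient echo service, not part of the original script):
import requests

# Minimal check that a Session carries cookies from one request to the next.
s = requests.Session()
s.get('https://httpbin.org/cookies/set?demo=1')  # server sets a cookie (and redirects)
r = s.get('https://httpbin.org/cookies')         # cookie is sent back automatically
print(r.json())  # expected: {'cookies': {'demo': '1'}}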
