I'm finally learning how to use class and __init__, but I'm having an issue with sessions. It seems like the session is not carrying over to the next request. I made a simple script for testing: it adds an item, then I make another request to see if the Bag contains any value (e.g. Bag(1)). The problem is that the item is added, but I'm getting Bag(0) when I make the second request. All I can think of is that there might be an issue with the session on my part, but I can't figure it out. Here's the script:
import requests, re
from bs4 import BeautifulSoup
class Test():
    def __init__(self):
        self.s = requests.Session()
        self.userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36'

    def cart(self):
        headers = {'User-Agent': self.userAgent}
        r = self.s.get('http://undefeated.com/store/index.php?api=1&rowid=130007&qty=1', headers=headers)
        print(r.text)
        if re.findall('Added', r.text):
            r = self.s.get('http://undefeated.com/store/cart/pg', headers=headers).text
            soup = BeautifulSoup(r, 'lxml')
            bag = soup.find('li', {'class': 'leaf cart'}).text
            print(bag)
start = Test().cart()
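One quick way to check whether the session is actually persisting state is to inspect its cookies right after the add-to-cart request inside cart(); if the site keeps the cart in a cookie (or drops it on a redirect), it should show up there. A minimal debugging sketch, reusing self.s and headers from the code above:

# Inside cart(), right after the add-to-cart request:
r = self.s.get('http://undefeated.com/store/index.php?api=1&rowid=130007&qty=1', headers=headers)
print(self.s.cookies.get_dict())  # cookies the server has set on this session, if any
print(r.status_code, r.history)   # redirects here could mean the cart state is being dropped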
This is the code; it takes too long to get the data, and it never actually retrieves the data.
import requests
from bs4 import BeautifulSoup
print("started")
url="https://www.analog.com/en/products.html#"
def get_data(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    return soup

def parse(soup):
    datas = soup.find_all("div", {"class": "product-row row"})
    print(len(datas))
    return
print("started")
soup=get_data(url)
print("got data")
parse(soup)
You will need to provide a User-Agent in your request headers; just add
header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
at the top of your file, and then pass the headers parameter to your request, as follows:
r=requests.get(url,headers=header)
You can read more at this question: How to use Python requests to fake a browser visit a.k.a and generate User Agent?
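For reference, here is a minimal sketch of the question's get_data function with that header applied (the URL and selector are taken from the question above; if the product rows are rendered by JavaScript, the header alone may still not be enough):

import requests
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

def get_data(url):
    # Send the browser-like User-Agent with the request
    r = requests.get(url, headers=header)
    return BeautifulSoup(r.text, "html.parser")

soup = get_data("https://www.analog.com/en/products.html#")
print(len(soup.find_all("div", {"class": "product-row row"})))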
I have a bs4 app that, in this context, prints the most recent post on igg-games.com.
Code:
from bs4 import BeautifulSoup
import requests
def get_new():
    new = {}
    for i in BeautifulSoup(requests.get('https://igg-games.com/').text, features="html.parser").find_all('article'):
        elem = i.find('a', class_='uk-link-reset')
        new[elem.get_text()] = (elem.get('href'), ", ".join([x.get_text() for x in i.find_all('a', rel='category tag')]), i.find('time').get_text())
    return new
current = get_new()
new_item = list(current.items())[0]
print(f"Title: {new_item[0]}\nLink: {new_item[1][0]}\nCatagories: {new_item[1][1]}\nAdded: {new_item[1][2]}")
Output on my machine:
Title: Beholder's Lair Free Download
Link: https://igg-games.com/beholders-lair-free-download.html
Categories: Action, Adventure
Added: January 7, 2021
I know it works locally. However, my end goal is to turn this into RSS feed entries, so I plugged it all into a premium PythonAnywhere container, where my function get_new() returns {}. Is there something I need to do that I'm missing?
Solved thanks to the help of Dmytro O.
Since it was likely that PythonAnywhere was blocked as a client, setting the user agent allowed me to receive a response from my intended site.
#the fix
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
when placed in my code
def get_new():
    new = {}
    for i in BeautifulSoup(requests.get('https://igg-games.com/', headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}).text, features="html.parser").find_all('article'):
        elem = i.find('a', class_='uk-link-reset')
        new[elem.get_text()] = (elem.get('href'), ", ".join([x.get_text() for x in i.find_all('a', rel='category tag')]), i.find('time').get_text())
    return new
This method was provided to me through this stack overflow post: How to use Python requests to fake a browser visit a.k.a and generate User Agent?
I'm building a Twitter bot using Tweepy and BeautifulSoup4. I'd like to save the results of a request in a list, but my script stopped working (it was working a few days ago). I've been looking at it and I don't understand why. Here is my function:
import requests
import tweepy
from bs4 import BeautifulSoup
import urllib
import os
from tweepy import StreamListener
from TwitterEngine import TwitterEngine
from ConfigEngine import TwitterAPIConfig
import urllib.request
import emoji
import random
# desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
# mobile user-agent
MOBILE_USER_AGENT = "Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36"
# Retrieve the links
def parseLinks(url):
    headers = {"user-agent": USER_AGENT}
    resp = requests.get(url, headers=headers)
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, "html.parser")
        results = []
        for g in soup.find_all('div', class_='r'):
            anchors = g.find_all('a')
            if anchors:
                link = anchors[0]['href']
                results.append(link)
        return results
The "url" parameter is 100% correct in the rest of the code. As an output, I get a "None". To be more precise, the execution stops right after line "results = []" (so it doesn't enter into the for).
Any idea?
Thank you so much in advance!
It seems that Google changed the HTML markup on the page. Try to change the search from class="r" to class="rc":
import requests
from bs4 import BeautifulSoup
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
def parseLinks(url):
    headers = {"user-agent": USER_AGENT}
    resp = requests.get(url, headers=headers)
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, "html.parser")
        results = []
        for g in soup.find_all('div', class_='rc'):  # <-- change 'r' to 'rc'
            anchors = g.find_all('a')
            if anchors:
                link = anchors[0]['href']
                results.append(link)
        return results
url = 'https://www.google.com/search?q=tree'
print(parseLinks(url))
Prints:
['https://en.wikipedia.org/wiki/Tree', 'https://simple.wikipedia.org/wiki/Tree', 'https://www.britannica.com/plant/tree', 'https://www.treepeople.org/tree-benefits', 'https://books.google.sk/books?id=yNGrqIaaYvgC&pg=PA20&lpg=PA20&dq=tree&source=bl&ots=_TP8PqSDlT&sig=ACfU3U16j9xRJgr31RraX0HlQZ0ryv9rcA&hl=sk&sa=X&ved=2ahUKEwjOq8fXyKjsAhXhAWMBHToMDw4Q6AEwG3oECAcQAg', 'https://teamtrees.org/', 'https://www.woodlandtrust.org.uk/trees-woods-and-wildlife/british-trees/a-z-of-british-trees/', 'https://artsandculture.google.com/entity/tree/m07j7r?categoryId=other']
So I want to scrape details from https://bookdepository.com
The problem is that it detects the country and changes the prices.
I want it to show prices for a different country.
This is my code; I run it on repl.it, and I need the Book Depository website to think I'm from Israel.
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"}
bookdepo_url = 'https://www.bookdepository.com/search?search=Find+book&searchTerm=' + "0671646788".replace(' ', "+")
search_result = requests.get(bookdepo_url, headers = headers)
soup = BeautifulSoup(search_result.text, 'html.parser')
result_divs = soup.find_all("div", class_= "book-item")
You would either need to route your requests through a proxy server, a VPN, or you would need to execute your code on a machine based in Israel.
That being said, the following works (as of the time of this writing):
import pprint
from bs4 import BeautifulSoup
import requests
def make_proxy_entry(proxy_ip_port):
    val = f"http://{proxy_ip_port}"
    return dict(http=val, https=val)

headers = {
    "User-Agent": (
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
}
bookdepo_url = (
    'https://www.bookdepository.com/search?search=Find+book&searchTerm='
    '0671646788'
)
ip_opts = ['82.166.105.66:44081', '82.81.32.165:3128', '82.81.169.142:80',
           '81.218.45.159:8080', '82.166.105.66:43926', '82.166.105.66:58774',
           '31.154.189.206:8080', '31.154.189.224:8080', '31.154.189.211:8080',
           '213.8.208.233:8080', '81.218.45.231:8888', '192.116.48.186:3128',
           '185.138.170.204:8080', '213.151.40.43:8080', '81.218.45.141:8080']

search_result = None
for ip_port in ip_opts:
    proxy_entry = make_proxy_entry(ip_port)
    try:
        search_result = requests.get(bookdepo_url, headers=headers,
                                     proxies=proxy_entry)
        pprint.pprint('Successfully gathered results')
        break
    except Exception as e:
        pprint.pprint(f'Failed to connect to endpoint, with proxy {ip_port}.\n'
                      f'Details: {pprint.saferepr(e)}')
else:
    pprint.pprint('Never made successful connection to end-point!')
    search_result = None

if search_result:
    soup = BeautifulSoup(search_result.text, 'html.parser')
    result_divs = soup.find_all("div", class_="book-item")
    pprint.pprint(result_divs)
This solution makes use of the requests library's proxies parameter. I scraped a list of proxies from one of the many free proxy-list sites: http://spys.one/free-proxy-list/IL/
The list of proxy IP addresses and ports was created using the following JavaScript snippet to scrape data off the page via my browser's Dev Tools:
console.log(
    "['" +
    Array.from(document.querySelectorAll('td>font.spy14'))
        .map(e=>e.parentElement)
        .filter(e=>e.offsetParent !== null)
        .filter(e=>window.getComputedStyle(e).display !== 'none')
        .filter(e=>e.innerText.match(/\s*(\d{1,3}\.){3}\d{1,3}\s*:\s*\d+\s*/))
        .map(e=>e.innerText)
        .join("', '") +
    "']"
)
Note: Yes, that JavaScript is ugly and gross, but it got the job done.
At the end of the Python script's execution, I do see that the final currency resolves, as desired, to Israeli New Shekel (ILS), based on elements like the following in the resultant HTML:
<a ... data-currency="ILS" data-isbn="9780671646783" data-price="57.26" ...>
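If you then want to pull the price and currency out of the parsed results, a minimal sketch (assuming the result_divs variable from the code above, and that each book-item div contains an anchor carrying these data-* attributes) might look like this:

# Hypothetical follow-up: read the data-* attributes shown above from each result
for div in result_divs:
    anchor = div.find("a", attrs={"data-price": True})
    if anchor:
        print(anchor.get("data-isbn"), anchor.get("data-price"), anchor.get("data-currency"))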
I don't know much about web scraping. I wrote this code, but it runs really slowly; it is used to get the search results for a Google search query. I want to try to add multithreading, but I don't really know how. Can somebody tell me how to multithread? Also, which function am I supposed to multithread?
import urllib
import requests
from bs4 import BeautifulSoup
from multiprocessing import Pool
# desktop user-agent
def get_listing(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    html = None
    links = None
    r = requests.get(url, headers=headers, timeout=10)
    if r.status_code == 200:
        html = r.text
        soup = BeautifulSoup(html, 'lxml')
        listing_section = soup.select('#offers_table table > tbody > tr > td > h3 > a')
        links = [link['href'].strip() for link in listing_section]
    return links

def scrapeLinks(query_string):
    USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
    query = query_string
    query = query.replace(' ', '+')
    URL = f"https://google.com/search?q={query}"
    headers = {"user-agent": USER_AGENT}
    resp = requests.get(URL, headers=headers)
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, "html.parser")
        results = []
        for g in soup.find_all('div', class_='r'):
            anchors = g.find_all('a')
            if anchors:
                link = anchors[0]['href']
                title = g.find('h3').text
                item = {
                    "title": title,
                    "link": link
                }
                results.append(item)
        return results

def getFirst5Results(query_string):
    list = scrapeLinks(query_string)
    return [list[0]["link"], list[1]["link"], list[2]["link"], list[3]["link"], list[4]["link"]]
A few things about multithreading:
- You can use it for code that requires network calls, for instance invoking an API.
- It is useful when the code runs for a long time and you want the work to happen in the background.
In the case you've described, web scraping is a long-running task, as it involves a network call to Google and parsing of the results after they come back. Assuming that you're using the scrapeLinks function for scraping, here's some code:
import threading
t1 = threading.Thread(target=scrapeLinks, args=(query_string,))
t1.start()
To wait for the thread to finish, use:
t1.join()
Note that join() only blocks until the thread completes; it does not hand back scrapeLinks's return value, so if you need the results themselves, a thread pool (sketched below) is more convenient.
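As a minimal sketch (assuming the scrapeLinks function from the question above; the query strings are placeholders), concurrent.futures lets you run several searches on parallel threads and still collect their return values:

from concurrent.futures import ThreadPoolExecutor

queries = ["python threading", "beautifulsoup tutorial", "requests user agent"]  # placeholder inputs

# Each scrapeLinks call spends most of its time waiting on the network,
# so running the calls on separate threads overlaps that waiting time.
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(scrapeLinks, queries))

for query, links in zip(queries, results):
    print(query, "->", len(links or []), "results")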