Use POST to change page - Python

I've been using Selenium for some time to scrape a website, but for some reason it doesn't work anymore. I was using Selenium because you need to interact with the site to flip through pages (i.e. click on a "next" button).
As a solution, I was thinking of using the POST method from Requests. I'm not sure whether it's doable, since I've never used POST and I'm not familiar with what it does (though I kind of understand the general idea).
My code would look something like this:
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent":
           "Mozilla/5.0 (Macintosh; Intel Mac OS X 10 11 5) "
           "AppleWebKit/537.36 (KHTML, like Gecko) "
           "Chrome/50.0.2661.102 Safari/537.36"}

url = "https://www.centris.ca/fr/propriete~a-vendre?view=Thumbnail"

def infinity():
    while True:
        yield

c = 0
urls = []
for i in infinity():
    c += 1
    page = list(str(soup.find("li", {"class": "pager-current"}).text).split())
    pageTot = int("".join(page[-2:]))  # Check the total number of pages
    if c <= pageTot:
        if c <= 1:  # Scrape the first page
            req = requests.get(url, headers=headers)
        else:
            pass
            # This is where I'm stuck, but ideally I'd be using the POST method in some way
        soup = BeautifulSoup(req.content, "lxml")
        for link in soup.find_all("a", {"class": "a-more-detail"}):
            try:  # For each page, collect the ad URLs
                urls.append("https://www.centris.ca" + link["href"])
            except KeyError:
                pass
    else:  # When all pages are scraped, exit the loop
        break

for url in list(dict.fromkeys(urls)):
    pass  # do stuff
This is what is going on when you click "next" on the webpage: the request sends a startPosition parameter that begins at 0 on page 1 and increases in steps of 12. And this is part of the response:
{"d":{"Message":"","Result":{"html": [...], "count":34302,"inscNumberPerPage":12,"title":""},"Succeeded":true}}
With that information, is it possible to use the POST method to scrape every page? And how could I do that?

The following should do the trick. I've added duplicate filtering logic to avoid printing duplicate links. The script should break once there are no more results left to scrape.
import requests
from bs4 import BeautifulSoup

base = 'https://www.centris.ca{}'
post_link = 'https://www.centris.ca/Property/GetInscriptions'
url = 'https://www.centris.ca/fr/propriete~a-vendre?view=Thumbnail'

unique_links = set()
payload = {"startPosition": 0}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    s.headers['content-type'] = 'application/json; charset=UTF-8'
    s.get(url)  # send this request first to pick up the cookies
    while True:
        r = s.post(post_link, json=payload)
        if not len(r.json()['d']['Result']['html']):
            break
        soup = BeautifulSoup(r.json()['d']['Result']['html'], "html.parser")
        for item in soup.select(".thumbnailItem a.a-more-detail"):
            unique_link = base.format(item.get("href"))
            if unique_link not in unique_links:
                print(unique_link)
                unique_links.add(unique_link)
        payload['startPosition'] += 12

Related

Parsing the site myip.ms

I'm writing a parser for the site https://myip.ms/, specifically for this page: https://myip.ms/browse/sites/1/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714. Everything works fine with that link, but if you go to another page, https://myip.ms/browse/sites/2/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714, it does not output any data, although the site structure is the same. I think this may be because the site has a limit on views, or because you need to register, but I can't find what request you need to send to log in to your account. What should I do?
import requests
from bs4 import BeautifulSoup
import time

link_list = []
URL = 'https://myip.ms/browse/sites/2/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714'
HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 YaBrowser/20.12.2.105 Yowser/2.5 Safari/537.36', 'accept': '*/*'}
#HOST =

def get_html(url, params=None):
    r = requests.get(url, headers=HEADERS, params=params)
    return r

def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('td', class_='row_name')
    for item in items:
        links = item.find('a').get('href')
        link_list.append({
            'link': links
        })

def parser():
    print(URL)
    html = get_html(URL)
    if html.status_code == 200:
        get_content(html.text)
    else:
        print('Error')

parser()
print(link_list)
Use a session ID with your request. It will allow you at least 50 requests per day.
If you use a proxy that supports cookies, this number might be even higher.
So the process is as follows (a code sketch follows the list):
load the page with your browser.
find the session ID in the request inside your Dev Tools.
use this session ID in your request; no headers or additional info are required.
enjoy results for 50 requests per day.
repeat in 24 hours.
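For illustration, here is a minimal sketch of that approach, reusing the selectors from the code above. The cookie name PHPSESSID and its value are assumptions; copy the actual session cookie name and value from your own Dev Tools.

import requests
from bs4 import BeautifulSoup

# Assumption: the session cookie is named "PHPSESSID"; use whatever name and
# value your browser's Dev Tools actually show for myip.ms.
cookies = {'PHPSESSID': 'your-session-id-from-dev-tools'}

URL = 'https://myip.ms/browse/sites/2/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714'
r = requests.get(URL, cookies=cookies)

soup = BeautifulSoup(r.text, 'html.parser')
link_list = [td.a.get('href') for td in soup.find_all('td', class_='row_name') if td.a]
print(link_list)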

Want to send a GET request in Python from a different country

So I want to scrape details from https://bookdepository.com
The problem is that it detects the country and changes the prices.
I want it to be a different country.
This is my code; I run it on real.it, and I need the Book Depository website to think I'm from Israel.
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"}
bookdepo_url = 'https://www.bookdepository.com/search?search=Find+book&searchTerm=' + "0671646788".replace(' ', "+")
search_result = requests.get(bookdepo_url, headers=headers)
soup = BeautifulSoup(search_result.text, 'html.parser')
result_divs = soup.find_all("div", class_="book-item")
You would need to either route your requests through a proxy server or a VPN, or execute your code on a machine based in Israel.
That being said, the following works (as of the time of this writing):
import pprint

from bs4 import BeautifulSoup
import requests


def make_proxy_entry(proxy_ip_port):
    val = f"http://{proxy_ip_port}"
    return dict(http=val, https=val)


headers = {
    "User-Agent": (
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
}

bookdepo_url = (
    'https://www.bookdepository.com/search?search=Find+book&searchTerm='
    '0671646788'
)

ip_opts = ['82.166.105.66:44081', '82.81.32.165:3128', '82.81.169.142:80',
           '81.218.45.159:8080', '82.166.105.66:43926', '82.166.105.66:58774',
           '31.154.189.206:8080', '31.154.189.224:8080', '31.154.189.211:8080',
           '213.8.208.233:8080', '81.218.45.231:8888', '192.116.48.186:3128',
           '185.138.170.204:8080', '213.151.40.43:8080', '81.218.45.141:8080']

search_result = None
for ip_port in ip_opts:
    proxy_entry = make_proxy_entry(ip_port)
    try:
        search_result = requests.get(bookdepo_url, headers=headers,
                                     proxies=proxy_entry)
        pprint.pprint('Successfully gathered results')
        break
    except Exception as e:
        pprint.pprint(f'Failed to connect to endpoint, with proxy {ip_port}.\n'
                      f'Details: {pprint.saferepr(e)}')
else:
    pprint.pprint('Never made successful connection to end-point!')
    search_result = None

if search_result:
    soup = BeautifulSoup(search_result.text, 'html.parser')
    result_divs = soup.find_all("div", class_="book-item")
    pprint.pprint(result_divs)
This solution makes use of the requests library's proxies parameter. I scraped a list of proxies from one of the many free proxy-list sites: http://spys.one/free-proxy-list/IL/
The list of proxy IP addresses and ports was created using the following JavaScript snippet to scrape data off the page via my browser's Dev Tools:
console.log(
  "['" +
  Array.from(document.querySelectorAll('td>font.spy14'))
    .map(e => e.parentElement)
    .filter(e => e.offsetParent !== null)
    .filter(e => window.getComputedStyle(e).display !== 'none')
    .filter(e => e.innerText.match(/\s*(\d{1,3}\.){3}\d{1,3}\s*:\s*\d+\s*/))
    .map(e => e.innerText)
    .join("', '") +
  "']"
)
Note: Yes, that JavaScript is ugly and gross, but it got the job done.
At the end of the Python script's execution, I do see that the final currency resolves, as desired, to Israeli New Shekel (ILS), based on elements like the following in the resultant HTML:
<a ... data-currency="ILS" data-isbn="9780671646783" data-price="57.26" ...>
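As a rough follow-up sketch, the price and currency can be pulled out of those data-* attributes. Note the assumption that the attributes sit on some tag inside each "book-item" div; adjust the lookup if the markup differs.

# Minimal sketch: read the data-* attributes shown in the element above.
# Assumption: a tag inside each "book-item" div carries data-price.
for div in result_divs:
    tag = div.find(attrs={"data-price": True})
    if tag:
        print(tag.get("data-isbn"), tag.get("data-price"), tag.get("data-currency"))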

Code runs for 20min+, then stops with no output, what's the problem?

I'm trying to get the 'src' from 500 profile pictures on Transfermarkt, that is, the picture on each player's profile, not the small picture from the list. I've managed to store each player's URL in a list. Now, when I try to iterate through it, the code just runs and runs, then stops after 20-something minutes without any error or output from my print command. As I said, I want the source (src) of each player's picture on their respective profile.
I'm not really sure what's wrong with the code, since I don't get any error messages. I built it with help from different posts here on Stack Overflow.
from bs4 import BeautifulSoup
import requests
import pandas as pd

playerID = []
playerImgSrc = []
result = []

for page in range(1, 21):
    r = requests.get("https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?land_id=0&ausrichtung=alle&spielerposition_id=alle&altersklasse=alle&jahrgang=0&kontinent_id=0&plus=1",
                     params={"page": page},
                     headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0"}
                     )
    soup = BeautifulSoup(r.content, "html.parser")
    links = soup.select('a.spielprofil_tooltip')
    for i in range(len(links)):
        playerID.append(links[i].get('id'))

playerProfile = ["https://www.transfermarkt.com/josh-maja/profil/spieler/" + x for x in playerID]

for p in playerProfile:
    html = requests.get(p).text
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select('div.dataBild')
    for i in range(len(link)):
        playerImgSrc.append(link[i].get('src'))

print(playerImgSrc)
Basically, the website navigation uses AJAX, which is quick, much like browsing a folder on your local machine.
Therefore, the data displayed in the UI (user interface) actually comes from a background XHR request to a specific endpoint on the host, marktwertetop, where AJAX is used.
I was able to locate the XHR request being made to it, then called it directly with the required parameters while looping over the pages.
I figured out that the difference between the small and the large photo is just one path segment in the URL, small versus header, so I replaced it within the URL itself.
I also used requests.Session() to maintain the session during my loop and while downloading the pics, which helps prevent the TCP/security layer from blocking, refusing, or dropping my requests while scraping/downloading.
Imagine that you open a browser and navigate between pages of the same website: a cookie session is created and maintained as long as you are connected to the site, and it refreshes itself when idle.
But the way you were doing it, you effectively open a browser, close it, open it again, close it, and so on, which the server side may count as DDoS-like or flood behavior, and blocking that is very basic firewall behavior.
import requests
from bs4 import BeautifulSoup

site = "https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={}"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}


def main(url):
    with requests.Session() as req:
        allin = []
        for item in range(1, 21):
            print(f"Collecting Links From Page# {item}")
            r = req.get(url.format(item), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            img = [item.get("src") for item in soup.findAll(
                "img", class_="bilderrahmen-fixed")]
            convert = [item.replace("small", "header") for item in img]
            allin.extend(convert)
        return allin


def download():
    urls = main(site)
    with requests.Session() as req:
        for url in urls:
            r = req.get(url, headers=headers)
            name = url[52:]
            name = name.split('?')[0]
            print(f"Saving {name}")
            with open(f"{name}", 'wb') as f:
                f.write(r.content)


download()
UPDATE per user comment:
import requests
from bs4 import BeautifulSoup
import csv

site = "https://www.transfermarkt.com/spieler-statistik/wertvollstespieler/marktwertetop?ajax=yw1&page={}"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}


def main(url):
    with requests.Session() as req:
        allin = []
        names = []
        for item in range(1, 21):
            print(f"Collecting Links From Page# {item}")
            r = req.get(url.format(item), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            img = [item.get("src") for item in soup.findAll(
                "img", class_="bilderrahmen-fixed")]
            convert = [item.replace("small", "header") for item in img]
            name = [name.text for name in soup.findAll(
                "a", class_="spielprofil_tooltip")][:-5]
            allin.extend(convert)
            names.extend(name)
        with open("data.csv", 'w', newline="", encoding="UTF-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Name", "IMG"])
            data = zip(names, allin)
            writer.writerows(data)


main(site)
Output: view online

How to crawl several review pages using Python?

I have a question about a web crawler.
I want to get several review pages using Python.
Here is my code for the web crawler.
import requests

URL = 'https://www.example.co.kr/users/sign_in'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'
headers = {'Content-type': 'application/json', 'Accept': 'text/plain', 'User-Agent': user_agent}
login_data = {'user': {'email': 'id', 'password': 'password', 'remember_me': 'true'}}

client = requests.session()
login_response = client.post(URL, json=login_data, headers=headers)
print(login_response.content.decode('utf-8'))

jre = 'https://www.example.co.kr/companies/reviews/ent?page=1'
index = client.get(jre)
html = index.content.decode('utf-8')
print(html)
This code only gets page=1, but I want to get page=1, page=2, page=3, and so on, using the format method. How can I achieve that?
You should use a while or a for loop over the pages, depending on your needs.
Try a pattern like this:
page = 1
while page <= MAX_PAGE or not REACHED_STOPPING_CONDITION:
    # Compose page url
    jre = f'https://www.example.co.kr/companies/reviews/ent?page={page}'
    # Get page url
    index = client.get(jre)
    # Do stuff...
    # Increment page counter
    page += 1
I think that once you have access to the website you do not need to perform the login again. If it is needed, you should insert the login part into the loop.
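For illustration, here is a minimal sketch that reuses the login code from the question, logs in once, and then loops over the pages; MAX_PAGE is a placeholder for however many pages you want to fetch.

import requests

URL = 'https://www.example.co.kr/users/sign_in'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'
headers = {'Content-type': 'application/json', 'Accept': 'text/plain', 'User-Agent': user_agent}
login_data = {'user': {'email': 'id', 'password': 'password', 'remember_me': 'true'}}

MAX_PAGE = 10  # placeholder: number of review pages to fetch

client = requests.session()
client.post(URL, json=login_data, headers=headers)  # log in once; the session keeps the cookies

for page in range(1, MAX_PAGE + 1):
    jre = f'https://www.example.co.kr/companies/reviews/ent?page={page}'
    index = client.get(jre)
    html = index.content.decode('utf-8')
    # do stuff with html here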
Another way to navigate the website pages is to find a sort of "Next page" or "Previous page" reference in the document (for example with BeautifulSoup) and then follow it:
from bs4 import BeautifulSoup

# Compose the first page url
page = 1
jre = 'https://www.example.co.kr/companies/reviews/ent?page=1'
while page <= MAX_PAGE or not REACHED_STOPPING_CONDITION:
    # Get page
    index = client.get(jre)
    # Do stuff...
    # Search the "next page" link in the document (the CSS selector is a placeholder)
    soup = BeautifulSoup(index.content, 'html.parser')
    next_link = soup.select_one('a.next-page')
    if next_link is None:
        break
    jre = next_link.get('href')
    # Increment page counter
    page += 1

I get nothing when trying to scrape a table

So I want to extract the number 45.5 from here: https://www.myscore.com.ua/match/I9pSZU2I/#odds-comparison;over-under;1st-qrt
But when I try to find the table I get nothing. Here's my code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.myscore.com.ua/match/I9pSZU2I/#odds-comparison;over-under;1st-qrt'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux armv7l) AppleWebKit/537.36 (KHTML, like Gecko) Raspbian Chromium/65.0.3325.181 Chrome/65.0.3325.181 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
text = soup.find_all('table', class_ = 'odds sortable')
print(text)
Can anybody help me extract the number and store its value in a variable?
You can try to do this without Selenium by recreating the dynamic request that loads the table.
Looking around in the Network tab of the page, I saw this XMLHttpRequest: https://d.myscore.com.ua/x/feed/d_od_I9pSZU2I_ru_1_eu
Try to reproduce the same parameters as the request.
To access the Network tab: right-click -> Inspect Element -> Network tab -> select XHR and find the second request.
The final code would be like this:
import requests

headers = {'x-fsign': 'SW9D1eZo'}
page = requests.get('https://d.myscore.com.ua/x/feed/d_od_I9pSZU2I_ru_1_eu',
                    headers=headers)
You should check whether the x-fsign value is different based on your browser/IP.
