Parsing the site myip.ms - python

I'm writing a parser for the site https://myip.ms/. For this page, https://myip.ms/browse/sites/1/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714, everything works fine, but if I request another page, https://myip.ms/browse/sites/2/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714, it does not return any data, even though the page structure is the same. I think this may be because the site has a limit on views, or because you need to register, but I can't figure out what request needs to be sent to log in to an account. What should I do?
import requests
from bs4 import BeautifulSoup
import time

link_list = []

URL = 'https://myip.ms/browse/sites/2/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714'
HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 YaBrowser/20.12.2.105 Yowser/2.5 Safari/537.36', 'accept': '*/*'}
#HOST =

def get_html(url, params=None):
    r = requests.get(url, headers=HEADERS, params=params)
    return r

def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('td', class_='row_name')
    for item in items:
        links = item.find('a').get('href')
        link_list.append({
            'link': links
        })

def parser():
    print(URL)
    html = get_html(URL)
    if html.status_code == 200:
        get_content(html.text)
    else:
        print('Error')

parser()
print(link_list)

Use a session ID with your request. It will allow you at least 50 requests per day.
If you use a proxy that supports cookies, this number might be even higher.
The process is as follows:
Load the page in your browser.
Find the session ID in the request inside your Dev Tools.
Use that session ID in your request, as sketched below; no headers or additional info are required.
Enjoy the results for 50 requests per day.
Repeat in 24 hours.
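A minimal sketch of that idea with requests. The cookie name and value below are placeholders, not the site's actual names: copy the real session cookie from the request you see in your browser's Dev Tools (Network tab).

import requests
from bs4 import BeautifulSoup

URL = 'https://myip.ms/browse/sites/2/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714'

# Hypothetical cookie name/value: replace both with the session cookie
# shown in your own browser's Dev Tools for a logged-in page load.
cookies = {'PHPSESSID': 'paste-your-session-id-here'}

r = requests.get(URL, cookies=cookies)
if r.status_code == 200:
    soup = BeautifulSoup(r.text, 'html.parser')
    for td in soup.find_all('td', class_='row_name'):
        a = td.find('a')
        if a:
            print(a.get('href'))
else:
    print('Error:', r.status_code)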

Related

Getting the page source of imgur

I'm trying to get the page source of an imgur page using requests, but the results I'm getting are different from the actual source. I understand that these pages are rendered using JS, but that is not the issue I'm asking about.
It seems I'm getting redirected because they detect I'm using an automated browser, but I'd prefer not to use Selenium here. For example, the following code scrapes the page source of two imgur IDs (one valid and one invalid) whose pages have different sources in a browser.
import requests
from bs4 import BeautifulSoup

url1 = "https://i.imgur.com/ssXK5"  # valid ID
url2 = "https://i.imgur.com/ssXK4"  # invalid ID

def get_source(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Mobile Safari/537.36"}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup

page1 = get_source(url1)
page2 = get_source(url2)
print(page1 == page2)
# True
The scraped page sources are identical, so I presume it's an anti-scraping thing. I know there is an imgur API, but I'd like to know how to circumvent such a redirection, if possible. Is there any way to get the actual source code using the requests module?
Thanks.
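As a first diagnostic step, you can confirm whether requests is actually being redirected by inspecting the redirect chain it records. A small sketch reusing the headers above (the imgur ID is just the one from the question):

import requests

url = "https://i.imgur.com/ssXK5"
headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Mobile Safari/537.36"}

resp = requests.get(url, headers=headers)

# response.history lists every intermediate redirect response,
# and response.url is the final URL that was actually fetched.
for hop in resp.history:
    print(hop.status_code, hop.url)
print('final:', resp.status_code, resp.url)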

Use Post to change page

I've been using Selenium for some time to scrape a website, but for some reason it doesn't work anymore. I was using Selenium because you need to interact with the site to flip through pages (i.e., click on a next button).
As a solution, I was thinking of using the POST method from requests. I'm not sure if it's doable since I've never used the POST method and I'm not familiar with what it does (though I kind of understand the general idea).
My code would look something like this:
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent":
           "Mozilla/5.0 (Macintosh; Intel Mac OS X 10 11 5) "
           "AppleWebKit/537.36 (KHTML, like Gecko) "
           "Chrome/50.0.2661.102 Safari/537.36"}

url = "https://www.centris.ca/fr/propriete~a-vendre?view=Thumbnail"

def infinity():
    while True:
        yield

c = 0
urls = []
for i in infinity():
    c += 1
    page = list(str(soup.find("li", {"class": "pager-current"}).text).split())
    pageTot = int("".join(page[-2:]))  # Check the total number of pages
    if c <= pageTot:
        if c <= 1:  # Scrape the first page
            req = requests.get(url, headers=headers)
        else:
            pass
            # This is where I'm stuck, but ideally I'd be using the POST method in some way
        soup = BeautifulSoup(req.content, "lxml")
        for link in soup.find_all("a", {"class": "a-more-detail"}):
            try:  # For each page, scrape the ad URLs
                urls.append("https://www.centris.ca" + link["href"])
            except KeyError:
                pass
    else:  # When all pages are scraped, exit the loop
        break

for url in list(dict.fromkeys(urls)):
    pass  # do stuff
This is what is going on when you click next on the webpage: the browser sends a POST request (the startPosition parameter begins at 0 on page 1 and increases in steps of 12), and this is part of the response:
{"d":{"Message":"","Result":{"html": [...], "count":34302,"inscNumberPerPage":12,"title":""},"Succeeded":true}}
With that information, is it possible to use the POST method to scrape every page? And how could I do that?
The following should do the trick. I've added duplicate filtering logic to avoid printing duplicate links. The script should break once there are no more results left to scrape.
import requests
from bs4 import BeautifulSoup

base = 'https://www.centris.ca{}'
post_link = 'https://www.centris.ca/Property/GetInscriptions'
url = 'https://www.centris.ca/fr/propriete~a-vendre?view=Thumbnail'

unique_links = set()
payload = {"startPosition": 0}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    s.headers['content-type'] = 'application/json; charset=UTF-8'
    s.get(url)  # Send this request first to pick up the cookies
    while True:
        r = s.post(post_link, json=payload)
        if not len(r.json()['d']['Result']['html']):
            break
        soup = BeautifulSoup(r.json()['d']['Result']['html'], "html.parser")
        for item in soup.select(".thumbnailItem a.a-more-detail"):
            unique_link = base.format(item.get("href"))
            if unique_link not in unique_links:
                print(unique_link)
                unique_links.add(unique_link)
        payload['startPosition'] += 12  # Move to the next page of 12 results

Can't scrape the value of a certain field from a webpage using requests

I'm trying to scrape the value of Balance from a webpage using the requests module. I've looked for the name Balance in dev tools and in the page source but found it nowhere. I hope there is some way to grab the Balance value from that webpage (the address is the link used in the code below) without using any browser simulator.
I've tried with:
import requests
from bs4 import BeautifulSoup
link = 'https://tronscan.org/?fbclid=IwAR2WiSKZoTDPWX1ufaAIEg9vaA5oLj9Yd_RUfpjE6MWEQKRGBaK-L_JdtwQ#/contract/TCSPn1Lbdv62QfSCczbLdwupNoCFYAfUVL'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"}
res = requests.get(link,headers=headers)
soup = BeautifulSoup(res.text,'lxml')
balance = soup.select_one("li:has(> p:contains('Balance'))").get_text(strip=True)
print(balance)
The reason the page's HTML doesn't have the balance is that the page makes AJAX requests which send back the information you want after the page is loaded. You can look at these requests by opening your developer window (press F12 in Chrome; it might be different in other browsers) and going to the Network tab.
There you can see that the request you want is account?address= followed by the code that is in the URL string for the page; hovering over it shows the complete URL for the AJAX request, and the response to that request holds the data you want.
You can look at the response by opening that URL directly and finding tokenBalances.
In order to get the balance in Python you can run the following:
import requests, json
url = 'https://apilist.tronscan.org/api/account?address=TCSPn1Lbdv62QfSCczbLdwupNoCFYAfUVL'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"}
response = requests.get(url, headers=headers)
response = json.loads(response.text)
balance = response['tokenBalances'][0]['balance']
print(balance)

How to visit a link and stay on it for a specific number of seconds?

I have a problem with staying on a link: the page has a timer, and I need to stay on it without moving anywhere else. Here is my code:
import requests
import time
from bs4 import BeautifulSoup

url = input('Please enter your link: ')

def get_response(url, method='GET'):
    response = requests.request(method, url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win32; x86) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"}, timeout=15)
    text_response = response.text
    status_code = response.status_code
    return [status_code, text_response]

while True:
    (status_code, text_response) = get_response(url)
    parse_data = BeautifulSoup(text_response, 'html.parser')
    time.sleep(20)
    print('done')
    exit()
The link opens, but not the way it does in a browser.
The timer runs on JavaScript code on the website, and the requests library doesn't execute the website's JavaScript.
Use Selenium instead; it allows you to control a browser and run the website's JavaScript.
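A minimal sketch of that approach, assuming a Chrome driver is installed and available on the PATH (the 20-second wait just mirrors the sleep used in the question):

import time
from selenium import webdriver

url = input('Please enter your link: ')

driver = webdriver.Chrome()  # assumes chromedriver is installed and on PATH
driver.get(url)              # the page's JavaScript (including the timer) runs in the real browser

time.sleep(20)               # stay on the page; adjust to however long the site's timer requires

print('done')
driver.quit()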

I get nothing when trying to scrape a table

So I want to extract the number 45.5 from here: https://www.myscore.com.ua/match/I9pSZU2I/#odds-comparison;over-under;1st-qrt
But when I try to find the table I get nothing. Here's my code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.myscore.com.ua/match/I9pSZU2I/#odds-comparison;over-under;1st-qrt'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux armv7l) AppleWebKit/537.36 (KHTML, like Gecko) Raspbian Chromium/65.0.3325.181 Chrome/65.0.3325.181 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
text = soup.find_all('table', class_ = 'odds sortable')
print(text)
Can anybody help me extract the number and store its value in a variable?
You can try to do this without Selenium by recreating the dynamic request that loads the table.
Looking around in the Network tab of the page, I saw this XMLHttpRequest: https://d.myscore.com.ua/x/feed/d_od_I9pSZU2I_ru_1_eu
Try to reproduce the same parameters as that request.
To access the Network tab: right-click -> Inspect element -> Network tab -> select XHR and find the second request.
The final code would be like this:
import requests

headers = {'x-fsign': 'SW9D1eZo'}
page = requests.get('https://d.myscore.com.ua/x/feed/d_od_I9pSZU2I_ru_1_eu',
                    headers=headers)
You should check whether the x-fsign value is different based on your browser/IP.
