Web parsing without Selenium - Python

I am trying to parse the following website in order to get the addresses of all stores (apologies, the site is in Russian):
http://magnit-info.ru/buyers/adds/1258/14/243795
The addresses for a single city are listed at the end of the page.
The addresses are placed in the .b-shops-list block, which is populated dynamically by a POST request. When I tried to use the requests module to get the addresses, it did not work, since the block is empty in the initial page source.
I am using Selenium right now, but it is really slow. Parsing all cities and regions takes about 2 hours (even with multiprocessing). I also have to use expected_conditions and wait about 4-5 seconds to be sure that the POST requests have completed.
Are there any options to accelerate this process? Can I somehow send the POST requests using requests? If so, how do I figure out what kind of POST requests I should send? This question also applies to websites that use Google Maps.
Thank you!

I had a look at the AJAX request that this page makes to load the addresses and came up with this small code snippet:
import requests

data = {
    'op': 'get_shops',
    'SECTION_ID': 1258,
    'RID': 14,
    'CID': 243795,
}
res = requests.post('http://magnit-info.ru/functions/bmap/func.php', data=data)
addresses = res.json()
If you check the data dictionary, you can see that it could easily be generated from the URL you linked.
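For illustration, here is a minimal sketch of how the payload could be derived from such a URL; the assumption that the last three path segments map to SECTION_ID, RID, and CID is inferred from the single example above:

import requests

def get_shops(url):
    # Assumed path layout: .../buyers/adds/<SECTION_ID>/<RID>/<CID>
    section_id, rid, cid = url.rstrip('/').split('/')[-3:]
    data = {
        'op': 'get_shops',
        'SECTION_ID': section_id,
        'RID': rid,
        'CID': cid,
    }
    res = requests.post('http://magnit-info.ru/functions/bmap/func.php', data=data)
    return res.json()

addresses = get_shops('http://magnit-info.ru/buyers/adds/1258/14/243795')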

Related

scrape live scores from oddsportal live odds page using requests

I want to scrape in-play odds and scores.
I succeeded in getting live odds data using the code below, but could not find the live scores:
import requests, re, time
from bs4 import BeautifulSoup
url = f"https://fb.oddsportal.com/feed/livegames/live/1/0.dat?_{int(time.time() * 1000)}"
headers = {'User-Agent': 'curl/7.64.0','Referer': 'https://www.oddsportal.com/inplay-odds/live-now/soccer/'}
r = requests.get(url, headers=headers)
live_html = re.findall(r'<table class=.*table>', r.text)[0].replace("\\","")
soup = BeautifulSoup(live_html, 'html.parser')
I tried searching in Developer Tools > Sources > Page, but I can't find any source that provides the live scores.
Live odds and scores on most websites arrive over WebSockets, so they cannot be scraped with the usual request/response approach. There are some tricks to do it, though, depending on how strong the website's authentication protocol is. You can refer to the link below and try it for your use case:
https://towardsdatascience.com/websocket-retrieve-live-data-f539b1d9db1e
Where there is a will, there is a way.
Have you tried using Selenium? (I've heard it might be slower, however.)
If not, try OCR: you can decipher the text on every frame change.
Use Python's websocket-client package to retrieve the live data.
First, you need to copy your web browser's headers and use json.dumps to convert them into string format.
Additionally, you have to do a handshake, i.e., send messages to the website and receive messages back when connecting to the websocket.
After that, create the connection to the server using create_connection, then perform the handshake by sending the message, and you will be able to see the data on your side.
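A minimal sketch of that flow, assuming websocket-client is installed (pip install websocket-client); the endpoint, headers, and handshake message below are placeholders you would copy from the WebSocket frames in your browser's Network tab:

from websocket import create_connection  # pip install websocket-client

# Placeholder values: copy the real endpoint, headers, and handshake
# message from the WebSocket frames in your browser's developer tools.
ws_url = 'wss://example.com/live-feed'
headers = ['Origin: https://example.com',
           'User-Agent: Mozilla/5.0']

ws = create_connection(ws_url, header=headers)
ws.send('{"op": "subscribe", "channel": "scores"}')  # handshake message
try:
    while True:
        print(ws.recv())  # live frames arrive here
finally:
    ws.close()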

Using python, Is it possible to directly send form data to a website server and receive response without using a browser?

I took a programming class in Python, so I know the basics of the language. A project I'm currently attempting involves submitting a form repeatedly until the request is successful. To make the program succeed faster, I thought that cutting the browser out of the process and sending and receiving data directly from the server would be quicker. Also, the website I'm writing the program for has a tendency to crash, but I'm fairly sure I could still send requests to and receive responses from the server. Currently, I'm just researching different resources I could use to complete the task. I understand mechanize makes it easy to fill in and submit forms, but it requires a browser. So my question is: what would be the best resource within Python for communicating directly with the server, without a browser?
I apologize if any of my knowledge is flawed. I did take the class, but I'm still relatively new to the language.
Yes, there are plenty of ways to do this, but the easiest is the third-party library called requests.
With that installed, you can do, for example:
import requests

requests.post("https://mywebsite/path", data={"key": "value"})
You can try the approach below, which uses only the standard library:
from urllib.parse import urlencode
from urllib.request import Request, urlopen

url = 'https://httpbin.org/post'  # Set destination URL here
post_fields = {'foo': 'bar'}      # Set POST fields here

request = Request(url, urlencode(post_fields).encode())
response_body = urlopen(request).read().decode()
print(response_body)
I see from your tags that you've already decided to use requests.
Here's how to perform a basic POST request using requests:
Typically, you want to send some form-encoded data, much like an HTML form. To do this, simply pass a dictionary to the data argument. Your dictionary of data will automatically be form-encoded when the request is made.
import requests
payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post("http://httpbin.org/post", data=payload)
print(response.text)
I took this example from the official requests documentation.
I suggest you read it and also try the other examples available, in order to become more confident and decide which approach suits your task best.

Python requests, change IP address

I am coding a web scraper for a website with the following Python code:
import requests

def scrape(url):
    req = requests.get(url)
    with open('out.html', 'w') as f:
        f.write(req.text)
It works a few times, but then the website returns an error HTML page (and when I open my browser, I have a captcha to complete).
Is there a way to avoid this "ban", for example by changing the IP address?
As already mentioned in the comments, and by yourself, changing the IP could help. To do this quite easily, have a look at vpngate.py:
https://gist.github.com/Lazza/bbc15561b65c16db8ca8
A how-to is provided at the link.
You can use a proxy with the requests library. You can find some free proxies at a couple different websites like https://www.sslproxies.org/ and http://free-proxy.cz/en/proxylist/country/US/https/uptime/level3 but not all of them work and they should not be trusted with sensitive information.
Example:
import requests

proxy = {
    'https': 'https://158.177.252.170:3128',
    'http': 'https://158.177.252.170:3128',
}
response = requests.get('https://httpbin.org/ip', proxies=proxy)
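Since free proxies are unreliable, here is a minimal sketch of rotating through a list of them until one responds (the addresses below are placeholders):

import requests

# Placeholder proxies: substitute live ones from a proxy list site.
proxies_to_try = [
    'https://158.177.252.170:3128',
    'https://203.0.113.5:8080',
]

def get_with_proxy_rotation(url):
    for address in proxies_to_try:
        proxy = {'https': address, 'http': address}
        try:
            return requests.get(url, proxies=proxy, timeout=5)
        except requests.exceptions.RequestException:
            continue  # dead proxy, try the next one
    raise RuntimeError('No working proxy found')

response = get_with_proxy_rotation('https://httpbin.org/ip')
print(response.text)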
I recently answered this on another question here, but using the requests-ip-rotator library to rotate IPs through AWS API Gateway is usually the most effective way.
It's free for the first million requests per region, and it means you won't have to give your data to unreliable proxy sites.
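A minimal sketch of that approach, following the library's documented usage (AWS credentials must be configured, and the target site below is a placeholder):

import requests
from requests_ip_rotator import ApiGateway  # pip install requests-ip-rotator

site = 'https://site.example'  # placeholder target

# Spin up AWS API Gateway endpoints in front of the target site.
gateway = ApiGateway(site)
gateway.start()

session = requests.Session()
session.mount(site, gateway)  # route matching requests through the gateway

response = session.get(site + '/index.html')  # each request exits from a fresh IP
print(response.status_code)

gateway.shutdown()  # tear the gateway endpoints back down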
Late answer; I found this while looking for IP spoofing, but to the OP's question: as some comments point out, you may or may not actually be getting banned. Here are two things to consider:
A soft ban: they don't like bots. A simple solution that has worked for me in the past is to add headers so they think you're a browser, e.g.,
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
On-page active elements, scripts, or popups that act as content gates rather than a ban per se, e.g., country/language selectors, cookie configuration, surveys, etc. requiring user input. A not-as-simple solution: use a webdriver like Selenium + chromedriver to render the page, including the JS, and then add "user" clicks to deal with the problems, as sketched below.
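A minimal sketch of that webdriver approach; the consent-button selector is hypothetical and would need to match the actual page:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires chromedriver to be installed
driver.get('https://example.com/page')

# Hypothetical selector: dismiss a cookie/consent gate with a "user" click.
driver.find_element(By.CSS_SELECTOR, 'button.accept-cookies').click()

html = driver.page_source  # fully rendered HTML, JS included
driver.quit()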

Extract HTML-Content from URL of Site that probably uses Cookies via Python

I recently wanted to extract data from a website that seems to use cookies to grant me access. I do not know very much about those procedures, but apparently this interferes with my method of getting the HTML content of the website via Python and its requests module.
The code I am running to extract the information contains the following lines:
import requests
#...
response = requests.get(url, proxies=proxies)
content = response.text
The website I am referring to is http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6675630&tag=1, and proxies is a valid dict of my proxy servers (I tested those settings on other websites, where they worked fine). However, instead of the content of the article on this site, I receive the HTML content of the page that you get when you do not accept cookies in your browser.
As I am not really aware of what the website is actually doing, and I lack real web development experience, I have not found a solution so far, even though a similar question might have been asked before. Is there any way to access the content of this website via Python?
import requests

# First request: obtain the cookies the site sets for the session.
startr = requests.get('https://viennaairport.com/login/')
# Second request: send those cookies back so the server recognises the session.
secondr = requests.post('http://xxx/', cookies=startr.cookies)
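Alternatively (this is not in the original answer, but a standard requests feature), requests.Session persists cookies across requests automatically:

import requests

session = requests.Session()
session.get('https://viennaairport.com/login/')  # cookies are stored on the session
response = session.post('http://xxx/')           # and sent back automatically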

Extracting a table from a website

I've tried many times to retrieve the table at this website:
http://www.whoscored.com/Players/845/History/Tomas-Rosicky
(the one under "Historical Participations")
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.whoscored.com/Players/845/').read())
This is the Python code I am using to retrieve the table HTML, but I am getting an empty string. Help me out!
The desired table is formed via an asynchronous API call to the http://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics endpoint, the request to which returns a JSON response. In other words, urllib2 gives you only the initial HTML content of the page, without the "dynamic" part; urllib2 is not a browser.
You can study the request using your browser's developer tools.
Now you need to simulate this request in your code; the requests package is something you should consider using.
Here is a similar question about whoscored.com that I've answered before; it contains sample working code you can use as a starting point:
XHR request URL says does not exist when attempting to parse it's content
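For illustration, a sketch of simulating such a request with requests; the endpoint comes from the answer above, but the query parameters and headers are assumptions you would confirm in the developer tools' Network tab:

import requests

url = 'http://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics'

# Hypothetical parameters: copy the real query string from the Network tab.
params = {'playerId': 845, 'category': 'summary'}

# Such endpoints often reject requests that lack browser-like headers.
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'http://www.whoscored.com/Players/845/',
    'X-Requested-With': 'XMLHttpRequest',
}

response = requests.get(url, params=params, headers=headers)
data = response.json()  # the table rows arrive as JSON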
