Extracting a table from a website - python

I've tried many times to retrieve the table at this website:
http://www.whoscored.com/Players/845/History/Tomas-Rosicky
(the one under "Historical Participations")
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.whoscored.com/Players/845/').read())
This is the Python code I am using to retrieve the table html, but I am getting an empty string. Help me out!

The desired table is formed via an asynchronous API call to the http://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics endpoint, which returns a JSON response. In other words, urllib2 gives you only the initial HTML of the page, without the "dynamic" part: urllib2 is not a browser.
You can study the request using your browser's developer tools (Network tab).
Now, you need to simulate this request in your code. The requests package is something you should consider using.
Here is a similar question about whoscored.com that I've answered before; there is sample working code you can use as a starting point:
XHR request URL says does not exist when attempting to parse it's content
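In outline, the simulation looks like this (a sketch only: the parameter names below are placeholders I'm assuming for illustration, so copy the real query string and headers from the request you see in the Network tab):
import requests

url = 'http://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics'
params = {'playerId': 845}  # assumption: the id taken from the player page URL
headers = {
    'User-Agent': 'Mozilla/5.0',           # look like a browser
    'X-Requested-With': 'XMLHttpRequest',  # mark the request as AJAX
}
response = requests.get(url, params=params, headers=headers)
data = response.json()  # the endpoint answers with JSON, not HTML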

Related

Webscraping with python - with interactive website

Can anybody recommend a python package to extract data from the Dutch met office website:
https://www.knmi.nl/nederland-nu/weer/waarschuwingen-en-verwachtingen/weer-en-klimaatpluim
The site shows graphs with forecasts of temperature, rainfall, etc. You can click on the graph and select that the underlying data is shown in a table.
Which Python package can I use to go to the site and extract the table data for the different forecasts into a dataframe?
Thanks
You can simply use BeautifulSoup
Do not listen to the others: the data on this particular page is loaded dynamically with JavaScript, so BeautifulSoup will not be able to scrape it.
Tip
Only scrape with BeautifulSoup if you have to; your first port of call should be to find the API endpoint.
You can send a request to the API endpoint and get back JSON containing the page data.
If you are using a Chromium browser, press CTRL + SHIFT + I and select the Network tab. Click Clear, then click Record.
When you refresh the page, the table below will fill up with requests.
Search the Name column for the JSON requests, then use the request URL with the code below to return the JSON:
import requests

# Endpoint found by watching the Network tab while the page loads
target_url = "https://cdn.knmi.nl/knmi/json/page/weer/waarschuwingen_verwachtingen/ensemble/iPluim/260_99999.json"
r = requests.get(target_url)
weather_data = r.json()  # parse the JSON response body
print(weather_data['series'])
If you can work out the API parameters, you can construct your own requests.
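To get the data into the dataframe you asked for, here is a minimal sketch, assuming each entry in 'series' follows the usual charting convention of a name plus a data list of [time, value] pairs (inspect the JSON to confirm the actual field names):
import pandas as pd
import requests

url = "https://cdn.knmi.nl/knmi/json/page/weer/waarschuwingen_verwachtingen/ensemble/iPluim/260_99999.json"
series = requests.get(url).json()['series']

# Assumption: each series looks like {"name": ..., "data": [[time, value], ...]}
frames = []
for s in series:
    df = pd.DataFrame(s['data'], columns=['time', 'value'])
    df['series'] = s['name']
    frames.append(df)

table = pd.concat(frames, ignore_index=True)
print(table.head())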
Selenium can also be useful for dynamic sites: https://www.selenium.dev/documentation/

Requests.get() returns correct data on the first run and then returns html code

So I am using the code below to get JSON data about my Instagram account using Python and the requests library.
import requests
a = requests.get('https://instagram.com/asjadmurtaza/?__a=1')
print(a.text)
When I run it for the first time, it returns the correct output (the same output Chrome returns when I put the URL in the address bar), but when I run it again it starts returning the HTML of the Instagram login page instead of JSON data about my account. Any idea what I am doing wrong?
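The redirect to the login page usually means Instagram has stopped trusting the client (rate limiting or bot detection). A sketch of a common mitigation, reusing one session and sending a browser-like User-Agent, with no guarantee against Instagram's current checks:
import requests

session = requests.Session()  # keeps any cookies Instagram sets across calls
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # look like a browser
a = session.get('https://instagram.com/asjadmurtaza/?__a=1')
print(a.text)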

Python 404'ing on urllib.request

The basics of the code are below. I know for a fact that the way I'm retrieving these pages works for other URLs, as I just wrote a script scraping a different page in the same way. However, with this specific URL it keeps throwing "urllib.error.HTTPError: HTTP Error 404: Not Found" in my face. I replaced the URL with a different one (https://www.premierleague.com/clubs), and it works completely fine. I'm very new to Python, so perhaps there's a really basic step or piece of knowledge I haven't found, but the resources I've found online relating to this didn't seem relevant. Any advice would be great, thanks.
Below is the barebones of the script:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv

myurl = "https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1"
uClient = uReq(myurl)  # raises urllib.error.HTTPError: HTTP Error 404: Not Found
The problem is most likely that the site you are trying to access is actively blocking crawlers; you can change the user agent to circumvent this. See this question for more information (the solution prescribed in that post seems to work for your URL too).
If you want to keep using urllib, this post tells you how to alter the user agent.
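A minimal sketch of that approach (the Mozilla/5.0 string is just a generic browser-like value):
import urllib.request

url = "https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1"
# Send a browser-like User-Agent instead of urllib's default Python one
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()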
It is showing a 404 because the server claims the page doesn't exist, even though it loads fine in a browser.
You can try a different module, like requests.
This is the code for requests:
import requests
resp = requests.get("https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1")
print(resp.text) # gets source code
I hope it works!

Web parsing without Selenium

I am trying to parse the following website in order to get all addresses of stores (sorry for my Russian):
http://magnit-info.ru/buyers/adds/1258/14/243795
The addresses for just one city are at the end of the page.
The addresses are placed in the .b-shops-list block, which is populated dynamically by a POST request. When I tried to use the requests module to get the addresses, it did not work, since the block is empty in the initial page source.
I am using Selenium right now, but it is really slow: parsing all cities and regions takes about 2 hours (even with multiprocessing). I also have to use expected_conditions and wait about 4-5 seconds to be sure the POST requests have completed.
Is there any way to accelerate this process? Can I somehow send the POST requests using requests? If so, how do I figure out what kind of POST request I should send? This question is also relevant to websites that use Google Maps.
Thank you!
I had a look at the AJAX request that this page makes to load the addresses and came up with this small code snippet:
import requests

data = {
    'op': 'get_shops',
    'SECTION_ID': 1258,  # the three numbers come straight from the page URL:
    'RID': 14,           # /buyers/adds/1258/14/243795
    'CID': 243795,
}
res = requests.post('http://magnit-info.ru/functions/bmap/func.php', data=data)
addresses = res.json()
If you check the data dictionary you can clearly see that you could easily generate it from the URL you linked.
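For example, a small helper (hypothetical, assuming every store page follows the /buyers/adds/<SECTION_ID>/<RID>/<CID> pattern) could build the payload from a page URL:
from urllib.parse import urlparse

def params_from_url(url):
    # 'http://magnit-info.ru/buyers/adds/1258/14/243795' -> POST payload
    section_id, rid, cid = urlparse(url).path.rstrip('/').split('/')[-3:]
    return {'op': 'get_shops', 'SECTION_ID': section_id, 'RID': rid, 'CID': cid}

print(params_from_url('http://magnit-info.ru/buyers/adds/1258/14/243795'))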

Extract HTML-Content from URL of Site that probably uses Cookies via Python

I recently wanted to extract data from a website that seems to use cookies to grant access. I do not know very much about these procedures, but apparently this interferes with my method of getting the HTML content of the website via Python and its requests module.
The code I am running to extract the information contains the following lines:
import requests
#...
response = requests.get(url, proxies=proxies)
content = response.text  # the HTML body of the response
Here the website I am referring to is http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6675630&tag=1 and proxies is a valid dict of my proxy servers (I tested those settings on websites where they seemed to work fine). However, instead of the content of the article on this site, I receive the HTML of the page you get when you do not accept cookies in your browser.
As I am not really aware of what the website is doing and lack real web development experience, I could not find a solution so far, even if a similar question might have been asked before. Is there any solution for accessing the content of this website via Python?
You can carry the cookies the site sets on an initial request over to a follow-up request, for example:
import requests

# The first request collects the cookies the site sets...
startr = requests.get('https://viennaairport.com/login/')
# ...and the second request sends them back along with the POST.
secondr = requests.post('http://xxx/', cookies=startr.cookies)
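A requests.Session keeps cookies across requests automatically, so an alternative sketch would be the following (this assumes the IEEE page only needs its own cookies and no JavaScript, which may not hold):
import requests

session = requests.Session()  # one shared cookie jar for all calls
session.get('http://ieeexplore.ieee.org/')  # pick up the cookies the site sets
page = session.get('http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6675630&tag=1')
print(page.text)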
