Python 404'ing on urllib.request - python

The basics of the code are below. I know the way I'm retrieving these pages works for other URLs, as I just wrote a script scraping a different page in the same way. However, with this specific URL it keeps throwing "urllib.error.HTTPError: HTTP Error 404: Not Found". I replaced the URL with a different one (https://www.premierleague.com/clubs), and it works completely fine. I'm very new to Python, so perhaps there's a really basic step or piece of knowledge I haven't found, but the resources I've found online relating to this didn't seem relevant. Any advice would be great, thanks.
Below are the bare bones of the script:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv
myurl = "https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1"
uClient = uReq(myurl)

The problem is most likely that the site you are trying to access is actively blocking spiders crawling; you can change the user agent to circumvent it. See this question for more information (the solution prescribed in that post seems to work for your url too).
If you want to use urllib this post tells you how to alter the user agent.
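A minimal sketch of the user-agent change with urllib.request, assuming the server only inspects the User-Agent header. The header value and the helper names build_request and fetch are illustrative choices, not something the site documents.

```python
from urllib.request import Request, urlopen

# Assumption: any common browser-like User-Agent string is accepted;
# this particular value is only an example.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download url and return the raw response body as bytes."""
    # urlopen accepts a Request object in place of a plain URL string.
    with urlopen(build_request(url)) as response:
        return response.read()
```

Calling fetch(myurl) would then stand in for the uReq(myurl) call in the question, and the returned bytes can be handed to BeautifulSoup as before.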

The 404 means the server answered as though the page doesn't exist.
You can try a different module, such as requests.
This is the code using requests:
import requests
resp = requests.get("https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1")
print(resp.text) # gets source code
I hope it works!

Related

Cannot scrape using python.requests() but work when loading on browser

I want to scrape data from this page: https://raritysniffer.com/viewcollection/primeapeplanet
The API request works on the browser but returns 403 ERROR when I use python.requests.
requests.get("https://raritysniffer.com/api/index.php?query=fetch&collection=0x6632a9d63e142f17a668064d41a21193b49b41a0&taskId=any&norm=true&partial=true&traitCount=true")
I understand it is possible that I have to pass on specific headers to make it work, but as a python novice, I have no idea how to make it work. Please advise. Thanks!
If you check the response, you can see that the website uses Cloudflare, which is what returns the 403. To bypass this, try cloudscraper (be mindful when doing so).
import cloudscraper
url = 'https://raritysniffer.com/api/index.php?query=fetch&collection=0x6632a9d63e142f17a668064d41a21193b49b41a0&taskId=any&norm=true&partial=true&traitCount=true'
scraper = cloudscraper.create_scraper(browser = 'firefox')
print(scraper.get(url).text)
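To check the response yourself, something like the heuristic below may help. It is only a sketch: it assumes that a Cloudflare-fronted response carries a "Server: cloudflare" header or a "CF-RAY" header, and the function name looks_like_cloudflare_block is made up for illustration.

```python
def looks_like_cloudflare_block(status_code, headers):
    """Heuristic: True if a 403/503 response appears to come from Cloudflare."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    served_by_cloudflare = (
        lowered.get("server", "").lower() == "cloudflare" or "cf-ray" in lowered
    )
    return status_code in (403, 503) and served_by_cloudflare
```

With requests, this could be called as looks_like_cloudflare_block(resp.status_code, resp.headers) on the failing response.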

Fetching Data Tables

import urllib.request
with urllib.request.urlopen('https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol=ABOT') as response:
    html = response.read()
import pandas as pd
df=pd.read_html('https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol=ABOT')
print(df)
I've used two different snippets to fetch a data table from a website where the data is available for free. But every time I run my program I get the following error: 'urllib.error.HTTPError: HTTP Error 403: Forbidden'. Moreover, the link seems to work fine from a browser. Any idea how to solve this issue?
PS: Data can be seen without authentication.
I'm not sure why the server is raising a 403, exactly, but in general using urllib directly for a high-level request like this is discouraged. You should use the requests package instead.
The equivalent requests fetch:
import requests
r = requests.get("https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol=ABOT")
Works fine.
r.status_code == 200
True
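If you would rather stay with urllib and pandas, one possible approach is to download the HTML yourself and hand the text to pandas. This is a sketch under two assumptions: that the 403 is triggered by the default Python User-Agent, and that the page is UTF-8 compatible. The names fetch_html and tables_from_html, and the opener parameter, are illustrative only.

```python
import io
from urllib.request import Request, urlopen


def fetch_html(url, opener=urlopen):
    """Fetch url with a browser-like User-Agent and return decoded text."""
    # Assumption: the server rejects the default "Python-urllib" User-Agent.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with opener(request) as response:
        return response.read().decode("utf-8", errors="replace")


def tables_from_html(html):
    """Parse every <table> in html into a list of DataFrames."""
    # Imported lazily; pandas also needs lxml (or bs4 + html5lib) for this.
    import pandas as pd

    return pd.read_html(io.StringIO(html))
```

Passing the URL straight to pd.read_html makes pandas do its own download, so it would likely hit the same 403; tables_from_html(fetch_html(url)) keeps the download under your control.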

Extracting a table from a website

I've tried many times to retrieve the table at this website:
http://www.whoscored.com/Players/845/History/Tomas-Rosicky
(the one under "Historical Participations")
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.whoscored.com/Players/845/').read())
This is the Python code I am using to retrieve the table html, but I am getting an empty string. Help me out!
The desired table is built by an asynchronous API call to the http://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics endpoint, which returns a JSON response. In other words, urllib2 returns only the initial HTML content of the page, without the "dynamic" part; urllib2 is not a browser.
You can study the request using browser developer tools.
Now you need to simulate this request in your code. The requests package is something you should consider using.
Here is a similar question about whoscored.com that I've answered before; it has sample working code you can use as a starting point:
XHR request URL says does not exist when attempting to parse it's content
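As a rough sketch of what simulating the request could look like, the code below builds a GET request that resembles a browser XHR call. It uses the standard library so it runs without extra installs; the same headers can be passed to requests.get instead. The query parameters the endpoint actually expects are not given here, so params is a placeholder you would fill in from the developer tools, and the header set is an assumption about what the server checks.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

FEED_URL = "http://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics"


def build_feed_request(params, referer):
    """Build a GET request that resembles the browser's XHR call."""
    # params: copy the real query parameters from the browser's network tab.
    url = FEED_URL + "?" + urlencode(params)
    headers = {
        "User-Agent": "Mozilla/5.0",
        # Commonly sent by browser-side AJAX libraries; some servers check it.
        "X-Requested-With": "XMLHttpRequest",
        "Referer": referer,
    }
    return Request(url, headers=headers)


def fetch_feed(params, referer):
    """Send the request and decode the JSON body."""
    with urlopen(build_feed_request(params, referer)) as response:
        return json.loads(response.read().decode("utf-8"))
```

The referer would normally be the player page URL from the question, since that is the page that issues the call in the browser.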

Unable to get page source code in python

I'm trying to get the source code of a page by using:
import urllib2
url="http://france.meteofrance.com/france/meteo?PREVISIONS_PORTLET.path=previsionsville/750560"
page =urllib2.urlopen(url)
data=page.read()
print data
and also by using a user agent (headers).
I did not succeed in getting the source code of the page!
Do you have any ideas about what can be done?
Thanks in advance.
I tried it and the request works, but the content that you receive says that your browser must accept cookies (in French). You could probably get around that with urllib2, but I think the easiest way would be to use the requests lib (if you don't mind having an additional dependency).
To install requests:
pip install requests
And then in your script:
import requests
url = 'http://france.meteofrance.com/france/meteo?PREVISIONS_PORTLET.path=previsionsville/750560'
response = requests.get(url)
print(response.content)
I'm pretty sure the source code of the page will be what you expect then.
The requests library worked for me, as Martin Maillard showed.
Also, in another thread I noticed this note by leoluk:
Edit: It's 2014 now, and most of the important libraries have been
ported and you should definitely use Python 3 if you can.
python-requests is a very nice high-level library which is easier to
use than urllib2.
So I wrote this get_page procedure:
import requests
def get_page(website_url):
    # Fetch the page and return the raw body (bytes)
    response = requests.get(website_url)
    return response.content
print(get_page('http://example.com'))
Cheers!
I tried a lot of things, including urllib, urllib2 and several others, but one thing worked for everything I needed and solved every problem I faced: Mechanize. This library simulates using a real browser, so it handles a lot of issues in that area.

Loading cookies in python

I am a novice programmer attempting to access Google Insights using Python. I can access sites which don't require cookies fine, but I can't seem to properly pass the cookies along. The cookies file was exported from Mozilla Firefox and is on the Z: drive, which is also where I'm running Python from.
I'm also pretty sure my code for saving the file could be done better than by reading and writing, but I don't know how to do that either. Any help would be appreciated.
import urllib2
import cookielib
import os
url = "http://www.google.com/insights/search/overviewReport?q=eagles%2Ccsco&geo=US&cmpt=q&content=1&export=2"
cj = cookielib.MozillaCookieJar()
cj.load('cookies6.txt')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
file = opener.open(url)
output = open('test2.csv','wb')
output.write(file.read())
output.close()
I haven't tested your code, however:
As far as I can tell, there seems to be nothing wrong with your code.
I've tried the URL you're requesting and had no problems downloading the CSV without any cookies.
In my previous experience with Google, you might be looking at the problem the wrong way: it is not that you don't have the right cookies, but that Google automatically blocks requests from bots. If this is the case, you must replace the User-Agent HTTP header to mimic an actual browser. Beware, however, that this is against Google's terms of service, and if you make too many requests per minute Google will block all requests from your IP for about 8 hours.
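For reference, a Python 3 sketch of the same idea, where urllib2 and cookielib became urllib.request and http.cookiejar. It assumes the cookie file is in the Netscape cookies.txt format that MozillaCookieJar reads; the function names and the default User-Agent value are illustrative.

```python
import http.cookiejar
import urllib.request


def build_opener_with_cookies(cookie_file, user_agent="Mozilla/5.0"):
    """Return (opener, jar): an opener sending saved cookies and a custom User-Agent."""
    jar = http.cookiejar.MozillaCookieJar()
    # Expects the Netscape cookies.txt format; raises LoadError otherwise.
    jar.load(cookie_file)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Replaces the default "Python-urllib/x.y" User-Agent header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def download(opener, url, destination):
    """Save the response body to destination; both handles close automatically."""
    with opener.open(url) as response, open(destination, "wb") as output:
        output.write(response.read())
```

With the names from the question, this would be used as opener, jar = build_opener_with_cookies('cookies6.txt') followed by download(opener, url, 'test2.csv'). The with statement also takes care of closing the output file, which covers the question about saving the file more cleanly.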
