Fetching Data Tables - python

import urllib.request
import pandas as pd

# Attempt 1: fetch the raw HTML with urllib
with urllib.request.urlopen('https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol=ABOT') as response:
    html = response.read()

# Attempt 2: let pandas fetch and parse the table directly
df = pd.read_html('https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol=ABOT')
print(df)
I've used two different code snippets to fetch a data table from a website where the data is available for free. But every time I run my program I get the following error: urllib.error.HTTPError: HTTP Error 403: Forbidden. Moreover, the link seems to work fine from a browser. Any idea how to solve this issue?
PS: Data can be seen without authentication.

I'm not sure why the server is returning a 403, exactly, but in general using urllib directly for a high-level request like this is discouraged. You should use the requests package instead.
The equivalent requests fetch:
import requests
r = requests.get("https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol=ABOT")
Works fine.
r.status_code == 200
True
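If the 403 comes back even with requests (some servers reject the default Python client string), a rough sketch along these lines may help; the User-Agent value is only an example, and the fetched HTML can then be handed straight to pandas:

import pandas as pd
import requests

url = ("https://pakstockexchange.com/stock2/index_new.php"
       "?section=research&page=show_price_table_new&symbol=ABOT")

# A browser-like User-Agent is an assumption: some servers answer the
# default python-requests/urllib client strings with 403 Forbidden.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}

r = requests.get(url, headers=headers)
r.raise_for_status()  # fail loudly on any remaining 4xx/5xx

# read_html parses every <table> in the page into a list of DataFrames
# (assumes the page contains at least one <table>)
tables = pd.read_html(r.text)
print(tables[0].head())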

Related

Freedom of information act API. API key error

I am having some trouble running the Freedom of Information Act API in Python. I am sure it is related to how I am implementing my API key, but I am uncertain as to where I am dropping the ball. Any help is greatly appreciated.
import requests
apikey= ''
api_base_url = f"https://api.foia.gov/api/webform/submit"
endpoint = f"{api_base_url}{apikey}"
r = requests.get(endpoint)
print(r.status_code)
print(r.text)
The error I receive is requests.exceptions.InvalidSchema: No connection adapters were found for this website. Thanks again.
According to the documentation, the API requires the API key to be passed as a request header ("X-API-Key"). Your Python code appears to be simply concatenating the API key and the URL.
The following Q&A explains how to set a request header using requests.
Using headers with the Python requests library's get method
It would be something like this:
import requests

apikey = ...
api_base_url = ...
r = requests.get(api_base_url,
                 headers={"X-API-Key": apikey})
print(r.status_code)
print(r.text)
Note that the documentation for the FOIA site explains what you need to do to submit a FOIA request form. It is significantly different from what your Python code is apparently trying to do. I would advise you to read the documentation. Also read the manual entry for the "curl" command so that you understand the requests that the examples show.
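As a very rough sketch only: if submitting the form turns out to be a POST with a JSON body (an assumption here, to be confirmed against the documentation), it might look like the following. The payload keys are hypothetical placeholders; the real field names come from the agency's webform schema.

import requests

apikey = "..."
api_base_url = "https://api.foia.gov/api/webform/submit"

# Hypothetical payload: the actual field names depend on the agency's
# webform schema described in the FOIA API documentation.
payload = {"example_field": "example value"}

r = requests.post(api_base_url,
                  headers={"X-API-Key": apikey},
                  json=payload)
print(r.status_code)
print(r.text)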

Python 404'ing on urllib.request

The basics of the code are below. I know for a fact that the way I'm retrieving these pages works for other URLs, as I just wrote a script scraping a different page in the same way. However, with this specific URL it keeps throwing "urllib.error.HTTPError: HTTP Error 404: Not Found" in my face. I replaced the URL with a different one (https://www.premierleague.com/clubs), and it works completely fine. I'm very new to Python, so perhaps there's a really basic step or piece of knowledge I haven't found, but the resources I've found online relating to this didn't seem relevant. Any advice would be great, thanks.
Below is the barebones of the script:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv
myurl = "https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1"
uClient = uReq(myurl)  # raises urllib.error.HTTPError: HTTP Error 404: Not Found
The problem is most likely that the site you are trying to access is actively blocking spiders from crawling it; you can change the user agent to circumvent this. See this question for more information (the solution described in that post seems to work for your URL too).
If you want to stick with urllib, this post tells you how to alter the user agent; see the sketch below.
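A minimal sketch of that approach with urllib.request; the User-Agent string is only an example of a browser-like value:

from urllib.request import Request, urlopen

myurl = "https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1"

# Send a browser-like User-Agent so the site does not serve the 404 it
# appears to return for the default Python client string.
req = Request(myurl, headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})
html = urlopen(req).read()
print(html[:200])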
It is showing a 404 because the server is telling urllib that the page doesn't exist.
You can try with a different module like requests.
Here is the code using requests:
import requests
resp = requests.get("https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1")
print(resp.text) # gets source code
I hope it works!

Python "requests" module truncating responses

When I use the Python requests module, calling requests.get(url), I have found that the response from the URL is being truncated.
import requests
url = 'https://gtfsrt.api.translink.com.au/Feed/SEQ'
response = requests.get(url)
print response.text
The response I get from the URL is being truncated. Is there a way to get requests to retrieve the full set of data and not truncate it?
Note: The given URL is a public transport feed which puts out a huge quantity of data during the peak of the day.
I ran into the same issue. The problem is not your Python code. It is more likely PyCharm or whatever console utility you are using: the console has a buffer limit, and you may have to increase it to see your full response.
Refer to this article for more help:
Increase output buffer when running or debugging in PyCharm
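Another way to rule out console truncation is to write the body to a file and inspect it there. A minimal sketch; the filename is arbitrary, and response.content is used because the feed appears to be binary (GTFS-realtime) data:

import requests

url = 'https://gtfsrt.api.translink.com.au/Feed/SEQ'
response = requests.get(url)

# Write the raw bytes to disk so no console buffer can truncate them.
with open('feed.bin', 'wb') as f:
    f.write(response.content)

print(len(response.content), 'bytes written to feed.bin')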
Add "gzip":true to your request options.
That fixed the issue for me.

urllib2 timeout

I'm using the urllib2 library in my code, and I'm calling urlopen (EDIT: loadurl) a lot.
I have a problem on my network: when I'm browsing sites, sometimes my browser gets stuck on "Connecting" to a certain website, and sometimes it returns a timeout.
My question is: if I use urllib2 in my code, will it time out when it takes too long to connect to a certain website, or will the code get stuck on that line?
I know that urllib2 can handle timeouts without specifying one in code, but does that apply to this kind of situation?
Thanks for your time
EDIT :
def checker(self):
    try:
        html = self.loadurl("MY URL HERE")
        if self.ip_ != html:
            (...)
    except Exception, error:
        html = "bad"
From my small research, the timeout parameter of urllib2.urlopen() was added in Python 2.6,
so the timeout problem should be resolved by passing a custom timeout to the urllib2.urlopen function. The code should look like this:
response = urllib2.urlopen("---insert url here---", None, your-timeout-value)
The your-timeout-value parameter is optional and defines the timeout in seconds.
EDIT: From your comment, I gather that you don't want the code to wait too long, so you can use the following to avoid getting stuck:
import socket
import urllib2
socket.setdefaulttimeout(10)
The value 10 (seconds) can be adjusted according to your connection speed and how long the website normally takes to load.
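Putting the two together, a minimal sketch; the URL is a placeholder, and the exception handling mirrors the checker above:

import socket
import urllib2

socket.setdefaulttimeout(10)  # applies to every new connection in the process

try:
    # The per-call timeout (in seconds) overrides the default for this request.
    response = urllib2.urlopen("http://example.com", None, 10)
    html = response.read()
except urllib2.URLError, error:
    # Connection failures and connect timeouts surface as URLError.
    html = "bad"
except socket.timeout:
    # A timeout during read() can surface as socket.timeout directly.
    html = "bad"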

Unexpected behaviour with Python urllib

I am trying to consume a JSON response but I am seeing some very weird behaviour. The endpoint is a Java app running on Tomcat.
I want to load the following url
http://opendata.diavgeia.gov.gr/api/decisions?count=50&output=json_full&from=1
Using Ruby's open-uri I can load the JSON. If I hit it in the browser I still get the response. Once I try to use Python's urllib or urllib2 I get an error:
javax.servlet.ServletException: Could not resolve view with name 'jsonView' in servlet with name 'diavgeia-api'
It's quite strange, and I guess the error lies in the API server. Any hints?
The server appears to need an 'Accept' header:
>>> print urllib2.urlopen(
...     urllib2.Request(
...         "http://opendata.diavgeia.gov.gr/api/decisions?count=50&output=json_full&from=1",
...         headers={"accept": "*/*"})).read()[:200]
{"model":{"queryInfo":{"total":117458,"count":50,"order":"desc","from":1},"expandedDecisions":[{"metadata":{"date":1291932000000,"tags":{"tag":[]},"decisionType":{"uid":27,"label":"ΔΑΠΑΝΗ","extr
Two possibilities, neither of which holds water:
The server is only prepared to use HTTP 1.1 (which urllib apparently doesn't support, but urllib2 does)
It's doing user agent sniffing, and rejecting Python (I tried using Firefox's UA string instead, but it still gave me an error)
