Pandas read_csv from URL and include request header - python

As of Pandas 0.19.2, the function read_csv() can be passed a URL. See, for example, this answer:
import pandas as pd
url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
c=pd.read_csv(url)
The URL I'd like to use is: https://moz.com/top500/domains/csv
With the above code, this URL returns an error:
urllib2.HTTPError: HTTP Error 403: Forbidden
Based on this post, I can get a valid response by passing a request header:
import urllib2,cookielib
site= "https://moz.com/top500/domains/csv"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
req = urllib2.Request(site, headers=hdr)
try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print(e.fp.read())
content = page.read()
print(content)
Is there any way to use the web URL functionality of Pandas read_csv(), but also pass a request header to make the request go through?

I would recommend using the requests and io libraries for this task. The following code should do the job:
import pandas as pd
import requests
from io import StringIO
url = "https://moz.com:443/top500/domains/csv"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}
req = requests.get(url, headers=headers)
data = StringIO(req.text)
df = pd.read_csv(data)
print(df)
(If you want to add a custom header, just modify the headers variable.)
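If you fetch CSVs this way in more than one place, you can wrap the pattern in a small helper. This is only a sketch: the function name read_csv_with_headers is mine, not part of pandas or requests, and raise_for_status() is there so a 403/404 raises an exception instead of letting pandas try to parse an error page:
import pandas as pd
import requests
from io import StringIO

def read_csv_with_headers(url, headers=None, **read_csv_kwargs):
    # Hypothetical helper: fetch the CSV with custom headers, then parse it from memory.
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()  # surface HTTP errors (403, 404, 5xx) before parsing
    return pd.read_csv(StringIO(resp.text), **read_csv_kwargs)

df = read_csv_with_headers("https://moz.com/top500/domains/csv",
                           headers={"User-Agent": "Mozilla/5.0"})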
Hope this helps.

As of pandas 1.3.0, you can pass custom HTTP(S) headers using the storage_options argument:
url = "https://moz.com:443/top500/domains/csv"
hdr = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'
}
domains_df = pd.read_csv(url, storage_options=hdr)
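A hedged caveat: storage_options for plain HTTP(S) URLs only arrived in pandas 1.3.0, so on older versions the same call will either raise or fail to forward the headers. If your environment might run an older pandas, one option is to fall back to the requests approach from the answer above; the version check below is just a sketch of one way to guard it, reusing url and hdr from this answer:
import pandas as pd

# Take just the major/minor components of the version string.
major, minor = (int(x) for x in pd.__version__.split(".")[:2])

if (major, minor) >= (1, 3):
    domains_df = pd.read_csv(url, storage_options=hdr)
else:
    # Fallback for older pandas: fetch manually and parse from memory.
    import requests
    from io import StringIO
    resp = requests.get(url, headers=hdr)
    resp.raise_for_status()
    domains_df = pd.read_csv(StringIO(resp.text))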

Related

urllib.error.HTTPError: HTTP Error 403: Forbidden with headers

I have been scraping for a while, but I have never seen this problem resist the fixes I normally use. I am trying to scrape "https://www.coolbet.com/s/sbgate/sports/recommendations/turnover?country=NO&isLive=false&language=en&layout=EUROPEAN" but haven't been able to work around the "HTTP Error 403". Normally changing the headers does wonders, but not in this case.
I am starting to think it might be a cookie problem, but I'm open to suggestions and fixes.
from urllib.request import urlopen, Request
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
reg_url = "https://www.coolbet.com/s/sbgate/sports/fo-category/?categoryId=62&country=NO&isMobile=0&language=en&layout=EUROPEAN&limit=15"
req = Request(url=reg_url, headers=headers)
html = urlopen(req).read()
print(html)
I have also tried using the requests library (import requests).
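There is no accepted answer excerpted here, but following the asker's own cookie hypothesis, a minimal sketch would be to let a requests.Session visit the regular site first (so any cookies get set) and then hit the API endpoint with the same session. Whether Coolbet actually gates this endpoint on cookies is an assumption, not something the question confirms:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}
api_url = "https://www.coolbet.com/s/sbgate/sports/fo-category/?categoryId=62&country=NO&isMobile=0&language=en&layout=EUROPEAN&limit=15"

with requests.Session() as s:
    # Prime the session on the landing page so any cookies it sets are reused below.
    s.get("https://www.coolbet.com/", headers=headers)
    r = s.get(api_url, headers=headers)
    print(r.status_code)
    print(r.text[:500])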

Data missing on Python request (AJAX request)

I am trying to scrape historical weather data from this website:
http://www.hko.gov.hk/cis/dailyExtract_uc.htm?y=2016&m=1
After some reading on the AJAX call, I found the proper way to request data is through the following code:
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
headers = {
    'Accept': 'text/plain, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'keep-alive',
    'Host': 'www.hko.gov.hk',
    'Referer': 'http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2016&m=3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}
with requests.Session() as s:
    # request April 2015 weather data
    r = s.get(r"http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_201504.xml", verify=False, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    data = json.loads(soup.get_text())['stn']['data'][0]['dayData'][:-2]
    df = pd.DataFrame(data)
I noticed the data I retrieved does not contain the 3 columns on the right-hand side. What did I miss in the GET request?
It seems that if you request the entire year and then extract the month, the data is there:
import requests
import json
with requests.Session() as s:
    r = s.get(r"http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_2015.xml", headers={'User-Agent': 'Mozilla/5.0'}).json()
    print(r['stn']['data'][3]['dayData'][0])
Sorry guys, I have solved the issue; this was a silly question.
It turns out the older data has a different source than the recent data, and I got confused by the format.
Fix the request URL. Change:
http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_201504.xml
to
http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_2015.xml
Then you can grab the 4th element (or any other specific month) from the list data['stn']['data']:
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
headers = {
    'Accept': 'text/plain, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'keep-alive',
    'Host': 'www.hko.gov.hk',
    'Referer': 'http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2016&m=3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}
with requests.Session() as s:
    # request April 2015 weather data
    data = s.get(r"http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_2015.xml", verify=False, headers=headers).json()
    df = pd.DataFrame(data['stn']['data'][3]['dayData'])
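Since the payload is parsed from text, the dayData values will typically come back as strings. If you want numeric columns, a short hedged follow-up (assuming the columns are meant to be numbers, which the question does not confirm) would be:
# Coerce each column to numbers where possible; non-numeric cells become NaN.
df = df.apply(pd.to_numeric, errors='coerce')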

Python Requests returning 503 response

I am using Requests to fetch some data from a server, but I keep getting a 503 response. The request headers have cookies in them, but my method does not seem to be handling them properly.
I am also a bit confused as to what I should be doing with cookies, and when, full stop. The website is http://epgservices.sky.com/nevermiss/ and my code is below.
The headers and params look correct when compared with Google Dev Tools, except that the cookies are missing when I use Requests. Any ideas?
import json
import requests
from urllib3.util import Retry
from requests.adapters import HTTPAdapter
from requests import Session, exceptions
import re
import traceback
from cookielib import LWPCookieJar
class sky_ondemand:
    session = requests.Session()
    jar = session.cookies
    url = 'http://epgservices.sky.com'
    movie_path = ''.join(movie_path)
    headers = {
        'Host': 'epgservices.sky.com',
        'Connection': 'keep-alive',
        'Accept': 'application/json, text/javascript, */*',
        'X-Requested-With': 'XMLHttpRequest',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
        'Referer': 'http://epgservices.sky.com/never-miss/index.htm',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8'
    }
    params = {
        'queryType': 'movieQuery',
        'query': '',
        'exactMatch': 'false',
        'genre': '',
        'startsWith': 'all',
        'sortBy': 'requested',
        'pageNum': '1',
        'pageSize': '10',
        'src': 'movieLetterButton'
    }
    r = session.get(url, params=params, headers=headers, cookies=jar)
    data = r.content
    print(data)
Sorted this, if anyone is interested: it was nothing to do with the cookies. The URL should have been 'http://epgservices.sky.com/tvlistings-proxy/NeverMissProxy/neverMissMovieSearchRequest.json?'
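Putting that fix together with the question's own headers and params, a minimal sketch of the corrected request could look like this (the endpoint is the one named in the answer above; that it returns JSON is an assumption suggested by the .json suffix):
import requests

url = 'http://epgservices.sky.com/tvlistings-proxy/NeverMissProxy/neverMissMovieSearchRequest.json'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    'Referer': 'http://epgservices.sky.com/never-miss/index.htm'
}
params = {
    'queryType': 'movieQuery',
    'query': '',
    'exactMatch': 'false',
    'genre': '',
    'startsWith': 'all',
    'sortBy': 'requested',
    'pageNum': '1',
    'pageSize': '10',
    'src': 'movieLetterButton'
}

with requests.Session() as s:
    r = s.get(url, params=params, headers=headers)
    r.raise_for_status()
    print(r.json())  # assumes a JSON response, as the .json suffix suggests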

How to pass arguments for get method with urllib?

The response web page is as below when I select "title" and input "wordpress".
Here is my Python 3 code for passing arguments to a GET request with urllib:
import urllib.request
import urllib.parse
url = 'http://www.it-ebooks.info/'
values = {'q': 'wordpress','type': 'title'}
data = urllib.parse.urlencode(values).encode(encoding='utf-8',errors='ignore')
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0' }
request = urllib.request.Request(url=url, data=data,headers=headers,method='GET')
response = urllib.request.urlopen(request)
buff = response.read()
html = buff.decode("utf8")
print(html)
I can't get the desired output web page.
How do I pass arguments for a GET request with urllib in my example?
The data kwarg of urllib.request.Request is only used for POST requests as it modifies the request's body.
GET requests simply use URL parameters, so you should append these to the url:
params = '?q=wordpress&type=title'
url = 'http://www.it-ebooks.info/search/{}'.format(params)
You can of course take the time and generalize this into a generic function.
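For example, a hedged sketch of such a helper: the function name build_search_url and the /search/ path come from this answer and the question, not from any library, and urllib.parse.urlencode does the query-string encoding:
import urllib.parse
import urllib.request

def build_search_url(base, params):
    # Encode the parameters into a query string and append it to the base URL.
    return base + '?' + urllib.parse.urlencode(params)

url = build_search_url('http://www.it-ebooks.info/search/', {'q': 'wordpress', 'type': 'title'})
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
request = urllib.request.Request(url, headers=headers)  # no data kwarg, so this stays a GET
html = urllib.request.urlopen(request).read().decode('utf8')
print(html)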
It is better if you use the library called requests:
import requests
headers = {
    'DNT': '1',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'es-ES,es;q=0.8,en;q=0.6',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Referer': 'http://www.it-ebooks.info/',
    'Connection': 'keep-alive',
}
r = requests.get('http://www.it-ebooks.info/search/?q=wordpress&type=title', headers=headers)
print(r.content)
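Alternatively, requests can build the query string for you via its params argument, so you don't have to hard-code ?q=wordpress&type=title in the URL (a small sketch with a minimal headers dict):
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
params = {'q': 'wordpress', 'type': 'title'}
r = requests.get('http://www.it-ebooks.info/search/', params=params, headers=headers)
print(r.url)      # requests appends ?q=wordpress&type=title for you
print(r.content)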

Authentication Trouble with Python Requests [duplicate]

I want to scrape the PIN codes from "http://www.indiapost.gov.in/pin/", and I am doing it with the following code.
import urllib
import urllib2
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Origin': 'http://www.indiapost.gov.in',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://www.indiapost.gov.in/pin/',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}
viewstate = 'JulXDv576ZUXoVOwThQQj4bDuseXWDCZMP0tt+HYkdHOVPbx++G8yMISvTybsnQlNN76EX/...'
eventvalidation = '8xJw9GG8LMh6A/b6/jOWr970cQCHEj95/6ezvXAqkQ/C1At06MdFIy7+iyzh7813e1/3Elx...'
url = 'http://www.indiapost.gov.in/pin/'
formData = (
    ('__EVENTVALIDATION', eventvalidation),
    ('__EVENTTARGET', ''),
    ('__EVENTARGUMENT', ''),
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEENCRYPTED', ''),
    ('__EVENTVALIDATION', eventvalidation),
    ('txt_offname', ''),
    ('ddl_dist', '0'),
    ('txt_dist_on', ''),
    ('ddl_state', '2'),
    ('btn_state', 'Search'),
    ('txt_stateon', ''),
    ('hdn_tabchoice', '3')
)
from urllib import FancyURLopener
class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'
myopener = MyOpener()
encodedFields = urllib.urlencode(formData)
f = myopener.open(url, encodedFields)
print f.info()
try:
    fout = open('tmp.txt', 'w')
except:
    print('Could not open output file\n')
fout.writelines(f.readlines())
fout.close()
I am getting a response from the server saying "Sorry this site has encountered a serious problem, please try reloading the page or contact webmaster."
Please suggest where I am going wrong.
Where did you get the values of viewstate and eventvalidation? On one hand, they shouldn't end with "..."; you must have omitted something. On the other hand, they shouldn't be hard-coded.
One solution is like this:
1. Retrieve the page via the URL "http://www.indiapost.gov.in/pin/" without any form data.
2. Parse out the form values like __VIEWSTATE and __EVENTVALIDATION (you may make use of BeautifulSoup).
3. Get the search result (second HTTP request) by adding the vital form data from step 2.
UPDATE:
Following the above idea, I modified your code slightly to make it work:
import urllib
from bs4 import BeautifulSoup
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Origin': 'http://www.indiapost.gov.in',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://www.indiapost.gov.in/pin/',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}
class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'
myopener = MyOpener()
url = 'http://www.indiapost.gov.in/pin/'
# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f)
# parse and retrieve two vital form values
viewstate = soup.select("#__VIEWSTATE")[0]['value']
eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
formData = (
    ('__EVENTVALIDATION', eventvalidation),
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEENCRYPTED', ''),
    ('txt_offname', ''),
    ('ddl_dist', '0'),
    ('txt_dist_on', ''),
    ('ddl_state', '1'),
    ('btn_state', 'Search'),
    ('txt_stateon', ''),
    ('hdn_tabchoice', '1'),
    ('search_on', 'Search'),
)
encodedFields = urllib.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)
try:
    # Actually we'd better use BeautifulSoup once again to
    # retrieve the results (instead of writing out the whole HTML file).
    # Besides, since the result is split into multiple pages,
    # we need to send more HTTP requests.
    fout = open('tmp.html', 'w')
except:
    print('Could not open output file\n')
fout.writelines(f.readlines())
fout.close()
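As the comment above hints, instead of writing the whole HTML file out you could parse the second response with BeautifulSoup right away. A minimal sketch, reusing f and BeautifulSoup from the code above and assuming the results come back in an HTML table (the selector is a guess, not taken from the actual page markup):
soup = BeautifulSoup(f)
for row in soup.select('table tr'):
    cells = [td.get_text(strip=True) for td in row.select('td')]
    if cells:
        print(cells)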
