Checked the other answers for similar problems, but couldn't find anything that solved this particular problem. I can't figure out why I'm getting error, because I don't believe I'm missing any values. Also, I think it's odd that it says line 1 column 1 (char 0) - any of you wonderful people have any ideas?
import json
import urllib.request
user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
url = "http://api.usatoday.com/open/articles/topnews?encoding=json&api_key=98jv5a93qs"
headers={"User-Agent":user_agent,}
request = urllib.request.Request(url, None, headers)
parsed_json = json.loads(str(request))
for i in range(6):
title = parsed_json['stories'][i]['title']
link = parsed_json['stories'][i]['link']
print(title)
print(link)
print("-----------------------------------")
you are trying to parse the response JSON. but you didn't event sent the request.
you should send your Request and then parse the response JSON:
import json
import urllib.request
user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
url = "http://api.usatoday.com/open/articles/topnews?encoding=json&api_key=98jv5a93qs"
headers={"User-Agent":user_agent,}
request = urllib.request.Request(url, None, headers)
res = urllib.request.urlopen(request)
parsed_json = json.loads(res.readall())
for i in range(6):
title = parsed_json['stories'][i]['title']
link = parsed_json['stories'][i]['link']
print(title)
print(link)
print("-----------------------------------")
From what I've seen in both the docs (or v. 2) and at the URL above, the issue is that you are trying to parse JSON which is not JSON. I suggest wrapping your call to json.loads in a try... except block and handle bad JSON. This is generally good practice anyway.
For good measure I looked up the source code for the json module. It looks like all errors from Py2k point to value errors, thought I could not find the specific error you mention.
Based on my read of the JSON module, you'll also be able to get more information if you use try...except and print the properties of the error module as well.
Related
Hello I am trying to scrape this url : https://www.instagram.com/cristiano/?__a=1 but I get a Value Error
url_user = "https://www.instagram.com/cristiano/?__a=1"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = get(url_user,headers=headers)
print(response) # 200
html_soup = BeautifulSoup(response.content, 'html.parser')
# print(html_soup)
jsondata=json.loads(str(html_soup))
ValueError: No JSON object could be decoded
Any idea why I get this error?
The reason you're getting the error is because you're trying to parse a JSON response as if it was HTML. You don't need BeautifulSoup for that.
Try this:
import json
import requests
url_user = "https://www.instagram.com/cristiano/?__a=1"
d = json.loads(requests.get(url_user).text)
print(d)
However, best practice suggests to use .json() from requests, as it'll do a better job of figuring out the encoding used.
import requests
url_user = "https://www.instagram.com/cristiano/?__a=1"
d = requests.get(url_user).json()
print(d)
You might be getting non-200 HTTP Status Code, which means that server responded with error, e.g. server might have banned your IP for frequent requests. requests library doesn't throw any errors for that. To control erroneous status codes insert after get(...) line this code:
response.raise_for_status()
Also it is enough just to do jsondata = response.json(). requests library can parse json this way without need for beautiful soup. Easy to read tutorial about main requests library features is located here.
Also if there is some parsing problem save binary content of response to file to attach it to question like this:
with open('response.dat', 'wb') as f:
f.write(response.content)
I used urllib.request.Request for the url of a memidex.com page, but the urllib.request.urlopen(url) line goes on to fail to open the url.
url = urllib.request.Request("http://www.memidex.com/" + term)
my_request = urllib.request.urlopen(url)
info = BeautifulSoup(my_request, "html.parser")
I've tried using the same code for a different website and it worked for that one so I have no idea why it's not working for memidex.com.
You need to add headers to your url request in order to overcome the error. BTW 'HTTP Error 403: Forbidden' was your error right?
Hope the below code helps you.
import urllib.request
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "http://www.memidex.com/"
headers={'User-Agent':user_agent,}
request=urllib.request.Request(url,None,headers)
response = urllib.request.urlopen(request)
data = response.read()
print(data)
My question is about the urllib module in python 3. The following piece of code
import urllib.request
import urllib.parse
url = "https://google.com/search?q=stackoverflow"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
try:
req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)
file = open('googlesearch.txt.', 'w')
file.write(str(resp.read()))
file.close()
except Exception as e:
print(str(e))
works as I expect and writes the content of the google search 'stackoverflow' in a file. We need to set a valid User-Agent, otherwise google does not allow the request and returns a 405 Invalid Method error.
I think the following piece of code
import urllib.request
import urllib.parse
url = "https://google.com/search"
values = {'q': 'stackoverflow'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
try:
req = urllib.request.Request(url, data=data, headers=headers)
resp = urllib.request.urlopen(req)
file = open('googlesearch.txt.', 'w')
file.write(str(resp.read()))
file.close()
except Exception as e:
print(str(e))
should produce the same output as the first one, as it is the same google search with the same User-Agent. However, this piece of code throws an exception with message: 'HTTP Error 405: Method Not Allowed'.
My question is: what is wrong with the second piece of code? Why does it not produce the same output as the first one?
You get the 405 response because you are sending a POST request instead of a GET request. Method not allowed should not have anything to do with your user-agent header. It's about sending a http request with a incorrect method (get, post, put, head, options, patch, delete).
Urllib sends a POST because you include the data argument in the Request constructor as is documented here:
https://docs.python.org/3/library/urllib.request.html#urllib.request.Request
method should be a string that indicates the HTTP request method that will be used (e.g. 'HEAD'). If provided, its value is stored in the method attribute and is used by get_method(). The default is 'GET' if data is None or 'POST' otherwise.
It's highly recommended to use the requests library instead of urllib, because it has a much more sensible api.
import requests
response = requests.get('https://google.com/search', {'q': 'stackoverflow'})
response.raise_for_status() # raise exception if status code is 4xx or 5xx
with open('googlesearch.txt', 'w') as fp:
fp.write(response.text)
https://github.com/requests/requests
https://docs.python.org/3.4/howto/urllib2.html#data
If you do not pass the data argument, urllib uses a GET request. One
way in which GET and POST requests differ is that POST requests often
have “side-effects”: they change the state of the system in some way
(for example by placing an order with the website for a hundredweight
of tinned spam to be delivered to your door).
I tried every 'User-Agent' in here, still I get urllib.error.HTTPError: HTTP Error 400: Bad Request. I also tried this, but I get urllib.error.URLError: File Not Found. I have no idea what to do, my current codes are;
from bs4 import BeautifulSoup
import urllib.request,json,ast
with open ("urller.json") as f:
cc = json.load(f) #the file I get links, you can try this link instead of this
#cc = ../games/index.php?g_id=23521&game=0RBITALIS
for x in ast.literal_eval(cc): #cc is a str(list) so I have to convert
if x.startswith("../"):
r = urllib.request.Request("http://www.game-debate.com{}".format(x[2::]),headers={'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'})
#x[2::] because I removed '../' parts from urlls
rr = urllib.request.urlopen(r).read()
soup = BeautifulSoup(rr)
for y in soup.find_all("ul",attrs={'class':['devDefSysReqList']}):
print (y.text)
Edit: If you try only 1 link probably it won't show any error, since I get the error every time at 6th link.
A quick fix is to replace the space with +:
url = "http://www.game-debate.com"
r = urllib.request.Request(url + x[2:] ,headers={'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'})
A better option may be to let urllib quote the params:
from bs4 import BeautifulSoup
import urllib.request,json,ast
from urllib.parse import quote, urljoin
with open ("urller.json") as f:
cc = json.load(f) #the file I get links, you can try this link instead of this
url = "http://www.game-debate.com"
for x in ast.literal_eval(cc): # cc is a str(list) so I have to convert
if x.startswith("../"):
r = urllib.request.Request(urljoin(url, quote(x.lstrip("."))), headers={
'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'})
rr = urllib.request.urlopen(r).read()
soup = BeautifulSoup(rr)
print(rr.decode("utf-8"))
for y in soup.find_all("ul", attrs={'class':['devDefSysReqList']}):
print (y.text)
Spaces in a url are not valid and need to be percent encoded as %20 or replaced with +.
For some reasons this part where I fetch JSON data from following url will only works sometimes. And sometimes it will return 404 error, and complain about missing header attribute. It will work 100% of the time if I paste it onto a web browser. So I'm sure the link is not broken or something.
I get the following error in Python:
AttributeError: 'HTTPError' object has no attribute 'header'
What's the reason for this and can it be fixed?
Btw I removed API key since it is private.
try:
url = "http://api.themoviedb.org/3/search/person?api_key=API-KEY&query=natalie+portman"
header = { 'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.16) Gecko/20110319 Firefox/3.6.16' }
req = urllib2.Request(url, None, header)
f = urllib2.urlopen(req)
except urllib2.HTTPError, e:
print e.code
print e.msg
print e.header
print e.fp.read()
As is documented here, you need to explicitly accept JSON. Just add the second line after the first.
req = urllib2.Request(url, None, header)
req.add_header('Accept', 'application/json')