I tried every 'User-Agent' listed here, but I still get urllib.error.HTTPError: HTTP Error 400: Bad Request. I also tried this, but I get urllib.error.URLError: File Not Found. I have no idea what to do; my current code is:
from bs4 import BeautifulSoup
import urllib.request, json, ast

with open("urller.json") as f:
    cc = json.load(f)  # the file I get the links from; you can try the link below instead
    # cc = ../games/index.php?g_id=23521&game=0RBITALIS

for x in ast.literal_eval(cc):  # cc is a str(list), so I have to convert it
    if x.startswith("../"):
        r = urllib.request.Request("http://www.game-debate.com{}".format(x[2:]),
                                   headers={'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'})
        # x[2:] because I removed the '../' part from the urls
        rr = urllib.request.urlopen(r).read()
        soup = BeautifulSoup(rr)
        for y in soup.find_all("ul", attrs={'class': ['devDefSysReqList']}):
            print(y.text)
Edit: If you try only one link, it probably won't show any error; I get the error every time at the 6th link.
A quick fix is to replace the space with +:

url = "http://www.game-debate.com"
r = urllib.request.Request(url + x[2:].replace(" ", "+"),
                           headers={'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'})
A better option may be to let urllib quote the params:
from bs4 import BeautifulSoup
import urllib.request, json, ast
from urllib.parse import quote, urljoin

with open("urller.json") as f:
    cc = json.load(f)

url = "http://www.game-debate.com"

for x in ast.literal_eval(cc):  # cc is a str(list), so I have to convert it
    if x.startswith("../"):
        r = urllib.request.Request(urljoin(url, quote(x.lstrip("."))), headers={
            'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'})
        rr = urllib.request.urlopen(r).read()
        soup = BeautifulSoup(rr)
        print(rr.decode("utf-8"))  # optional: dump the raw page while debugging
        for y in soup.find_all("ul", attrs={'class': ['devDefSysReqList']}):
            print(y.text)
Spaces in a URL are not valid; they need to be percent-encoded as %20 or replaced with +.
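For instance, a minimal sketch (the path is a made-up example) of how quote percent-encodes spaces while leaving path separators alone:

from urllib.parse import quote

# quote() percent-encodes unsafe characters; '/' is in its safe set by default
print(quote("games/some game name/index.html"))
# -> games/some%20game%20name/index.html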
I used urllib.request.Request for the URL of a memidex.com page, but the urllib.request.urlopen(url) line then fails to open the URL.
url = urllib.request.Request("http://www.memidex.com/" + term)
my_request = urllib.request.urlopen(url)
info = BeautifulSoup(my_request, "html.parser")
I've tried using the same code for a different website and it worked for that one, so I have no idea why it's not working for memidex.com.
You need to add headers to your URL request to overcome the error. By the way, 'HTTP Error 403: Forbidden' was your error, right?
I hope the code below helps you.
import urllib.request

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "http://www.memidex.com/"
headers = {'User-Agent': user_agent}
request = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(request)
data = response.read()
print(data)
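Once that works, the fetched bytes can go straight into BeautifulSoup, as in your original snippet (a minimal sketch, reusing the data variable from above):

from bs4 import BeautifulSoup

# data is the bytes returned by response.read() above
info = BeautifulSoup(data, "html.parser")
print(info.title)  # quick check that the page actually parsed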
How can I choose the location where the downloaded file is stored? My code:
import urllib.request

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': user_agent}
request = urllib.request.Request("http://download.thinkbroadband.com/5MB.zip", None, headers)
response = urllib.request.urlopen(request)
data = response.read()
You're almost there. You've already got the data; now just write it wherever you want:
# a context manager closes the file even if the write fails
with open(where_you_want_to_store_the_data, "wb") as ofile:
    ofile.write(data)
For a cleaner way you can use urlretrieve:

from urllib.request import urlretrieve

urlretrieve(url, "/path/to/something.txt")
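Note that urlretrieve does not take a headers argument, so the custom User-Agent from above would be lost. One option (a sketch, reusing the user_agent variable defined earlier) is to install a global opener first, since urlretrieve goes through it:

import urllib.request

# the installed opener supplies the User-Agent for urlretrieve as well
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', user_agent)]
urllib.request.install_opener(opener)

urllib.request.urlretrieve("http://download.thinkbroadband.com/5MB.zip", "5MB.zip")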
I checked the other answers for similar problems but couldn't find anything that solved this particular one. I can't figure out why I'm getting the error, because I don't believe I'm missing any values. Also, I think it's odd that it says line 1 column 1 (char 0). Do any of you wonderful people have any ideas?
import json
import urllib.request

user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
url = "http://api.usatoday.com/open/articles/topnews?encoding=json&api_key=98jv5a93qs"
headers = {"User-Agent": user_agent}
request = urllib.request.Request(url, None, headers)
parsed_json = json.loads(str(request))

for i in range(6):
    title = parsed_json['stories'][i]['title']
    link = parsed_json['stories'][i]['link']
    print(title)
    print(link)
    print("-----------------------------------")
You are trying to parse the response JSON, but you never actually sent the request.
You should send your Request and then parse the response JSON:
import json
import urllib.request

user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
url = "http://api.usatoday.com/open/articles/topnews?encoding=json&api_key=98jv5a93qs"
headers = {"User-Agent": user_agent}
request = urllib.request.Request(url, None, headers)
res = urllib.request.urlopen(request)
parsed_json = json.loads(res.read().decode("utf-8"))

for i in range(6):
    title = parsed_json['stories'][i]['title']
    link = parsed_json['stories'][i]['link']
    print(title)
    print(link)
    print("-----------------------------------")
From what I've seen in both the docs (or v. 2) and at the URL above, the issue is that you are trying to parse something which is not JSON. I suggest wrapping your call to json.loads in a try...except block and handling bad JSON; this is generally good practice anyway.
For good measure I looked up the source code for the json module. It looks like all errors from the Py2k version raise ValueError, though I could not find the specific error you mention.
Based on my read of the json module, you'll also be able to get more information if you use try...except and print the properties of the error object as well.
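A minimal sketch of that guard (raw_text is a stand-in for whatever string you pass to json.loads):

import json

raw_text = "not json at all"  # stand-in for the response body

try:
    parsed_json = json.loads(raw_text)
except json.JSONDecodeError as e:  # a subclass of ValueError
    # the exception records exactly where parsing failed
    print("Bad JSON:", e.msg, "at line", e.lineno, "column", e.colno)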
I'm trying to scrape Google headlines for a given keyword (e.g. Blackrock) over a given period (e.g. 7-Jan-2012 to 14-Jan-2012).
I'm trying to do this by constructing the URL and then using urllib2, as shown in the code below. If I put the constructed URL in a browser, it gives me the correct result; however, if I use it through Python, I get news results for the right keyword but for the current period.
Here's the code. Can someone tell me what I'm doing wrong and how I can correct it?
import urllib
import urllib2
import json
from bs4 import BeautifulSoup
import requests
url = 'https://www.google.com/search?q=Blackrock&hl=en&gl=uk&authuser=0&source=lnt&tbs=cdr%3A1%2Ccd_min%3A7%2F1%2F2012%2Ccd_max%3A14%2F1%2F2012&tbm=nws'
req = urllib2.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3 Gecko/2008092417 Firefox/3.0.3')
response = urllib2.urlopen(req)
html = response.read()
soup = BeautifulSoup(html)
text = soup.text
start = text.index('000 results')+11
end = text.index('NextThe selection')
text = text[start:end]
print text
The problem is with your user-agent; it works for me with:
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36')
You are using a user-agent for Firefox 3, which is about 6 years old.
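As a side note, those tbs date parameters don't have to be percent-encoded by hand. A sketch using urlencode (shown with Python 3's urllib.parse; urllib's urlencode behaves the same way in Python 2):

from urllib.parse import urlencode

# urlencode percent-encodes the ':' ',' and '/' characters in the tbs value,
# reproducing the cdr%3A1%2Ccd_min%3A... form from the URL above
params = {
    'q': 'Blackrock',
    'hl': 'en',
    'gl': 'uk',
    'tbs': 'cdr:1,cd_min:7/1/2012,cd_max:14/1/2012',
    'tbm': 'nws',
}
print('https://www.google.com/search?' + urlencode(params))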
I'm looking for a quick way to get an HTTP response code from a URL: if the code is 200, then download the images. Can I get the response code with MyOpener? Thanks.
from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

myopener = MyOpener()
myopener.retrieve('http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg', 'Zindagi1976.jpg')
UPDATE:
>>> import urllib
>>> resp = urllib.urlopen("http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg")
>>> print resp.getcode()
403
What's wrong with this? Or did I get your question wrong?
>>> import urllib
>>> resp = urllib.urlopen("http://docs.python.org/library/urllib.html")
>>> if resp.getcode() == 200:
... print "do my stuff"
...
do my stuff
>>>
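If you only need the status code and not the body, a HEAD request avoids downloading the file just to inspect it. A sketch in Python 3 (the method argument exists since 3.3; note that urlopen raises HTTPError for 4xx/5xx responses):

import urllib.request

req = urllib.request.Request('http://docs.python.org/library/urllib.html',
                             method='HEAD')  # ask for headers only, no body
with urllib.request.urlopen(req) as resp:
    print(resp.getcode())  # 200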
It's nice that you have worked your way around the problem, but there is a reason that Wikimedia gives 403 as a response code: as soon as you send a request for Wikimedia content, it realizes the request was not sent by a browser, so it throws a 403 error.
Websites do this type of checking to make sure content is not being accessed by bots. There are many such checks, and the User-Agent is one of them.
So, to make it look like a browser is sending the request, you can add a User-Agent to your Python code.
>>> import urllib2
>>> req = urllib2.Request('http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg')
>>> useragent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
>>> req.add_header('User-Agent',useragent)
>>> resp = urllib2.urlopen(req)
>>> resp.getcode()
200
>>> data = resp.read()
>>> with open("image.jpg","wb") as f:
... f.write(data)
...
>>>
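For completeness, a sketch of the same fix in Python 3, where urllib2 became urllib.request:

import urllib.request

useragent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
req = urllib.request.Request('http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg')
req.add_header('User-Agent', useragent)
with urllib.request.urlopen(req) as resp:
    print(resp.getcode())  # 200 once the User-Agent is set
    with open("image.jpg", "wb") as f:
        f.write(resp.read())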