urllib.request.urlopen not working for a specific website - python

I used urllib.request.Request for the URL of a memidex.com page, but the urllib.request.urlopen(url) line then fails to open the URL.
url = urllib.request.Request("http://www.memidex.com/" + term)
my_request = urllib.request.urlopen(url)
info = BeautifulSoup(my_request, "html.parser")
I've tried using the same code for a different website, and it worked there, so I have no idea why it's not working for memidex.com.

You need to add headers to your URL request in order to overcome the error. BTW, 'HTTP Error 403: Forbidden' was your error, right?
Hope the below code helps you.
import urllib.request

# Sites like this reject the default Python user agent, so send a browser-like one.
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "http://www.memidex.com/"
headers = {'User-Agent': user_agent}
request = urllib.request.Request(url, None, headers)
response = urllib.request.urlopen(request)
data = response.read()
print(data)
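If you want to confirm which status code you were actually hitting, you can catch the error urllib raises; a minimal sketch:

import urllib.request
import urllib.error

try:
    urllib.request.urlopen("http://www.memidex.com/")
except urllib.error.HTTPError as err:
    # Without a browser-like User-Agent this prints: 403 Forbidden
    print(err.code, err.reason)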

Related

How do I know what user-agent I should use? (404 Client Error: Not Found for url)

I'm trying to download data from a specific URL and I get a "404 Client Error: Not Found for url" error. The website I'm trying to access is an FTP server of a university.
From searching the web I understand that a user-agent must be configured, but even after configuring one I still get the same error...
The URL I'm trying to access is this- https://idcftp.files.com/files/Users%20Folders/yoav.yair/WWLLN%20Data/December2018/
(You need a password to access it, but this is information I can't give).
The code I'm trying to use is this-
import requests
from requests.auth import HTTPBasicAuth
from bs4 import BeautifulSoup

def get_url_paths(url, header_list, ext='', params={}):
    response = requests.get(url, params=params, headers=header_list,
                            auth=HTTPBasicAuth('<user_name>', '<password>'))
    # response = requests.get(url, params=params)
    if response.ok:
        response_text = response.text
    else:
        return response.raise_for_status()
    soup = BeautifulSoup(response_text, 'html.parser')
    parent = [url + node.get('href') for node in soup.find_all('a')
              if node.get('href').endswith(ext)]
    return parent

def main():
    # url = 'https://www.ncei.noaa.gov/data/total-solar-irradiance/access/monthly/'
    header_list = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
    url = 'https://idcftp.files.com/files/Users%20Folders/yoav.yair/WWLLN%20Data/December2018/'
    ext = 'mat'
    result = get_url_paths(url, header_list, ext)
    for file in result:
        f_name = file[-19:-13]
        print(f_name)

if __name__ == '__main__':
    main()
I've tried using all kinds of user agents, but nothing works. How can I find what user agent this website uses?
Thank you,
Karin.
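One way to narrow this down is to inspect the failed response itself: a 404 means the server could not find the resource, which usually points at the URL or the login rather than the User-Agent. A minimal debugging sketch, with placeholder credentials:

import requests
from requests.auth import HTTPBasicAuth

url = 'https://idcftp.files.com/files/Users%20Folders/yoav.yair/WWLLN%20Data/December2018/'
r = requests.get(url, auth=HTTPBasicAuth('<user_name>', '<password>'))
print(r.status_code)             # 404 means "not found", not "forbidden"
print(r.headers.get('Server'))   # sometimes hints at what is serving the page
print(r.text[:300])              # the error page often explains the problem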

Web Scraping with urllib and fixing 403: Forbidden

I used this code to get a web page and it worked well, but now it doesn't work.
I've tried many headers but I still get a 403 error. This code works for most sites, but I can't get some pages, for example this one:
import urllib.request

def get_page(addr):
    headers = {}
    headers['User-Agent'] = "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:48.0) Gecko/20100101 Firefox/48.0"
    req = urllib.request.Request(addr, headers=headers)
    html = urllib.request.urlopen(req).read()
    return str(html)
Try Selenium:
from selenium import webdriver
import os
# initialise browser
browser = webdriver.Chrome(os.getcwd() + '/chromedriver')
browser.get('https://www.fragrantica.com/perfume/Victorio-Lucchino/No-4-Evasion-Exotica-50418.html')
# get page html
html = browser.page_source
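The positional chromedriver path above works with older Selenium releases; with Selenium 4.6+ the driver is located for you, and you can run headless. A minimal sketch along those lines (assuming Selenium 4.6+ is installed):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')       # run without opening a browser window
browser = webdriver.Chrome(options=options)  # Selenium Manager finds the driver
browser.get('https://www.fragrantica.com/perfume/Victorio-Lucchino/No-4-Evasion-Exotica-50418.html')
html = browser.page_source
browser.quit()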

Python - How do I wait for server response using requests

I use the following code to retrieve a web page.
import requests

payload = {'name': temp}  # I extract temp from another page.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:49.0) Gecko/20100101 Firefox/49.0',
           'Accept': 'text/html, */*; q=0.01',
           'Accept-Language': 'en-US,en;q=0.5',
           'X-Requested-With': 'XMLHttpRequest'}
full_url = url.rstrip() + '/test/log?'
r = requests.get(full_url, params=payload, headers=headers, stream=True)
for line in r.iter_lines():
    if line:
        print(line)
However, for some reason the HTTP response is missing the text inside the tags.
I found out that if I send the request to Burp, intercept it, and wait 3 seconds before forwarding it, then I get the complete HTML page containing the text inside the tags....
I still could not find the cause. Ideas?
From the requests documentation:
By default, when you make a request, the body of the response is
downloaded immediately. You can override this behaviour and defer
downloading the response body until you access the Response.content
attribute with the stream parameter:
Body Content Workflow
In other words, try removing stream=True from your requests.get() call,
or
access r.content (where r is the response); you will have all the content once you do.
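Both options in a minimal sketch (full_url, payload, and headers as defined in the question):

import requests

# Option 1: drop stream=True; the body is downloaded before .get() returns.
r = requests.get(full_url, params=payload, headers=headers)
print(r.text)

# Option 2: keep stream=True, but touch r.content to force the full download.
r = requests.get(full_url, params=payload, headers=headers, stream=True)
body = r.content  # reads the remaining bytes of the response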

ValueError: Expecting value: line 1 column 1 (char 0)

Checked the other answers for similar problems, but couldn't find anything that solved this particular problem. I can't figure out why I'm getting the error, because I don't believe I'm missing any values. Also, I think it's odd that it says line 1 column 1 (char 0) - any of you wonderful people have any ideas?
import json
import urllib.request

user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
url = "http://api.usatoday.com/open/articles/topnews?encoding=json&api_key=98jv5a93qs"
headers = {"User-Agent": user_agent}
request = urllib.request.Request(url, None, headers)
parsed_json = json.loads(str(request))

for i in range(6):
    title = parsed_json['stories'][i]['title']
    link = parsed_json['stories'][i]['link']
    print(title)
    print(link)
    print("-----------------------------------")
You are trying to parse the response as JSON, but you never actually sent the request.
You should send your Request and then parse the response JSON:
import json
import urllib.request

user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
url = "http://api.usatoday.com/open/articles/topnews?encoding=json&api_key=98jv5a93qs"
headers = {"User-Agent": user_agent}
request = urllib.request.Request(url, None, headers)
res = urllib.request.urlopen(request)
parsed_json = json.loads(res.read().decode('utf-8'))  # read() the response, then parse

for i in range(6):
    title = parsed_json['stories'][i]['title']
    link = parsed_json['stories'][i]['link']
    print(title)
    print(link)
    print("-----------------------------------")
From what I've seen in both the docs (or v. 2) and at the URL above, the issue is that you are trying to parse JSON which is not JSON. I suggest wrapping your call to json.loads in a try...except block and handling bad JSON. This is generally good practice anyway.
For good measure I looked up the source code for the json module. It looks like all errors from Py2k point to value errors, though I could not find the specific error you mention.
Based on my read of the json module, you'll also be able to get more information if you use try...except and print the properties of the error object as well.
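A minimal sketch of that pattern (raw_text stands in for whatever body the server returned):

import json

raw_text = '<html>not json</html>'  # stand-in for the response body

try:
    parsed_json = json.loads(raw_text)
except ValueError as err:  # json.JSONDecodeError is a ValueError subclass in Python 3.5+
    print("Response was not valid JSON:", err)
    print("First 200 characters:", raw_text[:200])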

Why isn't my Python program working? It uses HTTP Headers

EDIT: I changed the code and it still doesn't work! I used the links from the answer to do it, but it didn't work!
Why does this not work? When I run it, it takes a long time and never finishes!
import urllib
import urllib2
url = 'https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction'
values = {'inUserName': 'USER',
          'inUserPass': 'PASSWORD'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
req.add_header('Host', 'www.locationary.com')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
req.add_header('Accept-Language', 'en-us,en;q=0.5')
req.add_header('Accept-Encoding','gzip, deflate')
req.add_header('Accept-Charset','ISO-8859-1,utf-8;q=0.7,*;q=0.7')
req.add_header('Connection','keep-alive')
req.add_header('Referer','http://www.locationary.com/')
req.add_header('Cookie','site_version=REGULAR; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; locaCountry=1033; locaState=1795; locaCity=Montreal; jforumUserId=1; PMS=1; TurnOFfTips=true; Locacookie=enable; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; PMS=1; __utmb=47547066.15.10.1324693472; __utmc=47547066; JSESSIONID=DC7F5AB08264A51FBCDB836393CB16E7; PSESSIONID=28b334905ab6305f7a7fe051e83857bc280af1a9; __utmc=47547066; __utmb=47547066.15.10.1324693472; ACTION_RESULT_CODE=ACTION_RESULT_FAIL; ACTION_ERROR_TEXT=java.lang.NullPointerException')
req.add_header('Content-Type','application/x-www-form-urlencoded')
#user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
#headers = { 'User-Agent' : user_agent }
response = urllib2.urlopen(req)
page = response.read()
print page
The remote server (the one at www.locationary.com) is waiting for the content of your HTTP post request, based on the Content-Type and Content-Length headers. Since you're never actually sending said awaited data, the remote server waits — and so does read() — until you do so.
I need to know how to send the content of my http post request.
Well, you need to actually send some data in the request. See:
urllib2 - The Missing Manual
How do I send a HTTP POST value to a (PHP) page using Python?
Final, "working" version:
import urllib
import urllib2
url = 'https://www.locationary.com/index.jsp?ACTION_TOKEN=tile_loginBar_jsp$JspView$LoginAction'
values = {'inUserName': 'USER',
          'inUserPass': 'PASSWORD'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
req.add_header('Host', 'www.locationary.com')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
req.add_header('Accept-Language', 'en-us,en;q=0.5')
req.add_header('Accept-Charset','ISO-8859-1,utf-8;q=0.7,*;q=0.7')
req.add_header('Connection','keep-alive')
req.add_header('Referer','http://www.locationary.com/')
req.add_header('Cookie','site_version=REGULAR; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; locaCountry=1033; locaState=1795; locaCity=Montreal; jforumUserId=1; PMS=1; TurnOFfTips=true; Locacookie=enable; __utma=47547066.1079503560.1321924193.1322707232.1324693472.36; __utmz=47547066.1321924193.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); nickname=jacob501; PMS=1; __utmb=47547066.15.10.1324693472; __utmc=47547066; JSESSIONID=DC7F5AB08264A51FBCDB836393CB16E7; PSESSIONID=28b334905ab6305f7a7fe051e83857bc280af1a9; __utmc=47547066; __utmb=47547066.15.10.1324693472; ACTION_RESULT_CODE=ACTION_RESULT_FAIL; ACTION_ERROR_TEXT=java.lang.NullPointerException')
req.add_header('Content-Type','application/x-www-form-urlencoded')
response = urllib2.urlopen(req)
page = response.read()
print page
Don't explicitly set the Content-Length header
Remove the req.add_header('Accept-Encoding','gzip, deflate') line, so that the response doesn't have to be decompressed (or — exercise left to the reader — ungzip it yourself)
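If you would rather keep the Accept-Encoding header and ungzip the body yourself, a minimal sketch (replacing the plain response.read() above; req is the request built earlier):

import gzip
import io

response = urllib2.urlopen(req)
raw = response.read()
# The body is compressed only if the server honoured Accept-Encoding: gzip.
if response.info().get('Content-Encoding') == 'gzip':
    raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
print raw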

Categories

Resources