404 Not Found when using urllib.request.urlretrieve - python

I get a response from an external service. I then need to extract a URL from this response and download a file from that URL.
When I pass the URL taken from response.text to urlretrieve, urlretrieve raises an error.
But when I manually copy the URL and assign it to a variable in Python, e.g. url = 'https://My_service_site.com/9a57v4db5_2023-02-14.csv.gz',
urlretrieve works fine and downloads the file to my computer from this link.
import urllib.request
import requests
response = requests.post(url, json=payload, headers=headers)
# Method 1: raises an error
url = response.text[17:-2]  # gives a link like 'https://my_provide_name.com/csv_exports/5704d5.csv.gz'
urlrtv = urllib.request.urlretrieve(url=url, filename='C:\\Users\\UserName\\Downloads\\test4.csv.gz')
>>return error: HTTP Error 404: Not Found
# Method 2: works fine
url2 = 'https://my_provide_name.com/csv_exports/5704d5.csv.gz'
urlrtv = urllib.request.urlretrieve(url=url2, filename='C:\\Users\\UserName\\Downloads\\test4.csv.gz')
>>works fine
When I copy the URL from method 1 and paste it into a browser, it works fine.
Edit:
To be more precise, I also tried extracting the URL without the response.text[17:-2] slice, using json.loads to parse it from the response instead. But I still got the error.
a = json.loads(response.text)
>>{'csv_file_url': 'https://service_name.com/csv_exports/746d6.csv.gz'}
url = a['csv_file_url']
print(url)
>>https://service_name.com/csv_exports/746d6.csv.gz
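Solved: just add time.sleep(3) before downloading the file. Presumably the export was not yet available at the URL immediately after the service responded, which would explain why the same URL worked when pasted manually a few moments later.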
url = response.json()['csv_file_url']
time.sleep(3)
urlrtv = urllib.request.urlretrieve(url=url, filename=f'{storage_path}{filename}')
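A fixed sleep works, but the right delay is a guess. As a rough sketch, assuming the 404 only means "file not ready yet", the download can be retried a few times instead. The helper name download_with_retry and its attempts, delay and retrieve parameters are made up for illustration; they are not part of urllib.

```python
import time
import urllib.error
import urllib.request


def download_with_retry(url, filename, attempts=5, delay=1.0,
                        retrieve=urllib.request.urlretrieve):
    """Call retrieve(url, filename), retrying while the server answers 404.

    Hypothetical helper: attempts, delay and the injectable retrieve
    callable are illustrative choices, not part of urllib.
    """
    for attempt in range(1, attempts + 1):
        try:
            return retrieve(url, filename)
        except urllib.error.HTTPError as err:
            # Only a 404 is treated as "not ready yet"; other errors propagate.
            # The last failed attempt re-raises instead of sleeping again.
            if err.code != 404 or attempt == attempts:
                raise
            time.sleep(delay)
```

Usage would look like download_with_retry(url, f'{storage_path}{filename}', attempts=5, delay=1.0). If the service documents a "ready" status of its own, polling that is likely more reliable than retrying on 404.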

Related

how to download this pdf link python - 302 status code

I am trying to download a PDF file. I use requests. Here is the code.
url = 'https://unistream.ru/upload/iblock/230/230283e15180d590198137eba4e70644.PDF'
r = requests.get(url, allow_redirects=False)
pdf_url = r.url
with open('C:\\Users\\piolv\\Desktop\\pdf_folder\\work_file.PDF', 'wb') as f:
    f.write(r.content)
    print(r.content)
This link opens in the browser, but I cannot download the file with the code above. As I found out, the site does a redirect (status code 302). But with allow_redirects=False I can't get the file to download. What am I doing wrong? Where is the mistake? Thanks
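One thing worth trying: with allow_redirects=False, r.content is the body of the 302 response itself, not the file behind it. Below is a minimal sketch that lets requests follow the redirect, which is its default for GET. The function name download_following_redirects and the timeout value are assumptions for illustration, and whether this particular site then serves the PDF is not something the sketch can guarantee.

```python
import requests


def download_following_redirects(url, path, timeout=30):
    """Download url to path, letting requests follow any 3xx redirects."""
    # allow_redirects defaults to True for GET; it is spelled out for clarity.
    r = requests.get(url, allow_redirects=True, timeout=timeout)
    r.raise_for_status()
    with open(path, "wb") as f:
        f.write(r.content)
    # r.history holds the intermediate redirect responses, if any.
    return [resp.status_code for resp in r.history]
```

The returned list shows which redirects were followed, and r.url (inside the function) would be the final address the file was fetched from.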

How to get the response status code from a GET request?

Hi, I am very new to Python programming. I'm trying to write a Python script that gets a status code using a GET request. I am able to do it for a single URL, but how do I do it for multiple URLs in a single script?
Here is the basic code I have written, which gets the response code from one URL.
import requests
import json
import jsonpath
#API URL
url = "https://reqres.in/api/users?page=2"
#Send Get Request
response = requests.get(url)
if response:
    print('Response OK')
else:
    print('Response Failed')
# Display Response Content
print(response.content)
print(response.headers)
#Parse response to json format
json_response = json.loads(response.text)
print(json_response)
#Fetch value using Json Path
pages = jsonpath.jsonpath(json_response,'total_pages')
print(pages[0])
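Try this code.
import requests
with open("list_urls.txt") as f:
    for line in f:
        url = line.strip()  # drop the trailing newline read from the file
        if not url:
            continue  # skip blank lines
        response = requests.get(url)
        print("The url is", url, "and status code is", response.status_code)
I hope this helps.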
You can access the status code with response.status_code.
You can put your code in a function like this:
def treat_url(url):
    response = requests.get(url)
    if response:
        print('Response OK')
    else:
        print('Response Failed')
    # Display Response Content
    print(response.content)
    print(response.headers)
    # Parse response to json format
    json_response = json.loads(response.text)
    print(json_response)
    # Fetch value using Json Path
    pages = jsonpath.jsonpath(json_response, 'total_pages')
    print(pages[0])
And have a list of URLs and iterate through it:
# Every URL needs a scheme (https://), and treat_url expects a JSON body
url_list = ["https://reqres.in/api/users?page=1", "https://reqres.in/api/users?page=2"]
for url in url_list:
    treat_url(url)
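If only the status codes matter, a smaller variant may be enough. This is a sketch, not the answerer's code: the name status_codes and the timeout are assumptions. It also records failures, since a URL without a scheme (for example www.google.com) makes requests raise MissingSchema before any request is sent, so no status code exists for it.

```python
import requests


def status_codes(urls, timeout=10):
    """Map each URL to its HTTP status code, or to the error raised."""
    results = {}
    for url in urls:
        try:
            results[url] = requests.get(url, timeout=timeout).status_code
        except requests.RequestException as exc:
            # Connection failures, bad URLs, timeouts: no status code exists.
            results[url] = exc
    return results
```

Calling status_codes(["https://reqres.in/api/users?page=2"]) would return a dict keyed by URL, which is easy to print or filter afterwards.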
A couple of suggestions: the question itself is not very clear, so a clearer statement of the problem would help everyone answering here :)
Now, from what I was able to understand, there are a few modifications you can make:
response = requests.get(url) gives you a response object whenever the request completes, so check response.status_code and decide from its value whether the request succeeded.
Regarding looping, you can read the last page number from the response JSON as response_json['last_page'] and run a for loop over range(2, last_page + 1), appending the page number to the URI to fetch each page's response.
You can get the JSON directly from the response object with response.json().
Please refer to the requests documentation.
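The paging idea could look roughly like this. It is a sketch under assumptions: the page count is read from the total_pages field that the question's code uses, the page number is passed as a page query parameter, and fetch_all_pages with its injectable get argument is a made-up helper.

```python
import requests


def fetch_all_pages(base_url, get=requests.get):
    """Fetch page 1, read total_pages from it, then fetch the remaining pages."""
    first = get(base_url, params={"page": 1})
    first.raise_for_status()
    payload = first.json()
    pages = [payload]
    # Pages 2..total_pages; the field name is assumed from the question.
    for page in range(2, payload["total_pages"] + 1):
        resp = get(base_url, params={"page": page})
        resp.raise_for_status()
        pages.append(resp.json())
    return pages
```

For the question's API this would be called as fetch_all_pages("https://reqres.in/api/users"). raise_for_status() turns any 4xx or 5xx answer into an exception, so a failed page does not pass silently.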

Request returns hidden characters

I am using requests.get to read a JSON object. The string downloaded is just a URL to download. I try to feed it in using requests.get(), but I get a 404 error. However, when I hardcode the value and run a requests.get(), I get a 200 response. Here is the pseudocode:
response = requests.get(repository, headers=headers, data=data)
pod_map = json.loads(response.text)['locationMap']
for key in pod_map.keys():
    url = pod_map[key]  # url should be something like http://mylink.com
    response = requests.get(url)
    print(response.status_code)
The problem is that when I run the code like this, I get a 404. However, when I just copy/paste the URL into a variable, I get a 200. Is there something I am missing with regard to encoding/decoding the JSON?
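A quick way to check for hidden characters is to look at repr(url) rather than print(url), since print renders whitespace and control characters invisibly. The sketch below uses two made-up helpers, hidden_chars and clean_url, and a sample string that is only an assumption about what the service might return.

```python
def hidden_chars(text):
    """List (index, code point) for every non-printable character in text."""
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(text) if not ch.isprintable()]


def clean_url(raw):
    """Return raw with leading and trailing whitespace removed."""
    return raw.strip()


raw = "http://mylink.com\r\n"   # assumed example of a URL with a trailing CRLF
print(repr(raw))                # 'http://mylink.com\r\n'
print(hidden_chars(raw))        # [(17, 'U+000D'), (18, 'U+000A')]
print(repr(clean_url(raw)))     # 'http://mylink.com'
```

If hidden_chars reports something in the middle of the string, strip() will not remove it and the value needs to be fixed at the source.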

How to get HTML content of 404 error page using python?

I am using python to get HTML data from multiple pages at a URL. I found that urllib throws an exception when a URL does not exist. How do I retrieve the HTML of that custom 404 error page (the page where it says something like "Page is not found.")
Current code:
try:
    req = Request(URL, headers={'User-Agent': 'Mozilla/5.0'})
    client = urlopen(req)
    # downloading html data
    page_html = client.read()
    # closing connection
    client.close()
except:
    print("The following URL was not found. Program terminated.\n" + URL)
    break
Have you tried the requests library?
Just install the library with pip
pip install requests
And use it like this
import requests
response = requests.get('https://stackoverflow.com/nonexistent_path')
print(response.status_code) # 404
print(response.text) # Prints the raw HTML response
To preserve the comment that also answers the question, and also because it's what I was looking for, a way to do this without going outside urllib:
By t.m.adam at Nov 4, 2018 at 10:07:
See HTTPError. It has a .read() method which returns the response content.
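Spelled out, that comment could look something like the sketch below. It stays within urllib; the function name fetch_html and the returned (status, body) pair are my own choices for illustration.

```python
import urllib.error
import urllib.request


def fetch_html(url):
    """Return (status, body) for url, including the body of error pages."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # HTTPError doubles as a response object: read() returns the error page.
        body = err.read()
        err.close()
        return err.code, body
```

With this, a 404 yields (404, b'...custom error page...') instead of ending the loop, and the caller can decide what to do based on the status.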

Response from HTTP request is different between Python and browser

I am testing the Python library requests to see if it is suitable for my work. Here is my sample code for reference:
import requests
url = "http://www.genenetwork.org/webqtl/main.py?cmd=sch&gene=Grin2b&tissue=hip&format=text"
print(url)
print(requests.get(url))
My Output:
http://www.genenetwork.org/webqtl/main.py?cmd=sch&gene=Grin2b&tissue=hip&format=text
<Response [200]>
Output that I get from my browser & my expected result:
What made the differences? How can I get my expected results? I wanted to process the data inside the webpage.
Your code is currently printing the Response object itself, whose representation shows only the status code of your GET request. You can access the requested content via the text attribute of the Response object returned by the get method.
import requests
r = requests.get("http://www.genenetwork.org/webqtl/main.py?cmd=sch&gene=Grin2b&tissue=hip&format=text")
print(r.text)
