urllib2 gives inconsistent output - often part of the webpage is downloaded - python

I am using urllib2 to open and save a webpage. However, often only part of the webpage is downloaded, whereas sometimes the full page is downloaded.
import urllib2
import time
import numpy as np
from itertools import izip
outFiles = ["outFile1.html", "outFile2.html", "outFile3.html", "outFile4.html"]
urls=["http://www.guardian.co.uk/commentisfree/2011/sep/06/cameron-nhs-bill-parliament?commentpage=all",
"http://www.guardian.co.uk/commentisfree/2011/sep/06/tory-scotland-conservative-murdo-fraser?commentpage=all",
"http://www.guardian.co.uk/commentisfree/2011/sep/06/palestine-statehood-united-nations?commentpage=all",
"http://www.guardian.co.uk/commentisfree/2011/sep/05/in-praise-of-understanding-riots?commentpage=all"]
opener = urllib2.build_opener()
user_agent = 'Mozilla/5.0 (Ubuntu; X11; Linux x86_64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1'
opener.addheaders = [('User-agent', user_agent)]
urllib2.install_opener(opener)
for fileName, url in izip(outFiles, urls):
    response = urllib2.urlopen(url)
    responseBody = response.read()
    fp = open(fileName, 'w')
    fp.write(responseBody)
    fp.close()
    time.sleep(np.random.randint(20, 40))
Depending on the run, the output files are of different sizes. Sometimes the files are larger than 200 KB, up to 1 MB, whereas at other times they are around 140 KB. What can lead to this difference?
When the files are smaller, the comment section is missing; however, the file is never incomplete. Sometimes the whole page, including the comments, is downloaded. I checked with curl and still had a similar problem. What I don't understand is what can lead to this inconsistency.
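One way to check whether the response is actually being truncated in transit (as opposed to the server simply sending a shorter page, e.g. one without the comments rendered) is to compare the advertised Content-Length against the number of bytes actually read. A minimal diagnostic sketch, assuming the server sends a Content-Length header (it may not for chunked responses):
import urllib2

def fetch_full(url, attempts=3):
    # Retry if the body appears shorter than the server advertised.
    for _ in range(attempts):
        response = urllib2.urlopen(url)
        expected = response.info().getheader('Content-Length')
        body = response.read()
        if expected is None or len(body) == int(expected):
            return body
    raise IOError('body shorter than Content-Length after %d attempts' % attempts)
If the lengths always match, the server itself is returning different HTML on different requests (for example, including the comments only sometimes), and no amount of client-side retrying will change that.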

Related

How to read a huge directory of HDF5 files directly from a URL to Python?

I have a dataset that is 1.2 TB. It is a directory of several nested folders served under a URL (Li et al. 2021, a synthetic building operation dataset).
No matter how I request the data with urllib, Google Colab crashes. I changed the runtime type as well, and it didn't work. I was wondering whether there are any methods I can use to read this directory without purchasing the Pro version of Google Colab?
I have used two methods to get the data.
import urllib.request
url = 'https://oedi-data-lake.s3-us-west-2.amazonaws.com/building_synthetic_dataset/A_Synthetic_Building_Operation_Dataset.h5'
with urllib.request.urlopen(url) as response:
    html = response.read()
And
import urllib.request
import xarray as xr
import io
url = 'https://oedi-data-lake.s3-us-west-2.amazonaws.com/building_synthetic_dataset/A_Synthetic_Building_Operation_Dataset.h5'
req = urllib.request.Request(url)
with urllib.request.urlopen(req) as resp:
    ds = xr.open_dataset(io.BytesIO(resp.read()))
I suggest using the requests module to stream the content as follows:
import requests
import io
url='https://oedi-data-lake.s3-us-west-2.amazonaws.com/building_synthetic_dataset/A_Synthetic_Building_Operation_Dataset.h5'
with requests.Session() as session:
    r = session.get(url, stream=True)
    r.raise_for_status()
    with open('dataset.hd5', 'wb') as hd5:
        for chunk in r.iter_content(chunk_size=io.DEFAULT_BUFFER_SIZE):
            hd5.write(chunk)
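Once the file is streamed to disk, it can be opened lazily so the data never has to fit in memory at once. A sketch using h5py (assuming h5py is installed and the file fits on local disk; the group and dataset names below are placeholders, since the real layout isn't shown here):
import h5py

with h5py.File('dataset.hd5', 'r') as f:
    print(list(f.keys()))  # inspect the top-level groups
    # dset = f['some_group/some_dataset']  # placeholder path
    # part = dset[:100]  # reads only this slice from disk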

How to download files from website using PHP with Python

I have a Python script that crawls various websites and downloads files from them. My problem is that some of the websites seem to be using PHP; at least that's my theory, since the URLs look like this: https://www.portablefreeware.com/download.php?dd=1159
The problem is that I can't get any file names or endings from a link like this and therefore can't save the file. Currently I'm only saving the URLs.
Is there any way to get to the actual file name behind the link?
This is my stripped down download code:
r = requests.get(url, allow_redirects=True)
file = open("name.something", 'wb')
file.write(r.content)
file.close()
Disclaimer: I've never done any work with PHP, so please forgive any incorrect terminology or understanding I have of it. I'm happy to learn more, though.
import requests
import mimetypes
response = requests.get('https://www.portablefreeware.com/download.php?dd=1159')
content = response.content
content_type = response.headers['Content-Type']
ext = mimetypes.guess_extension(content_type)

print(content)       # [ZipBinary]
print(ext)           # .zip
print(content_type)  # application/zip, application/octet-stream

with open("newFile." + ext, 'wb') as f:
    f.write(content)
With your use of the allow_redirects=True option, requests.get would automatically follow the URL in the Location header of the response to make another request, losing the headers of the first response as a result, which is why you can't find the file name information anywhere.
You should instead use the allow_redirects=False option so that you can read the Location header, which contains the actual download URL:
import requests
url = 'https://www.portablefreeware.com/download.php?dd=1159'
r = requests.get(url, allow_redirects=False)
print(r.headers['Location'])
This outputs:
https://www.diskinternals.com/download/Linux_Reader.exe
Demo: https://replit.com/#blhsing/TrivialLightheartedLists
You can then make another request to the download URL, and use os.path.basename to obtain the name of the file to which the content will be written:
import os

url = r.headers['Location']
r = requests.get(url)
with open(os.path.basename(url), 'wb') as file:
    file.write(r.content)
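Putting the two steps together, a sketch that resolves the redirect first and then streams the download to disk in chunks, so large files are not held in memory (the chunk size is an arbitrary choice):
import os
import requests

url = 'https://www.portablefreeware.com/download.php?dd=1159'

# First request: don't follow the redirect, just read its target.
r = requests.get(url, allow_redirects=False)
r.raise_for_status()
download_url = r.headers['Location']

# Second request: stream the actual file to disk.
resp = requests.get(download_url, stream=True)
resp.raise_for_status()
with open(os.path.basename(download_url), 'wb') as f:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)
resp.close()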
You're using requests for downloading. This doesn't work with downloads of this kind.
Try urllib instead:
import urllib.request
urllib.request.urlretrieve(url, filepath)
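For example, using the direct download URL resolved earlier in this thread (urlretrieve doesn't derive a name itself, so here the file name is taken from the URL):
import os
import urllib.request

url = 'https://www.diskinternals.com/download/Linux_Reader.exe'
urllib.request.urlretrieve(url, os.path.basename(url))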
You can download the file with the file name taken from the response header.
Here's my code for a download with a progress bar and a chunk size buffer:
To display a progress bar, use tqdm (pip install tqdm). Chunked writes are used to save memory while downloading.
import os
import requests
import tqdm
url = "https://www.portablefreeware.com/download.php?dd=1159"
response_header = requests.head(url)
file_path = response_header.headers["Location"]
file_name = os.path.basename(file_path)
with open(file_name, "wb") as file:
    response = requests.get(url, stream=True)
    total_length = int(response.headers.get("content-length"))
    for chunk in tqdm.tqdm(response.iter_content(chunk_size=1024), total=total_length / 1024, unit="KB"):
        if chunk:
            file.write(chunk)
            file.flush()
Progress output:
6%|▌ | 2848/46100.1640625 [00:04<01:11, 606.90KB/s]
Redirects can be bounced via DNS-distributed networks anywhere, so the example answers above show https://www, but in my case they resolve to Europe, and my fastest local source comes in as
https://eu.diskinternals.com/download/Linux_Reader.exe
By far the simplest approach is to try raw curl first, without bothering to resolve anything; if the result is good, there is no need to inspect or scrape:
curl -o 1159.tmp https://www.portablefreeware.com/download.php?dd=1159
However, I know that in this case that is not the expected result, so the next level is:
curl -I https://www.portablefreeware.com/download.php?dd=1159 |find "Location"
and that gives the result shown by others:
https://www.diskinternals.com/download/Linux_Reader.exe
But that's not the full picture, since if we feed that back:
curl.exe -K location.txt
we get
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved here.</p>
</body></html>
hence the nested redirect to
https://eu.diskinternals.com/download/Linux_Reader.exe
All of that can be scripted to run in loops from the command line in a line or two, but I don't use Python, so you will need to write perhaps a dozen lines to do something similar (a rough sketch follows at the end of this answer).
C:\Users\WDAGUtilityAccount\Desktop>curl -O https://eu.diskinternals.com/download/Linux_Reader.exe
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 44.9M 100 44.9M 0 0 3057k 0 0:00:15 0:00:15 --:--:-- 3640k
C:\Users\WDAGUtilityAccount\Desktop>dir /b lin*.*
Linux_Reader.exe
and from the help file, yesterday's extra update (Sunday, September 4, 2022):
curl -O https://eu.diskinternals.com/download/Uneraser_Setup.exe
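For reference, a rough Python equivalent of the curl steps above (a sketch, assuming each hop answers a HEAD request with a Location header; the loop bound is arbitrary protection against redirect loops):
import requests

url = 'https://www.portablefreeware.com/download.php?dd=1159'

# Walk the redirect chain by hand, printing each hop,
# until a non-redirect response is reached.
for _ in range(10):
    r = requests.head(url, allow_redirects=False)
    location = r.headers.get('Location')
    if not location or not (300 <= r.status_code < 400):
        break
    print(location)
    url = location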

Python requests - how to save a gif scraped with requests? [duplicate]

I'm trying to download and save an image from the web using python's requests module.
Here is the (working) code I used:
img = urllib2.urlopen(settings.STATICMAP_URL.format(**data))
with open(path, 'w') as f:
    f.write(img.read())
Here is the new (non-working) code using requests:
r = requests.get(settings.STATICMAP_URL.format(**data))
if r.status_code == 200:
    img = r.raw.read()
    with open(path, 'w') as f:
        f.write(img)
Can you help me with which attribute of the requests response to use?
You can either use the response.raw file object, or iterate over the response.
The response.raw file-like object will not, by default, decode compressed responses (GZIP or deflate). You can force it to decompress for you anyway by setting the decode_content attribute to True (requests sets it to False to control decoding itself). You can then use shutil.copyfileobj() to have Python stream the data to a file object:
import requests
import shutil
r = requests.get(settings.STATICMAP_URL.format(**data), stream=True)
if r.status_code == 200:
    with open(path, 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
To iterate over the response use a loop; iterating like this ensures that data is decompressed by this stage:
r = requests.get(settings.STATICMAP_URL.format(**data), stream=True)
if r.status_code == 200:
    with open(path, 'wb') as f:
        for chunk in r:
            f.write(chunk)
This'll read the data in 128 byte chunks; if you feel another chunk size works better, use the Response.iter_content() method with a custom chunk size:
r = requests.get(settings.STATICMAP_URL.format(**data), stream=True)
if r.status_code == 200:
    with open(path, 'wb') as f:
        for chunk in r.iter_content(1024):
            f.write(chunk)
Note that you need to open the destination file in binary mode to ensure Python doesn't try to translate newlines for you. We also set stream=True so that requests doesn't download the whole image into memory first.
Get a file-like object from the request and copy it to a file. This will also avoid reading the whole thing into memory at once.
import shutil
import requests
url = 'http://example.com/img.png'
response = requests.get(url, stream=True)
with open('img.png', 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response
How about this, a quick solution.
import requests
url = "http://craphound.com/images/1006884_2adf8fc7.jpg"
response = requests.get(url)
if response.status_code == 200:
    with open("/Users/apple/Desktop/sample.jpg", 'wb') as f:
        f.write(response.content)
I have the same need for downloading images using requests. I first tried the answer of Martijn Pieters, and it works well. But when I profiled this simple function, I found that it makes many more function calls than urllib and urllib2.
I then tried the way recommended by the author of the requests module:
import requests
from PIL import Image
# On Python 2.x, use this instead:
# from StringIO import StringIO
# On Python 3.x, binary data needs BytesIO:
from io import BytesIO

r = requests.get('https://example.com/image.jpg')
i = Image.open(BytesIO(r.content))
This greatly reduced the number of function calls, and thus sped up my application.
Here is the code of my profiler and the result.
#!/usr/bin/python
import requests
from StringIO import StringIO
from PIL import Image
import profile

def testRequest():
    image_name = 'test1.jpg'
    url = 'http://example.com/image.jpg'
    r = requests.get(url, stream=True)
    with open(image_name, 'wb') as f:
        for chunk in r.iter_content():
            f.write(chunk)

def testRequest2():
    image_name = 'test2.jpg'
    url = 'http://example.com/image.jpg'
    r = requests.get(url)
    i = Image.open(StringIO(r.content))
    i.save(image_name)

if __name__ == '__main__':
    # The urllib/urllib2 test functions are not shown here.
    profile.run('testRequest()')
    profile.run('testRequest2()')
The result for testRequest:
343080 function calls (343068 primitive calls) in 2.580 seconds
And the result for testRequest2:
3129 function calls (3105 primitive calls) in 0.024 seconds
This might be easier than using requests. This is the only time I'll ever suggest not using requests to do HTTP stuff.
Two-liner using urllib:
>>> import urllib.request
>>> urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
There is also a nice Python module named wget that is pretty easy to use.
This demonstrates the simplicity of the design:
>>> import wget
>>> url = 'http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3'
>>> filename = wget.download(url)
100% [................................................] 3841532 / 3841532
>>> filename
'razorback.mp3'
Enjoy.
Edit: You can also add an out parameter to specify a path.
>>> out_filepath = <output_filepath>
>>> filename = wget.download(url, out=out_filepath)
The following code snippet downloads a file and saves it with the file name contained in the specified URL.
import requests
url = "http://example.com/image.jpg"
filename = url.split("/")[-1]
r = requests.get(url, timeout=0.5)
if r.status_code == 200:
    with open(filename, 'wb') as f:
        f.write(r.content)
There are 2 main ways:
Using .content (simplest/official) (see Zhenyi Zhang's answer):
import io  # Note: io.BytesIO is StringIO.StringIO on Python 2.
import requests
from PIL import Image

r = requests.get('http://lorempixel.com/400/200')
r.raise_for_status()
with io.BytesIO(r.content) as f:
    with Image.open(f) as img:
        img.show()
Using .raw (see Martijn Pieters's answer):
import requests
import PIL.Image

r = requests.get('http://lorempixel.com/400/200', stream=True)
r.raise_for_status()
r.raw.decode_content = True  # Required to decompress gzip/deflate compressed responses.
with PIL.Image.open(r.raw) as img:
    img.show()
r.close()  # Safety when stream=True: ensures the connection is released.
Timing both shows no noticeable difference.
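A rough way to check that claim is to time both code paths; a sketch (the lorempixel URL is the placeholder used above and may no longer resolve, and every call makes a real network request, so the network dominates the measurement):
import timeit
import requests

url = 'http://lorempixel.com/400/200'  # placeholder URL from the snippets above

def with_content():
    requests.get(url).content

def with_raw():
    r = requests.get(url, stream=True)
    r.raw.decode_content = True
    r.raw.read()
    r.close()

print(timeit.timeit(with_content, number=10))
print(timeit.timeit(with_raw, number=10))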
It's as easy as importing Image and requests:
from PIL import Image
import requests
img = Image.open(requests.get(url, stream=True).raw)
img.save('img1.jpg')
This is how I did it
import requests
from PIL import Image
from io import BytesIO
url = 'your_url'
files = {'file': ("C:/Users/shadow/Downloads/black.jpeg", open('C:/Users/shadow/Downloads/black.jpeg', 'rb'),'image/jpg')}
response = requests.post(url, files=files)
img = Image.open(BytesIO(response.content))
img.show()
Here is a more user-friendly answer that still uses streaming.
Just define these functions and call getImage(). It will use the same file name as the url and write to the current directory by default, but both can be changed.
import requests
from StringIO import StringIO
from PIL import Image
def createFilename(url, name, folder):
    dotSplit = url.split('.')
    if name is None:
        # use the same name as in the url
        slashSplit = dotSplit[-2].split('/')
        name = slashSplit[-1]
    ext = dotSplit[-1]
    file = '{}{}.{}'.format(folder, name, ext)
    return file

def getImage(url, name=None, folder='./'):
    file = createFilename(url, name, folder)
    with open(file, 'wb') as f:
        r = requests.get(url, stream=True)
        for block in r.iter_content(1024):
            if not block:
                break
            f.write(block)

def getImageFast(url, name=None, folder='./'):
    file = createFilename(url, name, folder)
    r = requests.get(url)
    i = Image.open(StringIO(r.content))
    i.save(file)

if __name__ == '__main__':
    # Uses less memory
    getImage('http://www.example.com/image.jpg')
    # Faster
    getImageFast('http://www.example.com/image.jpg')
The request guts of getImage() are based on the streaming answer above, and the guts of getImageFast() are based on the StringIO answer above.
I'm going to post an answer as I don't have enough rep to make a comment, but with wget as posted by Blairg23, you can also provide an out parameter for the path.
wget.download(url, out=path)
This is the first result that comes up in a Google search for how to download a binary file with requests. In case you need to download an arbitrary file with requests, you can use:
import requests
url = 'https://s3.amazonaws.com/lab-data-collections/GoogleNews-vectors-negative300.bin.gz'
open('GoogleNews-vectors-negative300.bin.gz', 'wb').write(requests.get(url, allow_redirects=True).content)
My approach was to use response.content (the binary blob) and save it to a file in binary mode:
img_blob = requests.get(url, timeout=5).content
with open(destination + '/' + title, 'wb') as img_file:
    img_file.write(img_blob)
Check out my python project that downloads images from unsplash.com based on keywords.
You can do something like this:
import requests
import random
url = "https://images.pexels.com/photos/1308881/pexels-photo-1308881.jpeg? auto=compress&cs=tinysrgb&dpr=1&w=500"
name=random.randrange(1,1000)
filename=str(name)+".jpg"
response = requests.get(url)
if response.status_code.ok:
with open(filename,'w') as f:
f.write(response.content)
Agree with Blairg23 that using urllib.request.urlretrieve is one of the easiest solutions.
One note I want to point out here. Sometimes it won't download anything because the request was sent via script (bot), and if you want to parse images from Google images or other search engines, you need to pass user-agent to request headers first, and then download the image, otherwise, the request will be blocked and it will throw an error.
Pass user-agent and download image:
opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
urllib.request.install_opener(opener)
urllib.request.urlretrieve(URL, 'image_name.jpg')
Code in the online IDE that scrapes and downloads images from Google images using requests, bs4, urllib.requests.
Alternatively, if your goal is to scrape images from search engines like Google, Bing, Yahoo!, DuckDuckGo (and other search engines), then you can use SerpApi. It's a paid API with a free plan.
The biggest difference is that there's no need to figure out how to bypass blocks from search engines or how to extract certain parts from the HTML or JavaScript since it's already done for the end-user.
Example code to integrate:
import os, json, urllib.request
from serpapi import GoogleSearch

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "pexels cat",
    "tbm": "isch"
}

search = GoogleSearch(params)
results = search.get_dict()

print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))

# download images
for index, image in enumerate(results['images_results']):
    # print(f'Downloading {index} image...')
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582')]
    urllib.request.install_opener(opener)
    # saves the original-resolution image to the SerpApi_Images folder, with the index appended to the file name
    urllib.request.urlretrieve(image['original'], f'SerpApi_Images/original_size_img_{index}.jpg')
Sample output:
'''
[
  ... # other images
  {
    "position": 100, # 100th image
    "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQK62dIkDjNCvEgmGU6GGFZcpVWwX-p3FsYSg&usqp=CAU",
    "source": "homewardboundnj.org",
    "title": "pexels-helena-lopes-1931367 - Homeward Bound Pet Adoption Center",
    "link": "https://homewardboundnj.org/upcoming-event/black-cat-appreciation-day/pexels-helena-lopes-1931367/",
    "original": "https://homewardboundnj.org/wp-content/uploads/2020/07/pexels-helena-lopes-1931367.jpg",
    "is_product": false
  }
]
'''
Disclaimer: I work for SerpApi.
Here is some very simple code:
import requests
response = requests.get("https://i.imgur.com/ExdKOOz.png") ## Making a variable to get image.
file = open("sample_image.png", "wb") ## Creates the file for image
file.write(response.content) ## Saves file content
file.close()
For downloading an image:
import requests

Picture_request = requests.get(url)
with open("picture.jpg", 'wb') as f:  # the file name here is a placeholder
    f.write(Picture_request.content)

downloading a file, not the contents

I am trying to automate downloading a .Z file from a website, but the file I get is 2 KB when it should be around 700 KB, and it contains a listing of the contents of the page (i.e., all the files available for download). I am able to download it manually without a problem. I have tried urllib and urllib2 and different configurations of each, but each does the same thing. I should add that the urlVar and fileName variables are generated in a different part of the code, but I have given an example of each here to demonstrate.
import urllib2
urlVar = "ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/txga1000.14d.Z"
fileName = "txga1000.14d.Z"
downFile = urllib2.urlopen(urlVar)
with open(fileName, "wb") as f:
    f.write(downFile.read())
At least the urllib2 documentation suggests you should use the Request object. This works for me:
import urllib2
req = urllib2.Request("ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/txga1000.14d.Z")
response = urllib2.urlopen(req)
data = response.read()
Data length seems to be 740725.
I was able to download what seems like the correct size for your file with the following python2 code:
import urllib2
filename = "txga1000.14d.Z"
url = "ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/{}".format(filename)
reply = urllib2.urlopen(url)
buf = reply.read()
with open(filename, "wb") as fh:
    fh.write(buf)
Edit: The answer above was posted faster and is much better; I thought I'd post this anyway since I had already tested and written it out.

Python PIL: IOError: cannot identify image file

I am trying to get the image from the following URL:
image_url = 'http://www.eatwell101.com/wp-content/uploads/2012/11/Potato-Pancakes-recipe.jpg?b14316'
When I navigate to it in a browser, it sure looks like an image. But I get an error when I try:
import urllib, cStringIO, PIL
from PIL import Image
img_file = cStringIO.StringIO(urllib.urlopen(image_url).read())
image = Image.open(img_file)
IOError: cannot identify image file
I have copied hundreds of images this way, so I'm not sure what's special here. Can I get this image?
When I open the file:
In [3]: f = urllib.urlopen('http://www.eatwell101.com/wp-content/uploads/2012/11/Potato-Pancakes-recipe.jpg')
In [9]: f.code
Out[9]: 403
This is not returning an image.
You could try specifying a user-agent header to see if you can trick the server into thinking you are a browser.
Using the requests library (because it is easier to send header information):
In [7]: f = requests.get('http://www.eatwell101.com/wp-content/uploads/2012/11/Potato-Pancakes-recipe.jpg', headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0,gzip(gfe)'})
In [8]: f.status_code
Out[8]: 200
The problem is not with the image.
>>> urllib.urlopen(image_url).read()
'\n<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n<html>\n <head>\n <title>403 You are banned from this site. Please contact via a different client configuration if you believe that this is a mistake.</title>\n </head>\n <body>\n <h1>Error 403 You are banned from this site. Please contact via a different client configuration if you believe that this is a mistake.</h1>\n <p>You are banned from this site. Please contact via a different client configuration if you believe that this is a mistake.</p>\n <h3>Guru Meditation:</h3>\n <p>XID: 1806024796</p>\n <hr>\n <p>Varnish cache server</p>\n </body>\n</html>\n'
Using a user-agent header will solve the problem:
import urllib2
import cStringIO
from PIL import Image

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open(image_url)
img_file = cStringIO.StringIO(response.read())
image = Image.open(img_file)
To get the image, you can first save it to disk and then load it into PIL. For example:
import urllib2
from PIL import Image

opener = urllib2.build_opener(urllib2.HTTPRedirectHandler(), urllib2.HTTPCookieProcessor())
image_content = opener.open("http://www.eatwell101.com/wp-content/uploads/2012/11/Potato-Pancakes-recipe.jpg?b14316").read()
opener.close()

save_dir = r"/some/folder/to/save/image.jpg"
with open(save_dir, 'wb') as f:
    f.write(image_content)

image = Image.open(save_dir)
...
